Closed: brendanwee closed this issue 3 years ago
@brendanwee thanks for your report. We did see increased errors in recent days and are currently investigating the cause.
@brendanwee may I ask what's the approximate number of gene_ids being queried per second?
@brendanwee meanwhile, we have made slight changes to the server cluster to better buffer request bursts, so please consider giving it another try. I would recommend gradually increasing the number of parallel jobs to avoid any throttling effects.
@namespacestd0 I was using the Python client's MyGeneInfo.querymany()
method in each job. There were probably around 100-200 jobs running at once. It seems that querymany
batches queries 1000 at a time. Each job queries about 52,000 gene ids and finishes in about 2 minutes. So roughly 430 ids/s * 200 jobs ~= 86,000 queries per second.
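A back-of-the-envelope check of those numbers (just a sketch; the constants below are the approximate figures quoted above, not measured values):

```python
# Rough load estimate from the figures in this comment.
GENES_PER_JOB = 52_000   # gene ids queried by one job
JOB_RUNTIME_S = 120      # each job finishes in about 2 minutes
BATCH_SIZE = 1_000       # querymany submits ids in batches of 1000
PARALLEL_JOBS = 200      # upper end of concurrent jobs

ids_per_second_per_job = GENES_PER_JOB / JOB_RUNTIME_S          # ~433 ids/s
total_ids_per_second = ids_per_second_per_job * PARALLEL_JOBS   # ~86,000 ids/s
post_requests_per_second = total_ids_per_second / BATCH_SIZE    # ~87 HTTP POSTs/s

print(round(ids_per_second_per_job),
      round(total_ids_per_second),
      round(post_requests_per_second))
```

So even though batching keeps the HTTP request count modest, the sustained id throughput hitting the server is on the order of 86k ids/s.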
Sounds good, I will try running the jobs again and see how it turns out.
@namespacestd0 The exact same error occurred, and this time more frequently: ~45% of our jobs failed with this error. Did you apply some kind of fix? Can you confirm whether this is throttling implemented in the code or something else?
Thanks for the update. Yeah, that level of sustained traffic is well beyond our server capacity, and beyond the speed of our scaling architecture. I also did see a throttling effect on our server side earlier, so I assume you could not complete all the jobs. We are definitely open to implementing a task queue system in the future, but for now I recommend slowing down the request rate.
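One minimal way to slow down and absorb transient failures on the client side is to wrap each querymany call in a retry loop with exponential backoff. This is a sketch, not part of biothings_client; `call_with_backoff` is a hypothetical helper name:

```python
import random
import time

def call_with_backoff(fn, *args, max_retries=5, base_delay=1.0, **kwargs):
    """Call fn(*args, **kwargs), retrying with exponential backoff on failure.

    Hypothetical helper (not part of biothings_client). Delays grow as
    base_delay * 2**attempt, with random jitter so parallel jobs do not
    retry in lockstep and re-create the same traffic burst.
    """
    for attempt in range(max_retries):
        try:
            return fn(*args, **kwargs)
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the original error
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))

# Usage with the real client (network call, shown for illustration only):
# import mygene
# mg = mygene.MyGeneInfo()
# hits = call_with_backoff(mg.querymany, gene_ids,
#                          scopes="ensembl.gene", fields="symbol")
```

Combining this with fewer concurrent jobs should bring the aggregate request rate down to something the service can absorb.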
Ok, thank you for your quick replies. I appreciate you looking into this.
This error is now correctly identified as a 5xx error: https://github.com/biothings/biothings.api/blob/32fad3510023d80700e552c80838c9ac775c3b00/biothings/web/pipeline/execute.py#L69 Additional capacity optimizations will follow.
Hello,
We have an RNAseq analysis pipeline hosted on AWS where we pump hundreds of RNAseq samples through an alignment + gene-counting pipeline. At the end of this pipeline we use MyGene to generate gene symbols for all the genes that have counts. This worked well during testing, but once deployed to production I started multiple runs, each containing hundreds of samples running at the same time. This likely totaled millions of gene_ids being queried through mygene, resulting in the following error:
```
gene_symbol_queries = mg.querymany(stats_df["Geneid"], "ensembl.gene", fields="symbol", returnall=False, as_dataframe=True)
  File "/usr/local/lib/python3.6/site-packages/biothings_client/base.py", line 542, in _querymany
    for hits in self._repeated_query(query_fn, qterms, verbose=verbose):
  File "/usr/local/lib/python3.6/site-packages/biothings_client/base.py", line 223, in _repeated_query
    from_cache, query_result = query_fn(batch, **fn_kwargs)
  File "/usr/local/lib/python3.6/site-packages/biothings_client/base.py", line 541, in query_fn
    def query_fn(qterms): return self._querymany_inner(qterms, verbose=verbose, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/biothings_client/base.py", line 488, in _querymany_inner
    return self._post(_url, params=_kwargs, verbose=verbose)
  File "/usr/local/lib/python3.6/site-packages/biothings_client/base.py", line 176, in _post
    res.raise_for_status()
  File "/usr/local/lib/python3.6/site-packages/requests/models.py", line 941, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 400 Client Error: search_phase_execution_exception for url: http://mygene.info/v3/query/
```
This error is raised for about 35% of all our jobs (1108 succeeded, 605 failed). Is the high traffic causing this error? Do you have any advice for a way to get around this issue?