DominikBuchner / BOLDigger-commandline

BOLDigger as a commandline tool
MIT License

API verification - Max entries exceeded #9

Closed OndroV closed 1 year ago

OndroV commented 1 year ago

Hi Dominik,

it seems there is a limit to how many sequences API verification can handle at once. I'm getting the error below with a FASTA file of only 349 COI sequences. Have you experienced this? I tried manually splitting the input file in half and it worked, but I guess that's not feasible for larger datasets. Do we need to run the verification in batches?

Best, Ondrej

`11:27:08: Starting API verification. 11:27:08: Collection OTUs without species level identification and high similarity. 11:27:08: Starting to query the API. Calling API: 84%|████████████████████████████████████████████████████████▏ | 31/37 [03:45<00:43, 7.26s/it] joblib.externals.loky.process_executor._RemoteTraceback: """ Traceback (most recent call last): File "/home/ono/miniconda3/lib/python3.8/site-packages/urllib3/connection.py", line 159, in _new_conn conn = connection.create_connection( File "/home/ono/miniconda3/lib/python3.8/site-packages/urllib3/util/connection.py", line 84, in create_connection raise err File "/home/ono/miniconda3/lib/python3.8/site-packages/urllib3/util/connection.py", line 74, in create_connection sock.connect(sa) TimeoutError: [Errno 110] Connection timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last): File "/home/ono/miniconda3/lib/python3.8/site-packages/urllib3/connectionpool.py", line 670, in urlopen httplib_response = self._make_request( File "/home/ono/miniconda3/lib/python3.8/site-packages/urllib3/connectionpool.py", line 392, in _make_request conn.request(method, url, **httplib_request_kw) File "/home/ono/miniconda3/lib/python3.8/http/client.py", line 1255, in request self._send_request(method, url, body, headers, encode_chunked) File "/home/ono/miniconda3/lib/python3.8/http/client.py", line 1301, in _send_request self.endheaders(body, encode_chunked=encode_chunked) File "/home/ono/miniconda3/lib/python3.8/http/client.py", line 1250, in endheaders self._send_output(message_body, encode_chunked=encode_chunked) File "/home/ono/miniconda3/lib/python3.8/http/client.py", line 1010, in _send_output self.send(msg) File "/home/ono/miniconda3/lib/python3.8/http/client.py", line 950, in send self.connect() File "/home/ono/miniconda3/lib/python3.8/site-packages/urllib3/connection.py", line 187, in connect conn = self._new_conn() File "/home/ono/miniconda3/lib/python3.8/site-packages/urllib3/connection.py", line 171, in _new_conn raise NewConnectionError( urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7f0e960dcdf0>: Failed to establish a new connection: [Errno 110] Connection timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last): File "/home/ono/miniconda3/lib/python3.8/site-packages/requests/adapters.py", line 440, in send resp = conn.urlopen( File "/home/ono/miniconda3/lib/python3.8/site-packages/urllib3/connectionpool.py", line 726, in urlopen retries = retries.increment( File "/home/ono/miniconda3/lib/python3.8/site-packages/urllib3/util/retry.py", line 446, in increment raise MaxRetryError(_pool, url, error or ResponseError(cause)) urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='boldsystems.org', port=80): Max retries exceeded with url: /index.php/Ids_xml?db=COX1_SPECIES_PUBLIC&sequence=ACTTAATAATATAAGATTTTGATTATTACCACCCTCGATTATATTACTTATAATAAGTTCCATAGTTGAATTAGGGGCAGGAACAGGTTGAACTGTTTATCCCCCTCTATCAAGAAATATCGCTCATGCAGGACCAAGAGTTGATATAGCAATCTTCTCATTACATTTAGCTGGAATTTCTTCAATTCTAGGCGCCGTAAACTTTATTACAACTGTAATAAATATACGACCAACAGGAATAAGTATA (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f0e960dcdf0>: Failed to establish a new connection: [Errno 110] Connection timed out'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last): File "/home/ono/miniconda3/lib/python3.8/site-packages/joblib/externals/loky/process_executor.py", line 436, in _process_worker r = call_item() File "/home/ono/miniconda3/lib/python3.8/site-packages/joblib/externals/loky/process_executor.py", line 288, in call return self.fn(*self.args, self.kwargs) File "/home/ono/miniconda3/lib/python3.8/site-packages/joblib/_parallel_backends.py", line 595, in call return self.func(*args, *kwargs) File "/home/ono/miniconda3/lib/python3.8/site-packages/joblib/parallel.py", line 262, in call return [func(args, kwargs) File "/home/ono/miniconda3/lib/python3.8/site-packages/joblib/parallel.py", line 262, in return [func(*args, kwargs) File "/home/ono/miniconda3/lib/python3.8/site-packages/boldigger/api_verification.py", line 49, in request r = session.get('http://boldsystems.org/index.php/Ids_xml?db=COX1_SPECIES_PUBLIC&sequence={}'.format(item[1])) File "/home/ono/miniconda3/lib/python3.8/site-packages/requests/sessions.py", line 542, in get return self.request('GET', url, kwargs) File "/home/ono/miniconda3/lib/python3.8/site-packages/requests/sessions.py", line 529, in request resp = self.send(prep, send_kwargs) File "/home/ono/miniconda3/lib/python3.8/site-packages/requests/sessions.py", line 645, in send r = adapter.send(request, kwargs) File "/home/ono/miniconda3/lib/python3.8/site-packages/requests/adapters.py", line 519, in send raise ConnectionError(e, request=request) requests.exceptions.ConnectionError: HTTPConnectionPool(host='boldsystems.org', port=80): Max retries exceeded with url: /index.php/Ids_xml?db=COX1_SPECIES_PUBLIC&sequence=ACTTAATAATATAAGATTTTGATTATTACCACCCTCGATTATATTACTTATAATAAGTTCCATAGTTGAATTAGGGGCAGGAACAGGTTGAACTGTTTATCCCCCTCTATCAAGAAATATCGCTCATGCAGGACCAAGAGTTGATATAGCAATCTTCTCATTACATTTAGCTGGAATTTCTTCAATTCTAGGCGCCGTAAACTTTATTACAACTGTAATAAATATACGACCAACAGGAATAAGTATA (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f0e960dcdf0>: Failed to establish 
a new connection: [Errno 110] Connection timed out')) """

The above exception was the direct cause of the following exception:

Traceback (most recent call last): File "/home/ono/miniconda3/lib/python3.8/runpy.py", line 194, in _run_module_as_main return _run_code(code, main_globals, None, File "/home/ono/miniconda3/lib/python3.8/runpy.py", line 87, in _run_code exec(code, run_globals) File "/home/ono/miniconda3/lib/python3.8/site-packages/boldigger_cline/main.py", line 77, in main() File "/home/ono/miniconda3/lib/python3.8/site-packages/boldigger_cline/main.py", line 73, in main api_verification.main(args.xlsx_path, args.fasta_path) File "/home/ono/miniconda3/lib/python3.8/site-packages/boldigger_cline/api_verification.py", line 25, in main result = Parallel(n_jobs = psutil.cpu_count())(delayed(request)(item, session) for item in list(seq_dict.items())) File "/home/ono/miniconda3/lib/python3.8/site-packages/joblib/parallel.py", line 1056, in call self.retrieve() File "/home/ono/miniconda3/lib/python3.8/site-packages/joblib/parallel.py", line 935, in retrieve self._output.extend(job.get(timeout=self.timeout)) File "/home/ono/miniconda3/lib/python3.8/site-packages/joblib/_parallel_backends.py", line 542, in wrap_future_result return future.result(timeout=timeout) File "/home/ono/miniconda3/lib/python3.8/concurrent/futures/_base.py", line 439, in result return self.get_result() File "/home/ono/miniconda3/lib/python3.8/concurrent/futures/_base.py", line 388, in get_result raise self._exception requests.exceptions.ConnectionError: None: Max retries exceeded with url: /index.php/Ids_xml?db=COX1_SPECIES_PUBLIC&sequence=ACTTAATAATATAAGATTTTGATTATTACCACCCTCGATTATATTACTTATAATAAGTTCCATAGTTGAATTAGGGGCAGGAACAGGTTGAACTGTTTATCCCCCTCTATCAAGAAATATCGCTCATGCAGGACCAAGAGTTGATATAGCAATCTTCTCATTACATTTAGCTGGAATTTCTTCAATTCTAGGCGCCGTAAACTTTATTACAACTGTAATAAATATACGACCAACAGGAATAAGTATA (Caused by None)`

OndroV commented 1 year ago

Hm, perhaps it has nothing to do with the number of sequences: I just tried 5 other FASTA files containing only 100 sequences each and all of them hit the error, while a 6th file containing 134 sequences worked.

Interestingly, the example from yesterday reached 31/37 iterations before stopping with the error. Then I split the file and the halves reached 11/11 and 27/27 without error. Today I split a different set of 634 sequences into files of 100 before running ie_coi. For 3 of the files, API verification went smoothly up to a certain iteration (5/6, 6/14, 16/18), then hung for quite a while and suddenly jumped to 100% with the error. For two files the error only appeared at 19/19 and 14/14, and the 6th file reached 9/9 without error.

So I wonder whether some specific entries are causing the error. But why would simply splitting the file in half have solved the problem yesterday?

OndroV commented 1 year ago

Ah, now I see that re-running the same files makes the job stop at different iterations than before, and in a few lucky cases it finishes successfully. Not a real issue then, sorry for the mess :) So perhaps it's a problem on the BOLD side and we just need to retry until it finally works?
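Retrying until it works can be sketched in Python as below. This is only an illustration, not BOLDigger code: `retry_call` and its parameters are hypothetical names, and the flaky function simulates a timeout instead of contacting BOLD. Catching `OSError` by default is an assumption that fits this traceback, since the built-in `TimeoutError` and the exceptions raised by `requests` both derive from it.

```python
import random
import time


def retry_call(func, attempts=5, base_delay=1.0, retry_on=(OSError,), sleep=time.sleep):
    """Call func() until it succeeds or `attempts` tries are used up.

    Hypothetical helper for illustration; not part of BOLDigger.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            # Exponential backoff (1x, 2x, 4x, ...) plus a little jitter so
            # parallel workers do not all retry at the same moment.
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay / 10))


# Simulated flaky request: fails twice with a timeout, then succeeds.
calls = {"n": 0}


def flaky_request():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("simulated connection timeout")
    return "ok"


# Pass a no-op sleep so the demo finishes instantly.
result = retry_call(flaky_request, sleep=lambda seconds: None)
print(result, calls["n"])
```

In a real run the default `time.sleep` would be kept, so that each retry gives the server some time to recover before the next request.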

In that case, a nice improvement in future BOLDigger versions could be to restart the process until it succeeds (and maybe to split the work into batches, so it proceeds step by step rather than losing all progress in a crash). As a quick fix, I added a while loop to the command I use for BOLDigger on Ubuntu. It compares md5sums and re-runs api_verification until the results file changes: `for f in my.fasta; do f=$(echo "$f" | cut -d. -f1); python -m boldigger_cline ie_coi username password "${f}.fasta" . 5; python -m boldigger_cline add_metadata "BOLDResults_${f}_part_1.xlsx"; python -m boldigger_cline digger_hit "BOLDResults_${f}_part_1.xlsx"; m=$(md5sum "BOLDResults_${f}_part_1.xlsx"); while [ "$m" = "$(md5sum "BOLDResults_${f}_part_1.xlsx")" ]; do python -m boldigger_cline api_verification "BOLDResults_${f}_part_1.xlsx" "${f}_done.fasta"; done; done`
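The batching idea could look roughly like the sketch below. It is an assumption about how such a feature might work, not how BOLDigger is implemented: `batches`, `verify_in_batches`, the JSON checkpoint file and the `query` callback are all hypothetical. Results are saved after each completed batch, so a crash only loses the batch in progress and a re-run skips sequences that are already done.

```python
import json
from pathlib import Path


def batches(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def verify_in_batches(seq_dict, query, checkpoint, batch_size=50):
    """Run query(sequence) for every entry, saving progress after each batch.

    Hypothetical helper: `query` stands in for the real API request and must
    return something JSON-serializable; `checkpoint` is a path to a JSON file.
    """
    path = Path(checkpoint)
    # Load results from an earlier, interrupted run if there are any.
    done = json.loads(path.read_text()) if path.exists() else {}
    pending = [(name, seq) for name, seq in seq_dict.items() if name not in done]
    for batch in batches(pending, batch_size):
        for name, seq in batch:
            done[name] = query(seq)  # may raise, e.g. on a connection timeout
        # Only completed batches are written, so the file is always consistent.
        path.write_text(json.dumps(done))
    return done
```

Calling `verify_in_batches` again with the same checkpoint path after a failure would continue with the remaining sequences only, which is the step-by-step behaviour suggested above.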