DominikBuchner / BOLDigger2

An even better Python program to query .fasta files against the COI database of www.boldsystems.org
MIT License

No xlsx file written - issue with generating download links for additional data? #19

Closed · naurasd closed this issue 3 months ago

naurasd commented 4 months ago

Hi Dominik,

I'm currently having the issue that no xlsx file is being written for my results.

I'm using boldigger2 v1.0.6 with Python 3.12.1.

I have a fasta file with 1,773 COI sequences. After roughly 4.5 hours, the program reports that download links for additional data are being generated. Then nothing happens for the next 5 days until my job is terminated due to a timeout.

All files attached as txt files:

Fasta file:

COI_cluster_reps_lulu_curated.txt

h5.lz files:

COI_cluster_reps_lulu_curated_top_100_hits.h5.txt
COI_cluster_reps_lulu_curated_download_links.h5.txt

Job error and output files:

digger_tilde_err.txt
digger_tilde_out.txt

Could this be a problem on BOLD's side?

Best,

Nauras

naurasd commented 4 months ago

This may have been fixed with the most recent release?

DominikBuchner commented 4 months ago

May have, yes. Please retry with this file. I'm working on another major update because BOLD is limiting access to almost everything at the moment, but I have to do some intense testing first. API access, for instance, has been limited to 3 requests per minute, which is insane :( I'm actively working on it, so I hope this will be fixed soon. Unfortunately, this affects all BOLDigger versions, not just 2.
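A 3-requests-per-minute cap like this has to be respected client-side, or the server will start rejecting or stalling requests. As an illustration only (this is not BOLDigger2's actual throttling code), a minimal sliding-window rate limiter in Python could look like this:

```python
import time


class RateLimiter:
    """Allow at most `max_calls` calls per rolling `period` seconds."""

    def __init__(self, max_calls, period):
        self.max_calls = max_calls
        self.period = period
        self.calls = []  # monotonic timestamps of recent calls

    def wait(self):
        """Block until another call is allowed, then record it."""
        now = time.monotonic()
        # drop timestamps that have fallen out of the rolling window
        self.calls = [t for t in self.calls if now - t < self.period]
        if len(self.calls) >= self.max_calls:
            # sleep until the oldest call in the window expires
            time.sleep(self.period - (now - self.calls[0]))
        self.calls.append(time.monotonic())


# e.g. for BOLD's 3-requests-per-minute limit:
# limiter = RateLimiter(3, 60.0)
# limiter.wait() before every API request
```

Calling `limiter.wait()` before each request guarantees no more than three requests leave the client in any 60-second window, regardless of how fast the surrounding loop runs.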

naurasd commented 4 months ago

Alright, no problem. I'll try again with the new release 1.3.4 and update you.

DominikBuchner commented 4 months ago

Fixed with 2.0.0.

naurasd commented 3 months ago

Hi Dominik,

This might need to be reopened; I'm not sure it is fixed with 2.0.4.

The end of my .err file looks like this:

Generating download links:  99%|█████████▉| 1482/1500 [24:53:25<05:23, 17.97s/it
Generating download links:  99%|█████████▉| 1482/1500 [24:55:18<05:23, 17.97s/it
Downloading data:  99%|█████████▉| 1482/1500 [24:55:18<05:23, 17.97s/it]
Generating download links:  99%|█████████▉| 1492/1500 [24:55:47<02:14, 16.83s/it
Generating download links:  99%|█████████▉| 1492/1500 [24:57:57<02:14, 16.83s/it
Downloading data:  99%|█████████▉| 1492/1500 [24:57:57<02:14, 16.83s/it]
Generating download links: : 1502it [24:58:23, 59.86s/it]
Downloading additional data:   1%|          | 2/177 [15:20:37<58:27, 20.04s/it][

The end of my .out file looks like this:

22:27:29: Downloaded top 100 hits of all records for ASV6501
22:27:31: Downloaded top 100 hits of all records for ASV6482
22:27:33: Downloaded top 100 hits of all records for ASV6495
22:27:35: Downloaded top 100 hits of all records for ASV6493
22:27:37: Downloaded top 100 hits of all records for ASV6485
22:27:39: Downloaded top 100 hits of all records for ASV6492
22:27:41: Downloaded top 100 hits of all records for ASV6500
22:27:41: All records top 100 records successfully downloaded.
22:27:41: Ordering top 100 hits.
22:27:45: Generating download links for additional data.

So the last action happened last night at 22:27, which was also the last time the .h5.lz file was appended to. Since then (for the past 15 hours) nothing has happened. It's difficult for me to interpret the timestamps in the .err file. The last timestamp of 15:20:37 is essentially the time elapsed since the generation of download links for additional data started last night at 22:27. I'm also not sure what the 2/177 refers to.

Let me know if you need any of the other files to check up on this!

Best, Nauras

naurasd commented 3 months ago

Hi Dominik,

just adding some comments as an update after our exchange earlier today:

  1. This step works on a personal computer with 2.0.7 (python 3.12.5).
  2. However, it seems as if this step cannot be terminated and then continued where it left off; after termination and restart, the step appears to start from the beginning again. I'm not 100% sure about this, though.
  3. The initial problem described above (i.e., that this step does not work at all and the program freezes) may occur on clusters due to firewall issues when requesting the API.

Best, Nauras

naurasd commented 3 months ago

Update for running on personal computer:

A timeout error occurred, but this should be an issue on BOLD's side.

20:14:13: Trying to log in.
20:14:15: Login successful.
20:14:20: Starting to download from the species level database.
20:14:20: Starting to download from the all records database.
20:14:20: Performing second login for requesting links from the all records database.
20:14:20: Trying to log in.
20:14:22: Login successful.
20:14:24: Starting to gather download links from the all records database.
20:14:25: All records top 100 records successfully downloaded.
20:14:25: Ordering top 100 hits.
20:14:27: Generating download links for additional data.
Downloading additional data:  25%|███████████▋                                   | 199/803 [1:06:38<3:22:14, 20.09s/it]
Traceback (most recent call last):
  File "C:\Users\Nauras\AppData\Local\Programs\Python\Python312\Lib\site-packages\urllib3\response.py", line 444, in _error_catcher
    yield
  File "C:\Users\Nauras\AppData\Local\Programs\Python\Python312\Lib\site-packages\urllib3\response.py", line 831, in read_chunked
    chunk = self._handle_chunk(amt)
            ^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Nauras\AppData\Local\Programs\Python\Python312\Lib\site-packages\urllib3\response.py", line 784, in _handle_chunk
    returned_chunk = self._fp._safe_read(self.chunk_left)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Nauras\AppData\Local\Programs\Python\Python312\Lib\http\client.py", line 640, in _safe_read
    data = self.fp.read(amt)
           ^^^^^^^^^^^^^^^^^
  File "C:\Users\Nauras\AppData\Local\Programs\Python\Python312\Lib\socket.py", line 720, in readinto
    return self._sock.recv_into(b)
           ^^^^^^^^^^^^^^^^^^^^^^^
TimeoutError: timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\Nauras\AppData\Local\Programs\Python\Python312\Lib\site-packages\requests\models.py", line 820, in generate
    yield from self.raw.stream(chunk_size, decode_content=True)
  File "C:\Users\Nauras\AppData\Local\Programs\Python\Python312\Lib\site-packages\urllib3\response.py", line 624, in stream
    for line in self.read_chunked(amt, decode_content=decode_content):
  File "C:\Users\Nauras\AppData\Local\Programs\Python\Python312\Lib\site-packages\urllib3\response.py", line 816, in read_chunked
    with self._error_catcher():
  File "C:\Users\Nauras\AppData\Local\Programs\Python\Python312\Lib\contextlib.py", line 158, in __exit__
    self.gen.throw(value)
  File "C:\Users\Nauras\AppData\Local\Programs\Python\Python312\Lib\site-packages\urllib3\response.py", line 449, in _error_catcher
    raise ReadTimeoutError(self._pool, None, "Read timed out.")
urllib3.exceptions.ReadTimeoutError: HTTPConnectionPool(host='www.boldsystems.org', port=80): Read timed out.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "C:\Users\Nauras\AppData\Local\Programs\Python\Python312\Scripts\boldigger2.exe\__main__.py", line 7, in <module>
  File "C:\Users\Nauras\AppData\Local\Programs\Python\Python312\Lib\site-packages\boldigger2\__main__.py", line 88, in main
    id_engine_coi.main(
  File "C:\Users\Nauras\AppData\Local\Programs\Python\Python312\Lib\site-packages\boldigger2\id_engine_coi.py", line 673, in main
    additional_data_download.main(fasta_path, hdf_name_top_100_hits, read_fasta)
  File "C:\Users\Nauras\AppData\Local\Programs\Python\Python312\Lib\site-packages\boldigger2\additional_data_download.py", line 342, in main
    additional_data = asyncio.run(
                      ^^^^^^^^^^^^
  File "C:\Users\Nauras\AppData\Local\Programs\Python\Python312\Lib\asyncio\runners.py", line 194, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "C:\Users\Nauras\AppData\Local\Programs\Python\Python312\Lib\asyncio\runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Nauras\AppData\Local\Programs\Python\Python312\Lib\asyncio\base_events.py", line 687, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "C:\Users\Nauras\AppData\Local\Programs\Python\Python312\Lib\site-packages\boldigger2\additional_data_download.py", line 225, in as_session
    return await tqdm_asyncio.gather(*tasks, desc="Downloading additional data")
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Nauras\AppData\Local\Programs\Python\Python312\Lib\site-packages\tqdm\asyncio.py", line 79, in gather
    res = [await f for f in cls.as_completed(ifs, loop=loop, timeout=timeout,
           ^^^^^^^
  File "C:\Users\Nauras\AppData\Local\Programs\Python\Python312\Lib\asyncio\tasks.py", line 631, in _wait_for_one
    return f.result()  # May raise f.exception().
           ^^^^^^^^^^
  File "C:\Users\Nauras\AppData\Local\Programs\Python\Python312\Lib\site-packages\tqdm\asyncio.py", line 76, in wrap_awaitable
    return i, await f
              ^^^^^^^
  File "C:\Users\Nauras\AppData\Local\Programs\Python\Python312\Lib\site-packages\boldigger2\additional_data_download.py", line 201, in limit_concurrency
    return await as_request(url, as_session)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Nauras\AppData\Local\Programs\Python\Python312\Lib\site-packages\boldigger2\additional_data_download.py", line 178, in as_request
    response = await as_session.get(url, timeout=60)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Nauras\AppData\Local\Programs\Python\Python312\Lib\concurrent\futures\thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Nauras\AppData\Local\Programs\Python\Python312\Lib\site-packages\requests\sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Nauras\AppData\Local\Programs\Python\Python312\Lib\site-packages\requests\sessions.py", line 746, in send
    r.content
  File "C:\Users\Nauras\AppData\Local\Programs\Python\Python312\Lib\site-packages\requests\models.py", line 902, in content
    self._content = b"".join(self.iter_content(CONTENT_CHUNK_SIZE)) or b""
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Nauras\AppData\Local\Programs\Python\Python312\Lib\site-packages\requests\models.py", line 826, in generate
    raise ConnectionError(e)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='www.boldsystems.org', port=80): Read timed out.
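The traceback ends in a `requests.exceptions.ConnectionError` wrapping a read timeout, i.e. the server simply stopped sending data mid-response. A generic way to make such a download step survive transient timeouts — a sketch of the pattern, not the actual `boldigger2` code; the function name and parameters are mine — is to wrap the GET in a retry loop with exponential backoff:

```python
import time

import requests


def get_with_retries(url, session=None, timeout=60, max_retries=5, backoff=2.0):
    """GET a URL, retrying on read timeouts and dropped connections.

    Sleeps backoff * 2**attempt seconds between attempts and re-raises
    after the final failure so callers still see the underlying error.
    """
    session = session or requests.Session()
    for attempt in range(max_retries):
        try:
            return session.get(url, timeout=timeout)
        except (requests.exceptions.ConnectionError,
                requests.exceptions.Timeout):
            if attempt == max_retries - 1:
                raise  # retries exhausted; surface the error
            time.sleep(backoff * (2 ** attempt))
```

Because `session` is injectable, the retry behaviour is easy to test without touching the network, and the same session (with its login cookies) can be reused across retries.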
DominikBuchner commented 3 months ago

I'll write a fix for that, but it may take until September. I have to finish updating my metabarcoding pipeline first.

DominikBuchner commented 3 months ago

Fixed with 2.1.0

naurasd commented 3 months ago

Sorry, it still occurs with 2.1.0.

.err file:

Downloading additional data:  42%|████▏     | 1927/4539 [1:57:35<2:39:24,  3.66s/it]
Traceback (most recent call last):
  File "/home/naurasd/.local/lib/python3.12/site-packages/urllib3/response.py", line 444, in _error_catcher
    yield
  File "/home/naurasd/.local/lib/python3.12/site-packages/urllib3/response.py", line 828, in read_chunked
    self._update_chunk_length()
  File "/home/naurasd/.local/lib/python3.12/site-packages/urllib3/response.py", line 758, in _update_chunk_length
    line = self._fp.fp.readline()
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/sw/comp/python/3.12.1/rackham/lib/python3.12/socket.py", line 707, in readinto
    return self._sock.recv_into(b)
           ^^^^^^^^^^^^^^^^^^^^^^^
TimeoutError: timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/naurasd/.local/lib/python3.12/site-packages/requests/models.py", line 820, in generate
    yield from self.raw.stream(chunk_size, decode_content=True)
  File "/home/naurasd/.local/lib/python3.12/site-packages/urllib3/response.py", line 624, in stream
    for line in self.read_chunked(amt, decode_content=decode_content):
  File "/home/naurasd/.local/lib/python3.12/site-packages/urllib3/response.py", line 816, in read_chunked
    with self._error_catcher():
  File "/sw/comp/python/3.12.1/rackham/lib/python3.12/contextlib.py", line 158, in __exit__
    self.gen.throw(value)
  File "/home/naurasd/.local/lib/python3.12/site-packages/urllib3/response.py", line 449, in _error_catcher
    raise ReadTimeoutError(self._pool, None, "Read timed out.")
urllib3.exceptions.ReadTimeoutError: HTTPConnectionPool(host='8.219.97.248', port=80): Read timed out.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/naurasd/.local/bin/boldigger2", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/naurasd/.local/lib/python3.12/site-packages/boldigger2/__main__.py", line 88, in main
    id_engine_coi.main(
  File "/home/naurasd/.local/lib/python3.12/site-packages/boldigger2/id_engine_coi.py", line 673, in main
    additional_data_download.main(fasta_path, hdf_name_top_100_hits, read_fasta)
  File "/home/naurasd/.local/lib/python3.12/site-packages/boldigger2/additional_data_download.py", line 431, in main
    download_data(process_ids_to_download, hdf_name_top_100_hits)
  File "/home/naurasd/.local/lib/python3.12/site-packages/boldigger2/additional_data_download.py", line 285, in download_data
    response = session.get(
               ^^^^^^^^^^^^
  File "/home/naurasd/.local/lib/python3.12/site-packages/requests/sessions.py", line 602, in get
    return self.request("GET", url, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/naurasd/.local/lib/python3.12/site-packages/requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/naurasd/.local/lib/python3.12/site-packages/requests/sessions.py", line 746, in send
    r.content
  File "/home/naurasd/.local/lib/python3.12/site-packages/requests/models.py", line 902, in content
    self._content = b"".join(self.iter_content(CONTENT_CHUNK_SIZE)) or b""
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/naurasd/.local/lib/python3.12/site-packages/requests/models.py", line 826, in generate
    raise ConnectionError(e)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='8.219.97.248', port=80): Read timed out.

End of .out file:

12:23:05: API overloaded. Switching proxy.
12:23:43: Proxy set to http://35.185.196.38:3128.
12:23:44: API overloaded. Switching proxy.
12:23:49: Proxy set to http://35.185.196.38:3128.
12:23:49: API overloaded. Switching proxy.
12:24:28: Proxy set to http://69.197.135.43:18080.
12:24:28: API overloaded. Switching proxy.
12:25:08: Proxy set to http://8.219.97.248:80.
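The .out log shows the program rotating through proxies whenever the API reports overload. A minimal sketch of that pattern (the function name, the injectable `fetch` hook, and the proxy handling are illustrative, not BOLDigger2's actual implementation):

```python
import requests


def get_via_proxies(url, proxies, fetch=requests.get, timeout=60):
    """Try each proxy in turn; return the first successful response.

    `fetch` defaults to requests.get but is injectable for testing.
    Raises the last proxy's exception if every proxy fails.
    """
    last_exc = None
    for proxy in proxies:
        try:
            # route both http and https traffic through the current proxy
            return fetch(url, proxies={"http": proxy, "https": proxy},
                         timeout=timeout)
        except requests.exceptions.RequestException as exc:
            last_exc = exc  # proxy failed or upstream timed out; try the next
    raise last_exc if last_exc else RuntimeError("no proxies supplied")
```

One caveat visible in the log above: when the final proxy in the rotation also stalls (here `8.219.97.248:80`), the whole run dies with the last proxy's timeout, which is exactly the failure mode reported in this comment.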
DominikBuchner commented 3 months ago

The ConnectionError should be handled... I'll look into this again; maybe the ReadTimeout needs handling as well.

DominikBuchner commented 3 months ago

Updated to 2.1.3. Now the ReadTimeout is also handled correctly. Please note that you can now restart the additional data download and it will continue where it left off.
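Resuming "where it left off" implies persisting a record of completed downloads between runs. As a simplified sketch of the idea — BOLDigger2 itself tracks progress in its .h5.lz files; the JSON checkpoint, function name, and `fetch` callback here are illustrative only:

```python
import json
from pathlib import Path


def download_missing(process_ids, checkpoint_path, fetch):
    """Fetch each record at most once across runs.

    The set of completed IDs is persisted to a small JSON checkpoint
    after every record, so a killed job can be restarted and will skip
    whatever was already downloaded.
    """
    path = Path(checkpoint_path)
    done = set(json.loads(path.read_text())) if path.exists() else set()
    for pid in process_ids:
        if pid in done:
            continue  # downloaded in a previous run
        fetch(pid)  # perform the actual download for this process ID
        done.add(pid)
        path.write_text(json.dumps(sorted(done)))  # checkpoint immediately
    return done
```

Checkpointing after every record (rather than once at the end) is what makes the resume safe against the hard job terminations described earlier in this thread.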

naurasd commented 3 months ago

Yes, I saw that. So cool that this is possible now!

thanks so much!