DominikBuchner / BOLDigger2

An even better Python program to query .fasta files against the COI database of www.boldsystems.org
MIT License

in _parse_tables: ValueError: No tables found #15

Closed naurasd closed 2 months ago

naurasd commented 3 months ago

Hi Dominik,

having a bit of trouble classifying ASVs in the following file: COI_cluster_reps_lulu_curated.txt

Boldigger2 is running fine for a while until this error happens (shown are the last lines of the error output file, including the entire error part):

Downloading top 100 hits:  77%|███████▋  | 1321/1723 [5:10:45<3:08:34, 37.16s/it]
Downloading top 100 hits:  77%|███████▋  | 1326/1723 [5:11:37<1:27:54, 13.29s/it]
Downloading top 100 hits:  77%|███████▋  | 1327/1723 [5:12:01<1:13:22, 11.12s/it]
Downloading top 100 hits:  77%|███████▋  | 1329/1723 [5:12:21<1:32:36, 14.10s/it]
Traceback (most recent call last):
  File "/home/naurasd/.local/bin/boldigger2", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/naurasd/.local/lib/python3.12/site-packages/boldigger2/__main__.py", line 87, in main
    id_engine_coi.main(
  File "/home/naurasd/.local/lib/python3.12/site-packages/boldigger2/id_engine_coi.py", line 544, in main
    asyncio.run(
  File "/sw/comp/python3/3.12.1/rackham/lib/python3.12/asyncio/runners.py", line 194, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/sw/comp/python3/3.12.1/rackham/lib/python3.12/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sw/comp/python3/3.12.1/rackham/lib/python3.12/asyncio/base_events.py", line 684, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "/home/naurasd/.local/lib/python3.12/site-packages/boldigger2/id_engine_coi.py", line 385, in as_session
    return await tqdm_asyncio.gather(*tasks, desc="Downloading top 100 hits")
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/naurasd/.local/lib/python3.12/site-packages/tqdm/asyncio.py", line 79, in gather
    res = [await f for f in cls.as_completed(ifs, loop=loop, timeout=timeout,
           ^^^^^^^
  File "/sw/comp/python3/3.12.1/rackham/lib/python3.12/asyncio/tasks.py", line 631, in _wait_for_one
    return f.result()  # May raise f.exception().
           ^^^^^^^^^^
  File "/home/naurasd/.local/lib/python3.12/site-packages/tqdm/asyncio.py", line 76, in wrap_awaitable
    return i, await f
              ^^^^^^^
  File "/home/naurasd/.local/lib/python3.12/site-packages/boldigger2/id_engine_coi.py", line 351, in limit_concurrency
    return await as_request(
           ^^^^^^^^^^^^^^^^^
  File "/home/naurasd/.local/lib/python3.12/site-packages/boldigger2/id_engine_coi.py", line 226, in as_request
    response_table = pd.read_html(
                     ^^^^^^^^^^^^^
  File "/home/naurasd/.local/lib/python3.12/site-packages/pandas/io/html.py", line 1240, in read_html
    return _parse(
           ^^^^^^^
  File "/home/naurasd/.local/lib/python3.12/site-packages/pandas/io/html.py", line 1003, in _parse
    raise retained
  File "/home/naurasd/.local/lib/python3.12/site-packages/pandas/io/html.py", line 983, in _parse
    tables = p.parse_tables()
             ^^^^^^^^^^^^^^^^
  File "/home/naurasd/.local/lib/python3.12/site-packages/pandas/io/html.py", line 249, in parse_tables
    tables = self._parse_tables(self._build_doc(), self.match, self.attrs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/naurasd/.local/lib/python3.12/site-packages/pandas/io/html.py", line 598, in _parse_tables
    raise ValueError("No tables found")
ValueError: No tables found

Here are the last 4 lines of the output file:

07:22:38: Downloaded top 100 species level records for ASV3538
07:22:44: Downloaded top 100 species level records for ASV1775
07:23:01: Downloaded top 100 species level records for ASV1232
07:23:21: Downloaded top 100 species level records for ASV1099

I really have no clue what is going on here. Any help appreciated.

Thanks as always for your hard work!

Nauras

DominikBuchner commented 3 months ago

BOLDigger2 tries to fetch the top 100 table from the HTML that BOLD returns. Sometimes there is no table, even though parsable HTML is returned, and so far I don't know why. I will look into this. I would love to know which sequence triggers it, but since the failures are random, that is hard to find out. I will have to write a temporary fix that just writes the OTU and the HTML to a file to find out what is going on. I think restarting solves the issue, which makes it even stranger: it does not seem to be a problem on BOLD's end or in the code, just random behavior. Can you send me your download links file, so I have a starting point?
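
A temporary diagnostic of the kind described above could look roughly like the sketch below. It is only an illustration, not the code that ended up in BOLDigger2: `parse_or_dump`, `demo_parser` and the dump directory are hypothetical names, and the parser is passed in as a plain callable so the idea can be shown without a network request. With pandas, the callable would presumably wrap `pd.read_html`, which raises `ValueError: No tables found` as in the traceback.

```python
from pathlib import Path
from typing import Callable, Optional


def parse_or_dump(
    html: str,
    otu_id: str,
    parser: Callable[[str], list],
    dump_dir: Path,
) -> Optional[list]:
    """Parse tables from html; if none are found, save the raw HTML for inspection.

    `parser` is any callable that raises ValueError when the HTML contains no
    table (assumption: with pandas, something like
    lambda h: pd.read_html(io.StringIO(h))).
    """
    try:
        return parser(html)
    except ValueError:
        # Keep the offending response next to the OTU name so it can be examined later.
        dump_dir.mkdir(parents=True, exist_ok=True)
        (dump_dir / f"{otu_id}.html").write_text(html, encoding="utf-8")
        return None


def demo_parser(html: str) -> list:
    """Hypothetical stand-in for a real table parser, used only for illustration."""
    if "<table" not in html:
        raise ValueError("No tables found")
    return ["table"]
```

Calling `parse_or_dump(html, "ASV1099", demo_parser, Path("debug_html"))` returns the parsed tables when a table is present, and otherwise returns `None` after writing `debug_html/ASV1099.html`.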

DominikBuchner commented 3 months ago

To give a more concise answer, since I'm fully awake now: This is a tough nut to crack so this will take a bit more time for me. It appears that there are different reasons for this no tables found error so I have a few options:

  1. I could just retry if no table is found --> this might loop infinitely if there actually is no table, which was sometimes the case for boldigger-cline, as you might remember. I'd like to avoid these infinite loops as far as possible.
  2. I could skip the sequence --> if your internet connection or the BOLD server breaks down, this will lead to data loss.

So: Both "quick and dirty" fixes are too dangerous to just be applied. I'll have to go search for the actual cause, to tailor a solution that just captures this exception and nothing else. Since the error seems to appear at random this naturally takes some time. But I'll get there with the help of the users :)

TLDR: For your data, simply restarting may very well fix the issue, while I'm looking for a solution.
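
For illustration only, a middle ground between the two options above is a bounded retry: try a fixed number of times, then fail loudly instead of looping forever or silently dropping the sequence. The sketch below is an assumption about how that could be written, not the fix that shipped in BOLDigger2; `fetch_with_retries`, `NoTableError`, and the `fetch_and_parse` callable are hypothetical names.

```python
import asyncio
from typing import Awaitable, Callable


class NoTableError(Exception):
    """Raised when every attempt returned HTML without a table."""


async def fetch_with_retries(
    fetch_and_parse: Callable[[], Awaitable[list]],
    max_attempts: int = 3,
    base_delay: float = 1.0,
) -> list:
    """Await fetch_and_parse up to max_attempts times, backing off between tries.

    fetch_and_parse is a hypothetical coroutine function that downloads one
    result page and parses it, raising ValueError when no table is found.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return await fetch_and_parse()
        except ValueError as error:
            last_error = error
            if attempt < max_attempts:
                # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
                await asyncio.sleep(base_delay * 2 ** (attempt - 1))
    # Give up explicitly so the caller can decide what to do with this sequence.
    raise NoTableError(f"no table after {max_attempts} attempts") from last_error
```

Because the loop is capped, a page that genuinely has no table ends in a `NoTableError` that names the problem, and the caller can log the sequence and stop rather than lose data without notice.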

naurasd commented 3 months ago

thanks for the reply in the early morning hours ;-)

have sent you the download links files via email.

Nauras

DominikBuchner commented 2 months ago

Fixed with 2.0.0

naurasd commented 2 months ago

Unfortunately, this is not fixed with 2.0.1 (and python/3.12.1).

All files attached as txt files for you to reproduce the issue:

Fasta file:

COI_cluster_reps_lulu_curated.txt

h5.lz file:

sent to you via wetransfer, file size too large

Job error and output files:

digger_error.txt digger_output.txt

DominikBuchner commented 2 months ago

So it's still the additional data download that simply fails?

DominikBuchner commented 2 months ago

Fixed with 2.0.3

naurasd commented 2 months ago

amazing, thanks!