HobnobMancer / cazy_webscraper

Web scraper to retrieve protein data catalogued by the CAZy, UniProt, NCBI, GTDB and PDB websites/databases.
https://hobnobmancer.github.io/cazy_webscraper/
MIT License

Crashing when retrieving taxs from NCBI #120

Closed HobnobMancer closed 3 months ago

HobnobMancer commented 6 months ago

Please complete this report in full and as much detail as possible. It will help with getting the bug fixed far sooner!

Describe the bug

A clear and concise description of what the bug is. Please include what you are trying to get the tool to do.

cazy_webscraper crashes when receiving an incomplete read from NCBI, while downloading the latest taxonomy data for records with multiple taxa in CAZy.

To Reproduce

Please include the specific steps (including all code) you performed, so that we can check if the behaviour can be reproduced:

cazy_webscraper email -d database.db

Error:

Traceback (most recent call last):
  File ".....anaconda3/cw/bin/cazy_webscraper", line 8, in <module>
    sys.exit(main())
  File ".....anaconda3/cw/lib/python3.8/site-packages/cazy_webscraper/cazy_scraper.py", line 268, in main
    get_cazy_data(
  File ".....anaconda3/cw/lib/python3.8/site-packages/cazy_webscraper/cazy_scraper.py", line 378, in get_cazy_data
    cazy_data, successful_replacement = replace_multiple_tax(
  File ".....anaconda3/cw/lib/python3.8/site-packages/cazy_webscraper/ncbi/taxonomy/multiple_taxa.py", line 170, in replace_multiple_tax
    cazy_data = get_ncbi_tax(epost_results, cazy_data, replaced_taxa_logger, args)
  File ".....anaconda3/cw/lib/python3.8/site-packages/cazy_webscraper/ncbi/taxonomy/multiple_taxa.py", line 201, in get_ncbi_tax
    protein_records = Entrez.read(record_handle, validate=False)
  File ".....anaconda3/cw/lib/python3.8/site-packages/Bio/Entrez/__init__.py", line 518, in read
    record = handler.read(source)
  File ".....anaconda3/cw/lib/python3.8/site-packages/Bio/Entrez/Parser.py", line 403, in read
    self.parser.ParseFile(stream)
  File ".....anaconda3/cw/lib/python3.8/http/client.py", line 459, in read
    n = self.readinto(b)
  File ".....anaconda3/cw/lib/python3.8/http/client.py", line 493, in readinto
    return self._readinto_chunked(b)
  File ".....anaconda3/cw/lib/python3.8/http/client.py", line 604, in _readinto_chunked
    raise IncompleteRead(bytes(b[0:total_bytes]))
http.client.IncompleteRead: IncompleteRead(2004 bytes read)

Expected behavior

cazy_webscraper should catch the IncompleteRead error and carry on parsing rather than crash.

HobnobMancer commented 5 months ago

The issue persists: cazy_webscraper still fails to parse incomplete XML files downloaded from NCBI during the retrieval of NCBI taxonomies (see #124 and #125).

HobnobMancer commented 5 months ago

IncompleteRead and CorruptedXMLError need to be added to the try/except blocks on lines 201 and 204 in cazy_webscraper/ncbi/taxonomy/multiple_taxa.py
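A minimal sketch of what catching both exceptions could look like. This is not the code in multiple_taxa.py: `parse_ncbi_response`, `fetch` and `parse` are hypothetical names used only for illustration, and the fallback class exists only so the sketch runs without Biopython installed.

```python
import logging
from http.client import IncompleteRead

try:
    from Bio.Entrez.Parser import CorruptedXMLError
except ImportError:
    # Assumption: Biopython is not installed, so use a local stand-in
    # purely to keep this sketch runnable.
    class CorruptedXMLError(ValueError):
        """Local stand-in for Bio.Entrez.Parser.CorruptedXMLError."""

logger = logging.getLogger(__name__)


def parse_ncbi_response(fetch, parse):
    """Fetch and parse one NCBI response (hypothetical helper).

    fetch: zero-argument callable that returns an open response handle.
    parse: callable that parses the handle, for example
           lambda handle: Entrez.read(handle, validate=False).

    Returns the parsed records, or None if the read was incomplete
    or the XML was corrupted. Any other exception propagates.
    """
    try:
        handle = fetch()
        return parse(handle)
    except (IncompleteRead, CorruptedXMLError) as err:
        logger.warning("Could not read the NCBI response: %r", err)
        return None
```

Returning None leaves it to the caller to decide whether to retry the batch or skip it.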

Dr-Doomhammer commented 4 months ago

Has this bug been fixed?

HobnobMancer commented 4 months ago

I haven't been able to replicate this error myself. As far as I can tell it should be fixed in the latest version (2.3.0.2, and on branch issue_120_ncbi). I haven't had another chance to look at this until today.

Version 2.3.0.3 will be released later today, with a try/except for incomplete and corrupt reads on every call to NCBI to help with the problem. All cazy_webscraper can do is retry the connection to NCBI; persistent incomplete or corrupted reads are most likely caused by the connection closing prematurely, which is independent of cazy_webscraper.
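For illustration, the retry behaviour described above might look roughly like the sketch below. It is not taken from the 2.3.0.3 release: `call_with_retries`, `max_attempts`, `delay` and `retry_on` are all hypothetical names.

```python
import time
from http.client import IncompleteRead


def call_with_retries(call, max_attempts=3, delay=0.0, retry_on=(IncompleteRead,)):
    """Run call(), retrying when it raises one of the retry_on exceptions.

    call:         zero-argument callable that queries NCBI and parses the result.
    max_attempts: total number of tries before giving up (hypothetical default).
    delay:        seconds to wait between tries.
    retry_on:     exception types treated as transient; a caller could add
                  Bio.Entrez.Parser.CorruptedXMLError here as well.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except retry_on:
            if attempt == max_attempts:
                # Persistent failures are re-raised so the caller can report them.
                raise
            time.sleep(delay)
```

If every attempt fails, the last exception is re-raised, which matches the point that persistent incomplete reads cannot be fixed from within the scraper.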

HobnobMancer commented 4 months ago

This issue should now be addressed in version 2.3.0.3 - see release notes.

I'll leave this issue open for a while in case the issue persists.

HobnobMancer commented 3 months ago

As the issue seems to have been resolved, I will close it. If the problem persists, please feel free to reopen this issue.