HobnobMancer / cazy_webscraper

Web scraper to retrieve protein data catalogued by the CAZy, UniProt, NCBI, GTDB and PDB websites/databases.
https://hobnobmancer.github.io/cazy_webscraper/
MIT License
12 stars, 3 forks

Crashing when retrieving taxs from NCBI - perhaps related to #120 #124

Closed bharat1912 closed 6 months ago

bharat1912 commented 6 months ago

Please complete this report in full and as much detail as possible. It will help with getting the bug fixed far sooner!

## To Reproduce

Please include the specific steps (including all code) you performed, so that we can check if the behaviour can be reproduced:

Install prerequisites and activate the environment:
$ mamba create -n cazomevolve python=3.9
$ mamba activate cazomevolve

Install cazomevolve from the GitHub repository (with pip):
$ git clone https://github.com/HobnobMancer/cazomevolve.git
$ cd cazomevolve
$ python3 -m pip install cazomevolve/.

$ cazomevolve --version
0.1.7.3

Install dbcan:
$ mamba install -c conda-forge dbcan

Download CAZy database with cazomevolve activated:

(cazomevolve) bharat@bharat-Precision-Tower-7810:~$ cazy_webscraper -o /media/bharat/volume2/db/cazy_db/
Using default CAZy class synonyms
Built output directory: /media/bharat/volume2/db
Built new local CAZyme database at /media/bharat/volume2/db/cazy_db
Built output directory: /media/bharat/volume2/db/.cazy_webscraper_2024-02-25_20-55-10
[WARNING] [cazy_webscraper.cazy_scraper]: Created cache dir: /media/bharat/volume2/db/.cazy_webscraper_2024-02-25_20-55-10
Downloading CAZy txt file: 100%|██████████| 40626574/40626574 [01:13<00:00, 553302.84it/s]
Parsing CAZy txt file: 100%|██████████| 4445596/4445596 [10:49<00:00, 6846.22it/s]
Searching for multiple taxa annotations: 100%|██████████| 3477359/3477359 [00:15<00:00, 229819.30it/s]
Batch retrieving tax info from NCBI. Batch size:200: 0%| | 0/267 [00:00<?, ?it/s
GenBank accession AAB28815.1 retrieved from NCBI, but it is not present in CAZy | 0/199 [00:00<?, ?it/s]
GenBank accession AAA35470.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession M83801.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession AAB26309.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession CAA78311.1 retrieved from NCBI, but it is not present in CAZy
...
Retrieving organism from NCBI: 100%|██████████| 195/195 [00:00<00:00, 21455.08it/s]
Batch retrieving tax info from NCBI. Batch size:200: 4%|████▋ | 11/267 [01:03<24:48, 5.82s/it]

Traceback (most recent call last):
  File "/home/bharat/mambaforge/envs/cazomevolve/lib/python3.9/http/client.py", line 560, in _get_chunk_left
    chunk_left = self._read_next_chunk_size()
  File "/home/bharat/mambaforge/envs/cazomevolve/lib/python3.9/http/client.py", line 527, in _read_next_chunk_size
    return int(line, 16)
ValueError: invalid literal for int() with base 16: b''

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/bharat/mambaforge/envs/cazomevolve/lib/python3.9/http/client.py", line 592, in _readinto_chunked
    chunk_left = self._get_chunk_left()
  File "/home/bharat/mambaforge/envs/cazomevolve/lib/python3.9/http/client.py", line 562, in _get_chunk_left
    raise IncompleteRead(b'')
http.client.IncompleteRead: IncompleteRead(0 bytes read)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/bharat/mambaforge/envs/cazomevolve/bin/cazy_webscraper", line 8, in <module>
    sys.exit(main())
  File "/home/bharat/mambaforge/envs/cazomevolve/lib/python3.9/site-packages/cazy_webscraper/cazy_scraper.py", line 268, in main
    get_cazy_data(
  File "/home/bharat/mambaforge/envs/cazomevolve/lib/python3.9/site-packages/cazy_webscraper/cazy_scraper.py", line 378, in get_cazy_data
    cazy_data, successful_replacement = replace_multiple_tax(
  File "/home/bharat/mambaforge/envs/cazomevolve/lib/python3.9/site-packages/cazy_webscraper/ncbi/taxonomy/multiple_taxa.py", line 170, in replace_multiple_tax
    cazy_data = get_ncbi_tax(epost_results, cazy_data, replaced_taxa_logger, args)
  File "/home/bharat/mambaforge/envs/cazomevolve/lib/python3.9/site-packages/cazy_webscraper/ncbi/taxonomy/multiple_taxa.py", line 201, in get_ncbi_tax
    protein_records = Entrez.read(record_handle, validate=False)
  File "/home/bharat/mambaforge/envs/cazomevolve/lib/python3.9/site-packages/Bio/Entrez/__init__.py", line 503, in read
    record = handler.read(handle)
  File "/home/bharat/mambaforge/envs/cazomevolve/lib/python3.9/site-packages/Bio/Entrez/Parser.py", line 392, in read
    self.parser.ParseFile(handle)
  File "/home/bharat/mambaforge/envs/cazomevolve/lib/python3.9/http/client.py", line 463, in read
    n = self.readinto(b)
  File "/home/bharat/mambaforge/envs/cazomevolve/lib/python3.9/http/client.py", line 497, in readinto
    return self._readinto_chunked(b)
  File "/home/bharat/mambaforge/envs/cazomevolve/lib/python3.9/http/client.py", line 608, in _readinto_chunked
    raise IncompleteRead(bytes(b[0:total_bytes]))
http.client.IncompleteRead: IncompleteRead(441 bytes read)

## Describe the bug

The CAZy database download fails at 4% of the "Batch retrieving tax info from NCBI" step (batch 11 of 267). The full error is shown above.

## Expected behavior

Expected the database to be downloaded

## Screenshots

Part of the download output and the complete error are reproduced above.

## Setup

Please provide a brief summary of your setup/computer you are using. For example:

Desktop (please complete the following information):

Smartphone (please complete the following information): Not used.

## Additional context

Nil

HobnobMancer commented 6 months ago

Hi!

Thanks for using cazy_webscraper - sorry it's not working at the moment.

This issue is a duplicate of #120 and #125 - these are all related to parsing incomplete XML files from NCBI. This is typically the result of an interrupted connection to NCBI when downloading the XML.

I will close this issue, reopen #120, and continue the work there.

This shouldn't take long to fix, so please bear with me!
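
In the meantime, one common way to cope with an interrupted download like the `http.client.IncompleteRead` in the traceback is to retry the whole fetch-and-parse step. The sketch below is a generic illustration only, not the fix planned for cazy_webscraper. The helper `call_with_retries` and its parameters are hypothetical names. It assumes the caller passes a zero-argument function that opens a fresh connection and parses the response each time it is called.

```python
import time
from http.client import IncompleteRead


def call_with_retries(fetch_and_parse, max_tries=3, delay_seconds=1.0):
    """Call fetch_and_parse(), retrying if the HTTP response is cut short.

    fetch_and_parse is a hypothetical zero-argument callable that both
    requests the data and parses it, so every retry starts a new download
    instead of re-reading a half-consumed handle.
    """
    for attempt in range(1, max_tries + 1):
        try:
            return fetch_and_parse()
        except IncompleteRead:
            if attempt == max_tries:
                raise  # out of tries: surface the original error
            # Wait a little longer after each failed attempt (linear back-off).
            time.sleep(delay_seconds * attempt)
```

With Biopython, the callable could wrap the request together with the `Entrez.read(record_handle, validate=False)` call seen in the traceback. How `record_handle` is obtained in cazy_webscraper is not shown in this thread, so that part would need to be adapted to the real code.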