HobnobMancer / cazy_webscraper

Web scraper to retrieve protein data catalogued by the CAZy, UniProt, NCBI, GTDB and PDB websites/databases.
https://hobnobmancer.github.io/cazy_webscraper/
MIT License

cazy_webscraper - error downloading database #125

Closed bharat1912 closed 5 months ago

bharat1912 commented 5 months ago

Please complete this report in full and as much detail as possible. It will help with getting the bug fixed far sooner!

Describe the bug

Error downloading database

To Reproduce

Please include the specific steps (including all code) you performed, so that we can check if the behaviour can be reproduced:

Install:
$mamba create -n cazy_webscraper -c conda-forge -c bioconda python=3.8
$conda activate cazy_webscraper
$mamba install -c conda-forge cazy_webscraper

$cazy_webscraper --version
=====================cazy_webscraper Version Information=====================
cazy_webscraper version: cazy_webscraper version: 2.3.0.2

Third party tools used by cazy_webscraper:
beautifulsoup4: 4.12.3
biopython: 1.83
bioservices: 1.11.2
html5lib: 1.1
lxml: 5.1.0
mechanicalsoup: 1.2.0
numpy: 1.24.4
pandas: 2.0.3
requests: 2.31.0
saintBioutils: 0.0.25
sqlalchemy: 1.4.20
tqdm: 4.66.1

Usage: $cazy_webscraper -o /media/bharat/volume2/db/cazy_db/

Error: (in part)

Retrieving organism from NCBI: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 195/195 [00:00<00:00, 16099.55it/s]
Batch retrieving tax info from NCBI. Batch size:200: 24%|███████████████████████████▏ | 63/267 [07:41<22:02, 6.48s/it
GenBank accession U07046.1 retrieved from NCBI, but it is not present in CAZy | 0/186 [00:00<?, ?it/s]
GenBank accession D35024.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession AAA36604.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession L20302.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession T10384.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession Z28826.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession Z32689.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession AAA75530.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession 1910235A retrieved from NCBI, but it is not present in CAZy
GenBank accession AAB29406.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession S36959 retrieved from NCBI, but it is not present in CAZy
GenBank accession F26421 retrieved from NCBI, but it is not present in CAZy
GenBank accession retrieved from NCBI, but it is not present in CAZy
GenBank accession CAA40611.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession retrieved from NCBI, but it is not present in CAZy
GenBank accession CAA38391.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession X52871.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession A29784 retrieved from NCBI, but it is not present in CAZy
GenBank accession retrieved from NCBI, but it is not present in CAZy
GenBank accession X62332.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession X07049.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession A25114 retrieved from NCBI, but it is not present in CAZy
GenBank accession N1AT1F retrieved from NCBI, but it is not present in CAZy
GenBank accession R3YM19 retrieved from NCBI, but it is not present in CAZy
GenBank accession retrieved from NCBI, but it is not present in CAZy
GenBank accession A29851 retrieved from NCBI, but it is not present in CAZy
GenBank accession PT0677 retrieved from NCBI, but it is not present in CAZy
GenBank accession V01347.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession CAA46179.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession IJHULM retrieved from NCBI, but it is not present in CAZy
GenBank accession X59684.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession S09590 retrieved from NCBI, but it is not present in CAZy
GenBank accession Z14261.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession X53247.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession X15852.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession CAA36113.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession YLDGA retrieved from NCBI, but it is not present in CAZy
GenBank accession X06648.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession CAA43904.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession A26574 retrieved from NCBI, but it is not present in CAZy
GenBank accession S15418 retrieved from NCBI, but it is not present in CAZy
GenBank accession S04853 retrieved from NCBI, but it is not present in CAZy
GenBank accession CAA33308.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession X02553.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession A35514 retrieved from NCBI, but it is not present in CAZy
GenBank accession A25470 retrieved from NCBI, but it is not present in CAZy
GenBank accession X12802.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession X17263.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession B25126 retrieved from NCBI, but it is not present in CAZy
GenBank accession B41268 retrieved from NCBI, but it is not present in CAZy
GenBank accession X70344.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession CAA79521.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession X82425.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession P38360.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession T53233.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession X83272.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession T11539.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession AAA83640.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession CAA43271.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession CAA42647.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession Z20020.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession Z21512.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession CAA38673.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession X03927.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession X55026.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession X12488.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession X66456.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession X62105.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession CAA44103.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession CAA31076.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession X57366.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession X62183.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession CAA40202.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession X52998.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession X14549.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession CAA68574.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession X52999.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession CAA33626.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession CAA26307.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession X02170.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession X05295.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession CAA40747.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession X02169.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession X70895.2 retrieved from NCBI, but it is not present in CAZy
GenBank accession retrieved from NCBI, but it is not present in CAZy
GenBank accession X70917.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession X70939.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession X06087.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession X53545.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession CAA30772.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession X57445.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession CAA30835.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession CAA32504.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession CAA28227.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession CAA40690.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession X15840.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession X51805.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession Z11758.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession CAA46899.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession CAA34135.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession retrieved from NCBI, but it is not present in CAZy
GenBank accession Z11478.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession CAA30602.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession Z21556.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession retrieved from NCBI, but it is not present in CAZy
GenBank accession Z19555.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession T20780.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession A15074.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession A15068.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession X72758.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession 1GRG retrieved from NCBI, but it is not present in CAZy
GenBank accession T20781.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession A14803.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession CAA40804.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession CAA28535.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession X04834.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession X61189.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession X66818.4 retrieved from NCBI, but it is not present in CAZy
GenBank accession retrieved from NCBI, but it is not present in CAZy
GenBank accession CAA44572.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession X68820.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession X60101.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession CAA30028.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession Z33689.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession U06606.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession T16118.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession T14744.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession A11911.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession D20349.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession U06600.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession P32966 retrieved from NCBI, but it is not present in CAZy
GenBank accession A48823 retrieved from NCBI, but it is not present in CAZy
GenBank accession T20553.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession AAC22465.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession AAB06350.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession D41620.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession L12065.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession CAA82878.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession CAA01324.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession S30582 retrieved from NCBI, but it is not present in CAZy
GenBank accession 1906283A retrieved from NCBI, but it is not present in CAZy
GenBank accession Q02096.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession Z12550.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession A06196.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession D13981.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession F08949.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession I05112.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession AAC53945.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession T36429.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession X12696.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession CAA32850.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession X03771.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession Z14729.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession CAA28083.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession retrieved from NCBI, but it is not present in CAZy
GenBank accession X51379.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession X07229.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession X59604.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession X03921.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession retrieved from NCBI, but it is not present in CAZy
GenBank accession X04411.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession CAA30712.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession X59117.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession X04599.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession X63791.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession X67017.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession retrieved from NCBI, but it is not present in CAZy
GenBank accession CAA50242.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession CAA43926.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession Z14830.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession X04298.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession CAA28092.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession X04541.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession retrieved from NCBI, but it is not present in CAZy
GenBank accession X69096.6 retrieved from NCBI, but it is not present in CAZy
GenBank accession X06074.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession V01350.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession CAA35001.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession X01002.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession retrieved from NCBI, but it is not present in CAZy
GenBank accession retrieved from NCBI, but it is not present in CAZy
GenBank accession S25834 retrieved from NCBI, but it is not present in CAZy
GenBank accession A04408.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession 1307222C retrieved from NCBI, but it is not present in CAZy
GenBank accession T08021.1 retrieved from NCBI, but it is not present in CAZy
Retrieving organism from NCBI: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 186/186 [00:00<00:00, 15394.07it/s]
Batch retrieving tax info from NCBI. Batch size:200: 24%|███████████████████████████▌ | 64/267 [07:50<24:51, 7.35s/it]
Traceback (most recent call last):
  File "/home/bharat/mambaforge/envs/cazy_webscraper/bin/cazy_webscraper", line 10, in <module>
    sys.exit(main())
  File "/home/bharat/mambaforge/envs/cazy_webscraper/lib/python3.8/site-packages/cazy_webscraper/cazy_scraper.py", line 268, in main
    get_cazy_data(
  File "/home/bharat/mambaforge/envs/cazy_webscraper/lib/python3.8/site-packages/cazy_webscraper/cazy_scraper.py", line 378, in get_cazy_data
    cazy_data, successful_replacement = replace_multiple_tax(
  File "/home/bharat/mambaforge/envs/cazy_webscraper/lib/python3.8/site-packages/cazy_webscraper/ncbi/taxonomy/multiple_taxa.py", line 173, in replace_multiple_tax
    cazy_data = get_ncbi_tax(epost_results, cazy_data, replaced_taxa_logger, args)
  File "/home/bharat/mambaforge/envs/cazy_webscraper/lib/python3.8/site-packages/cazy_webscraper/ncbi/taxonomy/multiple_taxa.py", line 204, in get_ncbi_tax
    protein_records = Entrez.read(record_handle, validate=False)
  File "/home/bharat/mambaforge/envs/cazy_webscraper/lib/python3.8/site-packages/Bio/Entrez/__init__.py", line 518, in read
    record = handler.read(source)
  File "/home/bharat/mambaforge/envs/cazy_webscraper/lib/python3.8/site-packages/Bio/Entrez/Parser.py", line 409, in read
    raise CorruptedXMLError(e) from None
Bio.Entrez.Parser.CorruptedXMLError: Failed to parse the XML data (not well-formed (invalid token): line 1117, column 20). Please make sure that the input data are not corrupted.

Setup

Please provide a brief summary of your setup/computer you are using. For example:

Desktop (please complete the following information): Ubuntu, 18.04 mambaforge

HobnobMancer commented 5 months ago

As mentioned in #124, this is a duplicate of #120, which relates to an issue with parsing incomplete XML files from NCBI. This issue will be closed and work will continue on #120.
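
For readers hitting the same traceback: one common way to cope with an incomplete XML response is to fetch the batch again when parsing fails. The sketch below only illustrates that idea with the Python standard library; it is not the fix adopted in #120 and not cazy_webscraper code. The names parse_with_retry and fake_fetch and the sample XML are hypothetical. In the traceback above, the parse step is Entrez.read and the error raised is Bio.Entrez.Parser.CorruptedXMLError; here xml.etree.ElementTree.ParseError stands in for it so the example runs without Biopython or network access.

```python
import time
import xml.etree.ElementTree as ET


def parse_with_retry(fetch, parse, retries=3, delay=0.0, errors=(ET.ParseError,)):
    """Call fetch(), then parse() the result; fetch again if parsing fails.

    Hypothetical helper: `fetch` returns the raw response, `parse` turns it
    into a record, and `errors` lists the exceptions that mean "bad payload".
    """
    last_error = None
    for attempt in range(1, retries + 1):
        payload = fetch()
        try:
            return parse(payload)
        except errors as err:
            last_error = err
            if attempt < retries:
                time.sleep(delay)  # pause before asking the server again
    raise RuntimeError(
        f"response still unparsable after {retries} attempts"
    ) from last_error


# Simulated server (made-up data): the first reply is cut off mid-document,
# the second reply is complete.
_replies = iter([
    "<TaxaSet><Taxon><TaxId>9606</TaxId>",
    "<TaxaSet><Taxon><TaxId>9606</TaxId></Taxon></TaxaSet>",
])


def fake_fetch():
    return next(_replies)


root = parse_with_retry(fake_fetch, ET.fromstring)
tax_id = root.find("Taxon/TaxId").text
print(tax_id)
```

The first simulated reply fails to parse, so the helper fetches once more and parses the complete reply. If every attempt fails, the helper raises RuntimeError with the last parse error attached as its cause, so the batch can be logged and skipped rather than ending the whole run.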