HobnobMancer / cazomevolve

`cazomevolve` ('cazome-evolve') investigates the evolution of CAZomes, and identifies CAZy families that co-occur within the genomes of candidate species, more frequently than would be expected by lineage.
https://hobnobmancer.github.io/cazomevolve/
MIT License

Issue with cazywebscraper creating local database when calling it using the build_cazy_db command #22

Closed: PeterMBlack closed this issue 3 months ago

PeterMBlack commented 5 months ago

When I try to build a new local CAZy database with the build_cazy_db command, it gets to 24% of the batches retrieving the tax info from NCBI and then throws an error saying it failed to parse the XML data. The error is fatal, so the resulting database is incomplete. The traceback is as follows:

Retrieving organism from NCBI: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 191/191 [00:00<00:00, 11539.74it/s]
Batch retrieving tax info from NCBI. Batch size:200:  24%|█████████████████████████████▎                                                                                              | 63/266 [06:48<21:57,  6.49s/it]
Traceback (most recent call last):
  File "/home/pmb9/anaconda3/envs/dbcan_cazomevolve_3.9/bin/cazy_webscraper", line 8, in <module>
    sys.exit(main())
  File "/home/pmb9/anaconda3/envs/dbcan_cazomevolve_3.9/lib/python3.9/site-packages/cazy_webscraper/cazy_scraper.py", line 268, in main
    get_cazy_data(
  File "/home/pmb9/anaconda3/envs/dbcan_cazomevolve_3.9/lib/python3.9/site-packages/cazy_webscraper/cazy_scraper.py", line 378, in get_cazy_data
    cazy_data, successful_replacement = replace_multiple_tax(
  File "/home/pmb9/anaconda3/envs/dbcan_cazomevolve_3.9/lib/python3.9/site-packages/cazy_webscraper/ncbi/taxonomy/multiple_taxa.py", line 173, in replace_multiple_tax
    cazy_data = get_ncbi_tax(epost_results, cazy_data, replaced_taxa_logger, args)
  File "/home/pmb9/anaconda3/envs/dbcan_cazomevolve_3.9/lib/python3.9/site-packages/cazy_webscraper/ncbi/taxonomy/multiple_taxa.py", line 204, in get_ncbi_tax
    protein_records = Entrez.read(record_handle, validate=False)
  File "/home/pmb9/anaconda3/envs/dbcan_cazomevolve_3.9/lib/python3.9/site-packages/Bio/Entrez/__init__.py", line 518, in read
    record = handler.read(source)
  File "/home/pmb9/anaconda3/envs/dbcan_cazomevolve_3.9/lib/python3.9/site-packages/Bio/Entrez/Parser.py", line 409, in read
    raise CorruptedXMLError(e) from None
Bio.Entrez.Parser.CorruptedXMLError: Failed to parse the XML data (not well-formed (invalid token): line 36349, column 20). Please make sure that the input data are not corrupted.
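
The traceback shows a single malformed XML response aborting the whole run. One possible workaround, not necessarily how cazy_webscraper handles it, is to retry a batch when its XML cannot be parsed. Below is a minimal sketch using only the standard library. `fetch_xml` is a hypothetical callable standing in for the NCBI fetch, and `xml.etree.ElementTree.ParseError` stands in for `Bio.Entrez.Parser.CorruptedXMLError`:

```python
import time
import xml.etree.ElementTree as ET


def parse_with_retry(fetch_xml, retries=3, delay=0.0):
    """Call fetch_xml() and parse the result, retrying if the XML is malformed.

    fetch_xml is a hypothetical zero-argument callable returning an XML string.
    Returns the parsed root element, or None if every attempt failed.
    """
    for attempt in range(1, retries + 1):
        try:
            return ET.fromstring(fetch_xml())
        except ET.ParseError as err:
            # Stand-in for Bio.Entrez.Parser.CorruptedXMLError in the traceback
            print(f"Attempt {attempt}/{retries} failed: {err}")
            if attempt < retries:
                time.sleep(delay)
    return None


# Simulated responses: a truncated payload first, then a well-formed one
responses = iter(["<TaxaSet><Taxon>", "<TaxaSet><Taxon>Aspergillus</Taxon></TaxaSet>"])
root = parse_with_retry(lambda: next(responses))
```

Returning `None` instead of raising lets the caller skip or log a failed batch and carry on, so one bad response does not leave the database half built.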

Also, I'm unsure whether this is related, but above the traceback the warning messages say that all 200 of the entries have been retrieved from NCBI but aren't present in CAZy. This suggests something is going wrong when the GenBank accessions are compared with the contents of the downloaded CAZy txt file, although this is only my guess. The console output for the batch just before the error above is shown below:

GenBank accession P18024.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession T20648.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession U07995.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession 1911266A retrieved from NCBI, but it is not present in CAZy
GenBank accession D16111.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession AAB31321.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession L19406.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession AAA10945.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession S98727.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession AAA11842.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession AAB23601.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession P12333.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession 1211240A retrieved from NCBI, but it is not present in CAZy
GenBank accession AAB28971.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession AAA60901.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession JQ1574 retrieved from NCBI, but it is not present in CAZy
GenBank accession J03944.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession S44178.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession T68390.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession AAA44484.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession S71431.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession U08203.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession AAA57721.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession P34019.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession X53241.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession L11788.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession X01826.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession AAA72821.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession 751456A retrieved from NCBI, but it is not present in CAZy
GenBank accession  retrieved from NCBI, but it is not present in CAZy
GenBank accession L19024.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession J00797.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession J00798.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession A41802 retrieved from NCBI, but it is not present in CAZy
GenBank accession U02360.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession 1819402D retrieved from NCBI, but it is not present in CAZy
GenBank accession AAA60906.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession X72669.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession A04670.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession L31877.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession T05721.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession D42586 retrieved from NCBI, but it is not present in CAZy
GenBank accession M82621.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession  retrieved from NCBI, but it is not present in CAZy
GenBank accession TPCHCS retrieved from NCBI, but it is not present in CAZy
GenBank accession JQ1192 retrieved from NCBI, but it is not present in CAZy
GenBank accession W6WL42 retrieved from NCBI, but it is not present in CAZy
GenBank accession X53938.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession AAA06360.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession H37532.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession CAA45589.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession M35054.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession S61928.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession AAA11138.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession P16500 retrieved from NCBI, but it is not present in CAZy
GenBank accession CAA78959.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession T05472.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession P20633.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession AAA02011.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession L14192.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession AAA40531.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession 1310357A retrieved from NCBI, but it is not present in CAZy
GenBank accession AAB03000.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession Z26126.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession AAC53819.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession AAB28772.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession L27542.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession L33718.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession X72046.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession CAA39492.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession X12965.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession  retrieved from NCBI, but it is not present in CAZy
GenBank accession  retrieved from NCBI, but it is not present in CAZy
GenBank accession X60724.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession CAA37082.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession CAA44363.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession CAA77567.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession  retrieved from NCBI, but it is not present in CAZy
GenBank accession CAA50214.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession CAA40875.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession X06743.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession M22218.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession AAA40965.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession P31690 retrieved from NCBI, but it is not present in CAZy
GenBank accession 1LOA_B retrieved from NCBI, but it is not present in CAZy
GenBank accession Z35004.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession AAA82141.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession 1805207B retrieved from NCBI, but it is not present in CAZy
GenBank accession L21554.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession 1MUP retrieved from NCBI, but it is not present in CAZy
GenBank accession CAA53424.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession AAA31665.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession L05615.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession AAA67417.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession L22339.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession D12618.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession S71659.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession X76387.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession D28026.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession L34745.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession U03117.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession T07409.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession S40351 retrieved from NCBI, but it is not present in CAZy
GenBank accession Z36943.8 retrieved from NCBI, but it is not present in CAZy
GenBank accession P14330 retrieved from NCBI, but it is not present in CAZy
GenBank accession AAA03368.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession CAA40858.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession Q07857.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession CAA55165.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession 1RNE retrieved from NCBI, but it is not present in CAZy
GenBank accession CAA47443.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession A18111.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession D23774.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession AAB29165.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession AAA19385.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession A10652.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession TVVPMH retrieved from NCBI, but it is not present in CAZy
GenBank accession S72324.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession  retrieved from NCBI, but it is not present in CAZy
GenBank accession U13145.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession T48908.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession CAA87561.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession AAA13325.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession D34770 retrieved from NCBI, but it is not present in CAZy
GenBank accession Z41540.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession I04184.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession X12644.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession Z44618.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession D38941.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession T83760.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession A16632.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession I09598.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession D35297.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession AAA16421.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession CAA32518.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession X04268.3 retrieved from NCBI, but it is not present in CAZy
GenBank accession S17906 retrieved from NCBI, but it is not present in CAZy
GenBank accession S21994 retrieved from NCBI, but it is not present in CAZy
GenBank accession N26282.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession CAA31638.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession  retrieved from NCBI, but it is not present in CAZy
GenBank accession X08066.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession CAA45320.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession AAA36604.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession T10384.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession D35024.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession JN0614 retrieved from NCBI, but it is not present in CAZy
GenBank accession U07046.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession AAD10609.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession L20302.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession 1910235A retrieved from NCBI, but it is not present in CAZy
GenBank accession Z32689.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession AAA75530.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession AAB29406.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession S36959 retrieved from NCBI, but it is not present in CAZy
GenBank accession Z28826.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession F26421 retrieved from NCBI, but it is not present in CAZy
GenBank accession X52871.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession  retrieved from NCBI, but it is not present in CAZy
GenBank accession CAA38391.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession  retrieved from NCBI, but it is not present in CAZy
GenBank accession CAA40611.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession N1AT1F retrieved from NCBI, but it is not present in CAZy
GenBank accession X59684.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession V01347.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession CAA46179.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession X62332.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession IJHULM retrieved from NCBI, but it is not present in CAZy
GenBank accession A29851 retrieved from NCBI, but it is not present in CAZy
GenBank accession  retrieved from NCBI, but it is not present in CAZy
GenBank accession A25114 retrieved from NCBI, but it is not present in CAZy
GenBank accession R3YM19 retrieved from NCBI, but it is not present in CAZy
GenBank accession X07049.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession  retrieved from NCBI, but it is not present in CAZy
GenBank accession PT0677 retrieved from NCBI, but it is not present in CAZy
GenBank accession A29784 retrieved from NCBI, but it is not present in CAZy
GenBank accession YLDGA retrieved from NCBI, but it is not present in CAZy
GenBank accession S09590 retrieved from NCBI, but it is not present in CAZy
GenBank accession S15418 retrieved from NCBI, but it is not present in CAZy
GenBank accession X12802.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession X02553.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession CAA43904.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession X53247.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession A35514 retrieved from NCBI, but it is not present in CAZy
GenBank accession X17263.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession CAA33308.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession CAA79521.1 retrieved from NCBI, but it is not present in CAZy
GenBank accession B25126 retrieved from NCBI, but it is not present in CAZy
GenBank accession X70344.1 retrieved from NCBI, but it is not present in CAZy
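
Several of the warnings above have an empty accession. To see how common that is, the warnings can be tallied from saved console output. This is a minimal sketch; the regular expression and the function name are illustrative and are not part of cazy_webscraper:

```python
import re
from collections import Counter

# Matches the warning lines shown above; the accession may be empty
WARNING = re.compile(
    r"^GenBank accession (\S*) retrieved from NCBI, but it is not present in CAZy$"
)


def summarise_warnings(lines):
    """Count warnings with an empty accession versus a named accession."""
    counts = Counter()
    for line in lines:
        match = WARNING.match(line.strip())
        if match:
            counts["empty" if match.group(1) == "" else "named"] += 1
    return counts


sample = [
    "GenBank accession 751456A retrieved from NCBI, but it is not present in CAZy",
    "GenBank accession  retrieved from NCBI, but it is not present in CAZy",
    "GenBank accession L19024.1 retrieved from NCBI, but it is not present in CAZy",
]
summary = summarise_warnings(sample)
```

With the three sample lines, `summary["empty"]` is 1 and `summary["named"]` is 2.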

HobnobMancer commented 5 months ago

Hi,

This is an issue with cazy_webscraper so I've raised an issue over there.

For progress please see the relevant cazy_webscraper issue. Once I believe the issue is resolved I'll post an update here.

HobnobMancer commented 3 months ago

This should be fixed in cazy_webscraper >= 2.3.0.3
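
To check whether an environment already has a version at or above 2.3.0.3, something like the sketch below may help. It assumes the installed distribution is registered as `cazy_webscraper` (the import name in the traceback; the distribution name may differ) and that the version string is purely numeric. Both `parse_version` and `has_fix` are hypothetical helpers:

```python
from importlib.metadata import PackageNotFoundError, version

MIN_VERSION = (2, 3, 0, 3)  # the version named in the comment above


def parse_version(text):
    """Turn a purely numeric dotted version such as '2.3.0.3' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def has_fix(dist_name="cazy_webscraper"):
    """Return True or False for an installed package, or None if it is not installed."""
    try:
        installed = version(dist_name)  # dist_name is an assumption, see above
    except PackageNotFoundError:
        return None
    return parse_version(installed) >= MIN_VERSION
```

Calling `has_fix()` inside the conda environment used for the build reports whether an upgrade is needed before re-running build_cazy_db.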