HobnobMancer / cazy_webscraper

Web scraper to retrieve protein data catalogued by the CAZy, UniProt, NCBI, GTDB and PDB websites/databases.
https://hobnobmancer.github.io/cazy_webscraper/
MIT License

Bio.Entrez NotXMLError #95

Closed: HobnobMancer closed this issue 2 years ago

HobnobMancer commented 2 years ago

Please complete this report in full and as much detail as possible. It will help with getting the bug fixed far sooner!

Describe the bug

While retrieving protein sequences from NCBI, if Bio.Entrez raises a NotXMLError, the tool crashes and none of the remaining protein sequences are retrieved.

To Reproduce

Please include the specific steps (including all code) you performed, so that we can check if the behaviour can be reproduced:

Command: cw_get_genbank_seqs all_cazy_2022-08-22.db <email> --families GH50

Error:

Traceback (most recent call last):
  File "/home/user/anaconda3/.../cw_get_genbank_seqs", line 33, in <module>
    sys.exit(load_entry_point('cazy-webscraper', 'console_scripts', 'cw_get_genbank_seqs')())
  File "/home/user/.../cazy_webscraper/expand/genbank/sequences/get_genbank_sequences.py", line 160, in main
    seq_dict, no_seq = get_sequences(genbank_accessions, args)  # {gbk_accession: seq}
  File "/home/user/.../cazy_webscraper/expand/genbank/sequences/get_genbank_sequences.py", line 297, in get_sequences
    seq_dict, success_accessions, failed_accessions = retry_failed_queries(
  File "/home/user/.../cazy_webscraper/expand/genbank/sequences/get_genbank_sequences.py", line 366, in retry_failed_queries
    new_seq_dict, no_seq = get_sequences(query, args, retry=True)
  File "/home/user/.../cazy_webscraper/expand/genbank/sequences/get_genbank_sequences.py", line 223, in get_sequences
    epost_webenv, epost_query_key = bulk_query_ncbi(query_list, args)
  File "/home/user/.../cazy_webscraper/expand/genbank/sequences/get_genbank_sequences.py", line 337, in bulk_query_ncbi
    epost_result = Entrez.read(
  File "/home/user/anaconda3/.../Bio/Entrez/__init__.py", line 508, in read
    record = handler.read(handle)
  File "/home/user/anaconda3/.../Bio/Entrez/Parser.py", line 345, in read
    raise NotXMLError(e) from None
Bio.Entrez.Parser.NotXMLError: Failed to parse the XML data (no element found: line 1, column 0). Please make sure that the input data are in XML format.
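The "no element found: line 1, column 0" part of the message is the expat parser's error for input that contains no XML element at all, which suggests the response from NCBI was empty. The sketch below reproduces that parser error with the standard library alone. It does not call NCBI or Biopython, and the helper name `parse_error_for` is hypothetical.

```python
from xml.parsers import expat


def parse_error_for(payload: bytes) -> str:
    """Return the expat error message for payload, or "" if it parses cleanly."""
    parser = expat.ParserCreate()
    try:
        # The second argument marks this as the final chunk of the document.
        parser.Parse(payload, True)
    except expat.ExpatError as err:
        return str(err)
    return ""


# An empty body stands in for an empty response from the server (assumption).
empty_body_error = parse_error_for(b"")
valid_body_error = parse_error_for(b"<ePostResult></ePostResult>")
print(repr(empty_body_error), repr(valid_body_error))
```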

Expected behavior

cazy_webscraper should handle this error and continue retrieving the remaining protein sequences.
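A minimal sketch of the behaviour described above, not the fix that was actually merged: catch NotXMLError for each batch, retry a few times, record the accessions that still fail, and carry on with the next batch. The names `fetch_in_batches`, `query_batch` and `max_tries` are hypothetical; `query_batch` stands for whatever callable posts one batch of accessions to NCBI and parses the result. The stand-in exception class only exists so the sketch runs without Biopython installed.

```python
import logging
from typing import Callable, Dict, Iterable, List, Sequence, Tuple

try:
    # Exception class named in the traceback above.
    from Bio.Entrez.Parser import NotXMLError
except ImportError:
    class NotXMLError(ValueError):
        """Stand-in used only when Biopython is not installed."""

logger = logging.getLogger(__name__)


def fetch_in_batches(
    batches: Iterable[Sequence[str]],
    query_batch: Callable[[Sequence[str]], Dict[str, str]],
    max_tries: int = 3,
) -> Tuple[Dict[str, str], List[str]]:
    """Query every batch, skipping batches that keep raising NotXMLError.

    Returns ({accession: sequence}, [accessions that could not be retrieved]).
    """
    seqs: Dict[str, str] = {}
    failed: List[str] = []
    for batch in batches:
        for attempt in range(1, max_tries + 1):
            try:
                seqs.update(query_batch(batch))
                break  # this batch succeeded, move on to the next one
            except NotXMLError as err:
                logger.warning(
                    "Attempt %d/%d failed for a batch of %d accessions: %s",
                    attempt, max_tries, len(batch), err,
                )
        else:
            # Every attempt failed: record the accessions and keep going.
            failed.extend(batch)
    return seqs, failed
```

With this shape, one unparsable response costs only the affected batch, and the caller gets the list of failed accessions back so they can be logged or retried later.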

HobnobMancer commented 2 years ago

Fixed in PR 96, released in v2.2.1.