Error during Entrez.efetch

fbartusch commented 8 months ago

One of our HPC users has problems running the Speciesprimer pipeline v2.1.2. The container worked two years ago and it's the same container today.

The error log:

26 Mar 2024 15:13:08: ['fatal error while working on', 'Cutibacterium_avidum', 'check logfile', '<pathToLogfile>']
fatal error while working on Cutibacterium_avidum
Traceback (most recent call last):
  File "/pipeline/speciesprimer.py", line 4153, in main
    run_pipeline_for_target(target, config)
  File "/pipeline/speciesprimer.py", line 4041, in run_pipeline_for_target
    newconfig = DataCollection(config).collect()
  File "/pipeline/speciesprimer.py", line 647, in collect
    self.get_ncbi_links(taxid)
  File "/pipeline/speciesprimer.py", line 276, in get_ncbi_links
    genomedata = collect_genomedata(taxid, email)
  File "/pipeline/speciesprimer.py", line 223, in collect_genomedata
    assembly_records = Entrez.read(assembly_efetch)
  File "/usr/local/lib/python3.5/dist-packages/Bio/Entrez/__init__.py", line 478, in read
    record = handler.read(handle)
  File "/usr/local/lib/python3.5/dist-packages/Bio/Entrez/Parser.py", line 334, in read
    self.parser.ParseFile(handle)
  File "../Modules/pyexpat.c", line 414, in StartElement
  File "/usr/local/lib/python3.5/dist-packages/Bio/Entrez/Parser.py", line 571, in startElementHandler
    raise ValidationError(tag)
Bio.Entrez.Parser.ValidationError: Failed to find tag 'AnnotRptUrl' in the DTD. To skip all tags that are not represented in the DTD, please call Bio.Entrez.read or Bio.Entrez.parse with validate=False.

This is the command that runs the Container. It's a Singularity container built directly from the Docker container:

singularity exec \
 --bind <PathToLocalDB>:/blastdb \
 --bind <PathToLocalDir>:/primerdesign \
 --writable-tmpfs \
 speciesprimer_v2.1.2.sif \
 /pipeline/speciesprimer.py --target Cutibacterium_avidum  --email  <Mail> --assemblylevel complete

The error occurs here:

assembly_efetch = Entrez.efetch(
    db="assembly",
    id=uidlist,
    rettype="docsum",
    retmode="xml")

assembly_records = Entrez.read(assembly_efetch)

Do you know if there were any changes in the NCBI databases that cause this error? Is there a way to fix it?
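
The ValidationError in the log itself suggests passing validate=False to Bio.Entrez.read. Below is a minimal sketch of that change, assuming the record returned by efetch is otherwise well formed. The helper name read_without_validation is hypothetical and not part of speciesprimer. The Entrez module is passed in as a parameter so the snippet can be exercised with a stand-in and without network access. Whether this alone is enough for the assembly database is not established by the log.

```python
def read_without_validation(entrez, uidlist):
    """Fetch assembly docsums and parse them without DTD validation.

    `entrez` is expected to be the Bio.Entrez module; it is injected so
    the function can be tried with a stand-in object. Hypothetical
    helper, not part of the speciesprimer pipeline.
    """
    handle = entrez.efetch(
        db="assembly",
        id=uidlist,
        rettype="docsum",
        retmode="xml")
    try:
        # validate=False skips tags that are missing from the DTD
        # (such as 'AnnotRptUrl') instead of raising ValidationError.
        return entrez.read(handle, validate=False)
    finally:
        handle.close()
```

With Biopython installed, this would be called as read_without_validation(Entrez, uidlist) after "from Bio import Entrez".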

biologger commented 8 months ago

Hi,

I am currently on vacation.

I am not aware of a breaking change of the NCBI Entrez Db.

Could you check whether the command works on a machine outside the HPC, to make sure it's not a connectivity issue?

Keep me updated and I will look into this issue when I am back to work.

fbartusch commented 8 months ago

Hi, thanks for the fast answer.

I cannot use the container on my local machine as I don't have enough space for downloading blastdb. But I'm testing the command line E-utilities on my machine. I logged the UIDs of Cutibacterium_avidum assemblies that are found at this line in the code.

Then I try to get one assembly, but it returns HTTP/1.1 400 Bad Request. I get the same error if I use it on the cluster. Other interactions with the assembly db are working (einfo, esearch obviously).

efetch -email felix.bartusch@uni-tuebingen.de -db assembly -id 21567681

Indeed there will be changes in the assembly database in the near future.

https://ncbiinsights.ncbi.nlm.nih.gov/2023/10/18/ncbi-datasets-access-sequence-data/
Read the "Important Update" box on top: https://www.ncbi.nlm.nih.gov/assembly/help/

But I don't know if this has effects on the utilities used in speciesprimer. I will write a mail to NCBI and ask if they know anything.

biologger commented 8 months ago

Thanks for looking into that, then it seems indeed to be a problem with the entrez query/database.

A quick workaround could be to download the assemblies for the species using ncbi-genome-download and then run speciesprimer with the --offline option.

Example command:

'ncbi-genome-download --genera "Cutibacterium avidum" --assembly-levels complete --formats fasta --output-folder /primerdesign/Cutibacterium_avidum/fna_files --flat-output bacteria'
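
For other targets, the example command can be assembled programmatically. This is a rough sketch that only reuses the flags shown in the command above. The function name genome_download_command and its parameters are hypothetical, and the output path assumes the same /primerdesign/<species>/fna_files layout as the example.

```python
import shlex


def genome_download_command(species, primerdesign_dir="/primerdesign",
                            subdir="fna_files", assembly_level="complete"):
    """Build the ncbi-genome-download argument list for one species.

    Hypothetical helper; flags are copied from the example command.
    """
    # The target directory uses underscores instead of spaces.
    target = species.replace(" ", "_")
    output_folder = "{}/{}/{}".format(
        primerdesign_dir.rstrip("/"), target, subdir)
    return [
        "ncbi-genome-download",
        "--genera", species,
        "--assembly-levels", assembly_level,
        "--formats", "fasta",
        "--output-folder", output_folder,
        "--flat-output",
        "bacteria",
    ]


command = genome_download_command("Cutibacterium avidum")
# Print a shell-safe version of the command for inspection.
print(" ".join(shlex.quote(part) for part in command))
```

The list can be handed to subprocess.run(command, check=True) in an environment where ncbi-genome-download is installed.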

fbartusch commented 7 months ago

Here is the answer from the NCBI:

Dear Colleague,

Technically, assembly is not a supported database for the E-Utilities: https://www.ncbi.nlm.nih.gov/books/NBK25497/table/chapter2.T._entrez_unique_identifiers_ui/?report=objectonly. So if it was working (which it probably should not have) there won't be any traction to fix it.

You can still make an esummary call (even though it is not technically supported): https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=assembly&id=21567681

Alternatively, consider adapting your methods to use the more modern API system in NCBI Datasets: https://ncbi.nlm.nih.gov/datasets/genome/GCF_036907375.1/

The datasets command line tool and curl hotlinks are there, but you can also call the various endpoints in your language of interest should you choose to.

Best regards,

User Services NCBI | NLM | NIH

For me that's very surprising and unexpected. Last week I started testing the dataset tools. I was able to get a list of complete assemblies in JSON and extract AssemblyAccession, AssemblyName, and AssemblyStatus, as collect_genomedata in the pipeline does. But the JSON didn't contain any FtpPath_RefSeq. Maybe the dataset tool is able to retrieve the assembly via its accession and does not need the FtpPath_RefSeq.
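
The esummary call mentioned in the NCBI reply can be reproduced with the Python standard library alone. The sketch below only builds the same URL and fetches the raw response. The function names esummary_url and fetch_esummary are hypothetical, and since NCBI describes the call as not technically supported, it may stop working.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esummary_url(uids, db="assembly"):
    """Return the esummary URL for a list of UIDs (hypothetical helper)."""
    # Several UIDs are sent as one comma-separated 'id' parameter;
    # safe="," keeps the commas unescaped in the query string.
    query = urlencode(
        {"db": db, "id": ",".join(str(uid) for uid in uids)}, safe=",")
    return "{}/esummary.fcgi?{}".format(EUTILS_BASE, query)


def fetch_esummary(uids, db="assembly", timeout=30):
    """Download the raw esummary response as text (needs network access)."""
    with urlopen(esummary_url(uids, db), timeout=timeout) as response:
        return response.read().decode("utf-8")


print(esummary_url([21567681]))
```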

biologger commented 7 months ago

Hi, thanks again for looking into the details.

The answer from NCBI is indeed very surprising. The assembly database can be searched and summarized pretty well, according to their own examples at least (e.g. in https://www.ncbi.nlm.nih.gov/books/NBK565821/) and it is still working on my personal setup.

However, if you are already looking into NCBI Datasets, you may want to use it, as it seems unbelievably fast for downloading genomes compared to the FTP method. As far as I can see, you can filter by AssemblyStatus already in the search and directly download the taxon genomes. Alternatively, you should be able to supply a list of AssemblyAccessions to download the genomes.
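
The two download routes described above (by taxon with an assembly-level filter, or by a list of accessions) could look roughly like this with the datasets command line tool. This is a sketch under assumptions: the subcommands and the --assembly-level and --filename flags are given as I understand the datasets CLI and should be checked against "datasets download genome --help" for the installed version. The function names and the default archive name genomes.zip are hypothetical.

```python
def datasets_download_by_taxon(taxon, assembly_level="complete",
                               filename="genomes.zip"):
    """Argument list for downloading all genomes of a taxon.

    Assumed CLI shape; verify flags against the installed datasets tool.
    """
    return [
        "datasets", "download", "genome", "taxon", taxon,
        "--assembly-level", assembly_level,
        "--filename", filename,
    ]


def datasets_download_by_accession(accessions, filename="genomes.zip"):
    """Argument list for downloading genomes by assembly accession."""
    return (["datasets", "download", "genome", "accession"]
            + list(accessions)
            + ["--filename", filename])


print(datasets_download_by_taxon("Cutibacterium avidum"))
print(datasets_download_by_accession(["GCF_036907375.1"]))
```

Either list can be passed to subprocess.run(..., check=True) where the datasets tool is on the PATH.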

fbartusch commented 7 months ago

I ran the example (https://www.ncbi.nlm.nih.gov/books/NBK565821/) with all the pipes and it worked. But if you just run efetch on a set of IDs, it stops working. And that is what your pipeline does :( This behavior is really strange.

I tried to install the Python package ncbi-datasets-pylib in the container. But there is the usual Python version mismatch. ncbi-datasets-pylib needs Python >= 3.7, but the container has Python 3.5.2. I tried to add ncbi-datasets-pylib to the speciesprimerdeps container in order to rebuild the whole thing, but micromamba ran for 2 days trying to solve the environment. Then I cancelled the build.
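
The version mismatch can be checked before attempting an install. A small sketch, assuming only the two numbers quoted above (a minimum of Python 3.7 and 3.5.2 in the container); the helper meets_minimum is hypothetical.

```python
import sys


def meets_minimum(version, minimum=(3, 7)):
    """Return True if a 'major.minor[.patch]' string satisfies the minimum."""
    major_minor = tuple(int(part) for part in version.split(".")[:2])
    return major_minor >= minimum


# Version of the interpreter running this snippet.
running = "{}.{}.{}".format(*sys.version_info[:3])
print(running, meets_minimum(running))
# The container's interpreter, as reported above.
print("3.5.2", meets_minimum("3.5.2"))
```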

Did you see the information on NCBI's website: https://www.ncbi.nlm.nih.gov/assembly/GCF_000008865.2/?shouldredirect=false

Important Update Effective June 2024, NCBI's Assembly resource will no longer be available. NCBI Assembly data can now be found on the NCBI Datasets genome pages. Learn more.

biologger commented 7 months ago

I will look into it and try to provide a running container including NCBI datasets. The problem is that rebuilding the entire container now, five years later, will probably break most of the dependencies.

biologger commented 7 months ago

By the way, here is a workaround for the Python version problem for now: you can download the genomes in any environment/container and save them in a folder named after the species (underscores instead of spaces)/genomic_fna, and then mount the parent directory as the primerdesign directory.

e.g. local_dir/primerdesign/Cutibacterium_avidum/genomic_fna

"--bind local_dir/primerdesign:/primerdesign"

and then run speciesprimer with the offline option.
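
The folder layout for this workaround can be prepared with a few lines of Python. A minimal sketch, assuming the layout described above (local_dir/primerdesign/<species with underscores>/genomic_fna); the function name prepare_offline_layout is hypothetical.

```python
from pathlib import Path


def prepare_offline_layout(local_dir, species, subdir="genomic_fna"):
    """Create the genome folder and return it with the matching bind value.

    Hypothetical helper; the genome files still have to be copied into
    the returned folder by hand or by a download tool.
    """
    primerdesign = Path(local_dir) / "primerdesign"
    # The species folder uses underscores instead of spaces.
    genome_dir = primerdesign / species.replace(" ", "_") / subdir
    genome_dir.mkdir(parents=True, exist_ok=True)
    # Value for the container's --bind option.
    bind = "{}:/primerdesign".format(primerdesign)
    return genome_dir, bind
```

Calling prepare_offline_layout("local_dir", "Cutibacterium avidum") creates local_dir/primerdesign/Cutibacterium_avidum/genomic_fna and returns the bind value to pass to the container.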

AhmedElsherbini commented 7 months ago

Hi @biologger

Thank you for your help and response and for developing this tool (which I believe is very unique).

I am the user @fbartusch mentioned he is helping.

Just a side update

Last week I succeeded in getting results by using the Docker container on an external PC running Ubuntu 22.04, with the ref_prok_rep_genomes DB, in the terminal-based mode (sudo docker exec -it speciesprimer bash). The web-based mode did not work, for a reason I do not know.

It is important to mention, though, that Felix's container was better/more standardized for me, as it runs on the HPC, uses the big database, and is used by other users on our HPC who may have similar questions.

Best, Ahmed

biologger commented 7 months ago

Hi

@AhmedElsherbini thanks for the update! I am happy to hear you got your results.

Regarding the "web-based mode", I will end the support for the GUI based speciesprimer as it did not age that well and I did not get support/funding for maintaining this repo during the last years. However, as long as the command line version is working in docker containers, I'm willing to provide support and updates for now, e.g. for the changed database structure of NCBI.

@fbartusch if other users want to use speciesprimer...

I have updated the dependencies and included ncbi datasets-cli. The Docker container with the updated dependencies can be pulled with docker pull biologger/speciesprimerdeps:v2.1.3
The updated container using datasets to download genomes/assemblies to run speciesprimer can be pulled with docker pull biologger/speciesprimer:v2.1.3

I hope this helps, and let me know if you need further support, otherwise please close the issue and feel free to open a new one if required.

fbartusch commented 6 months ago

A test job with the new container has now been running for several hours, and data was downloaded from NCBI. It seems to work now. Thank you very much for fixing this issue!

biologger / speciesprimer

Error during Entrez.efetch #27