Closed fbartusch closed 6 months ago
Hi,
I am currently on vacation.
I am not aware of a breaking change of the NCBI Entrez Db.
Could you check whether the command works on a machine outside the HPC, to make sure it's not a connectivity issue?
Keep me updated and I will look into this issue when I am back to work.
Hi, thanks for the fast answer.
I cannot use the container on my local machine as I don't have enough space for downloading blastdb. But I'm testing the command line E-utilities on my machine. I logged the UIDs of Cutibacterium_avidum assemblies that are found at this line in the code.
Then I try to get one assembly, but it returns HTTP/1.1 400 Bad Request. I get the same error if I use it on the cluster.
Other interactions with the assembly db are working (einfo, esearch obviously).
efetch -email felix.bartusch@uni-tuebingen.de -db assembly -id 21567681
Indeed there will be changes in the assembly database in the near future.
But I don't know if this has effects on the utilities used in speciesprimer. I will write a mail to NCBI and ask if they know anything.
Thanks for looking into that, then it seems indeed to be a problem with the entrez query/database.
A quick workaround could be to download the assemblies for the species using ncbi-genome-download and then run speciesprimer with the --offline option.
Example command:
'ncbi-genome-download --genera "Cutibacterium avidum" --assembly-levels complete --formats fasta --output-folder /primerdesign/Cutibacterium_avidum/fna_files --flat-output bacteria'
Here is the answer from the NCBI:
Dear Colleague,
Technically, assembly is not a supported database for the E-Utilities: https://www.ncbi.nlm.nih.gov/books/NBK25497/table/chapter2.T._entrez_unique_identifiers_ui/?report=objectonly. So if it was working (which it probably should not have) there won't be any traction to fix it.
You can still make an esummary call (even though it is not technically supported): https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=assembly&id=21567681
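A minimal Python sketch of the esummary call suggested above, in case it helps to script it. The helper names are made up for illustration. The lowercase field names and the result/<uid> layout of the JSON are assumptions based on how esummary usually answers with retmode=json, so please check them against a real response before relying on them.

```python
import json
import urllib.parse
import urllib.request

ESUMMARY_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi"


def build_esummary_url(uid, email=None):
    """Build an esummary URL for one assembly UID, asking for JSON output."""
    params = {"db": "assembly", "id": str(uid), "retmode": "json"}
    if email:
        params["email"] = email
    return ESUMMARY_URL + "?" + urllib.parse.urlencode(params)


def extract_assembly_fields(payload, uid):
    """Pull the fields of interest out of a decoded esummary payload.

    Assumption: records sit under payload["result"][<uid>] with lowercase
    keys. Keys that are missing come back as None instead of raising.
    """
    record = payload.get("result", {}).get(str(uid), {})
    wanted = ("assemblyaccession", "assemblyname",
              "assemblystatus", "ftppath_refseq")
    return {key: record.get(key) for key in wanted}


def fetch_assembly_summary(uid, email=None, timeout=30):
    """Network call: download and decode the summary for one assembly UID."""
    url = build_esummary_url(uid, email)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        payload = json.load(response)
    return extract_assembly_fields(payload, uid)
```

Calling fetch_assembly_summary(21567681) would query the same record as the URL in the NCBI answer. Since NCBI describes this as not technically supported, it is probably best treated as a stopgap only.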
Alternatively, consider adapting your methods to use the more modern API system in NCBI Datasets: https://ncbi.nlm.nih.gov/datasets/genome/GCF_036907375.1/
The datasets command line tool and curl hotlinks are there, but you can also call the various endpoints in your language of interest should you choose to.
Best regards,
User Services NCBI | NLM | NIH
For me that's very surprising and unexpected. Last week I started testing the dataset tools. I was able to get a list of complete assemblies in JSON and extract AssemblyAccession, AssemblyName, and AssemblyStatus like the collect_genomedata in the pipeline does. But the JSON didn't contain any FtpPath_RefSeq. But maybe the dataset tool is able to retrieve the assembly via accession and does not need the FtpPath_RefSeq.
Hi, thanks again for looking into the details.
The answer from NCBI is indeed very surprising. The assembly database can be searched and summarized pretty well, according to their own examples at least (e.g. in https://www.ncbi.nlm.nih.gov/books/NBK565821/) and it is still working on my personal setup.
However, if you are already looking into NCBI Datasets, you may want to use it, as it seems unbelievably fast for downloading genomes compared to the FTP method. As far as I can see, you can already filter by AssemblyStatus in the search and directly download the taxon genomes; alternatively, you should be able to pass a list of AssemblyAccessions to download the genomes.
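To make the two download routes a bit more concrete, here is a rough Python sketch that only assembles the datasets command lines and hands them to subprocess. The function names are hypothetical, and the flags (--include genome, --assembly-level, --filename) are the ones I know from recent versions of the datasets CLI; older versions may differ, so compare with the output of datasets download genome --help.

```python
import subprocess


def download_by_accession_cmd(accessions, zip_path="ncbi_dataset.zip"):
    """Return the argv list for downloading genome FASTA files by accession."""
    if not accessions:
        raise ValueError("at least one assembly accession is required")
    return (["datasets", "download", "genome", "accession"]
            + list(accessions)
            + ["--include", "genome", "--filename", zip_path])


def download_by_taxon_cmd(taxon, assembly_level="complete",
                          zip_path="ncbi_dataset.zip"):
    """Return the argv list for downloading all genomes of one taxon,
    filtered by assembly level during the search itself."""
    return ["datasets", "download", "genome", "taxon", taxon,
            "--assembly-level", assembly_level,
            "--include", "genome", "--filename", zip_path]


def run(cmd):
    """Execute a prepared command; raises CalledProcessError on failure."""
    subprocess.run(cmd, check=True)
```

For example, run(download_by_taxon_cmd("Cutibacterium avidum")) would fetch the complete assemblies of the species into a zip archive, which still has to be unpacked before the FASTA files can be used.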
I ran the example (https://www.ncbi.nlm.nih.gov/books/NBK565821/) with all the pipes and it worked. But if you just run efetch on a set of IDs, it stops working. And that is what your pipeline does :( This behavior is really strange.
I tried to install the Python package ncbi-datasets-pylib in the container, but there is the usual Python version mismatch: ncbi-datasets-pylib needs Python >= 3.7, but the container has Python 3.5.2.
I tried to add ncbi-datasets-pylib to the speciesprimerdeps container in order to rebuild the whole thing, but micromamba ran for 2 days trying to solve the environment. Then I cancelled the build.
Did you see the information on NCBI's website: https://www.ncbi.nlm.nih.gov/assembly/GCF_000008865.2/?shouldredirect=false
Important Update Effective June 2024, NCBI's Assembly resource will no longer be available. NCBI Assembly data can now be found on the NCBI Datasets genome pages. Learn more.
I will look into it and try to provide a running container including NCBI Datasets. The problem is that rebuilding the entire container now, 5 years later, will probably break most of the dependencies.
By the way, here is a workaround for the Python version problem for now: you can download the genomes in any environment/container and save them in a folder named after the species (underscores instead of spaces)/genomic_fna, and then mount the parent directory as the primerdesign directory.
e.g. local_dir/primerdesign/Cutibacterium_avidum/genomic_fna
"--bind local_dir/primerdesign:/primerdesign"
and then run speciesprimer with the offline option.
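The folder layout for this workaround could be prepared with a few lines of Python, for example. This is only a sketch: the function names are hypothetical, and the genomic_fna subfolder name is simply taken from the example path above.

```python
import shutil
from pathlib import Path


def offline_genome_dir(primerdesign_dir, species, subdir="genomic_fna"):
    """Return <primerdesign_dir>/<Species_name>/<subdir>, with spaces in
    the species name replaced by underscores."""
    folder = species.strip().replace(" ", "_")
    return Path(primerdesign_dir) / folder / subdir


def stage_genomes(fasta_files, primerdesign_dir, species):
    """Copy already downloaded FASTA files into the offline layout."""
    target = offline_genome_dir(primerdesign_dir, species)
    target.mkdir(parents=True, exist_ok=True)
    for fasta in fasta_files:
        shutil.copy2(fasta, target / Path(fasta).name)
    return target
```

With local_dir/primerdesign as the first argument and "Cutibacterium avidum" as the species, the files end up in the directory from the example, and local_dir/primerdesign is what gets bound to /primerdesign.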
Hi @biologger
Thank you for your help and response and for developing this tool (which I believe is very unique).
I am the user whom @fbartusch mentioned he is helping.
Just a side update
Last week I succeeded in getting results by using the Docker container on an external PC with Ubuntu 22.04, the ref_prok_rep_genomes DB, and the terminal-based mode (sudo docker exec -it speciesprimer bash). The web-based mode did not work, for a reason I do not know.
Though it is important to mention that Felix's container was better/more standardized for me, as it runs on the HPC, uses the big database, and is used by other users on our HPC who may have similar questions.
Best, Ahmed
Hi
@AhmedElsherbini thanks for the update! I am happy to hear you got your results.
Regarding the "web-based mode", I will end the support for the GUI based speciesprimer as it did not age that well and I did not get support/funding for maintaining this repo during the last years. However, as long as the command line version is working in docker containers, I'm willing to provide support and updates for now, e.g. for the changed database structure of NCBI.
@fbartusch if other users want to use speciesprimer...
docker pull biologger/speciesprimerdeps:v2.1.3
docker pull biologger/speciesprimer:v2.1.3
I hope this helps, and let me know if you need further support, otherwise please close the issue and feel free to open a new one if required.
A test job with the new container has now been running for several hours, and data was downloaded from NCBI. It seems to work now. Thank you very much for fixing this issue!
One of our HPC users has problems running the Speciesprimer pipeline v2.1.2. The container worked two years ago and it's the same container today.
The error log:
This is the command that runs the Container. It's a Singularity container built directly from the Docker container:
The error occurs here
Do you know if there were any changes in the NCBI databases that cause this error? Is there a way to fix it?