Closed JonathanAbrahams1337 closed 10 months ago
Hi Jonathan,
Thank you for bringing this to my attention and for describing the issue so clearly. I am travelling for the next two weeks but will find an evening to dig into it and try to resolve it. If RefSeq had some sort of download API I could wrap, that would probably be more robust than constructing links and/or querying files directly, but I don't think they do. So I think you are right: the solution is via the assembly_summary.txt file.
Thanks again, Jason
That would be great, thanks!
I don't think there is an easier way. Generally, when I am downloading lots of genomes, I just get the URLs from the summary file.
Here is a good tool to download genomes, but it is just doing this exact process. https://github.com/kblin/ncbi-genome-download/tree/master
https://github.com/kblin/ncbi-genome-download/blob/master/ncbi_genome_download/core.py
https://github.com/kblin/ncbi-genome-download/blob/master/ncbi_genome_download/core.py#L402
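For anyone following along, getting the URL from the summary file could be sketched roughly as below. This is a minimal illustration, not the code used by multiply or ncbi-genome-download. It assumes the summary file is tab-separated, that its column-name line starts with '#' and includes the assembly_accession and ftp_path columns, and that the genome FASTA is named after the last component of ftp_path plus _genomic.fna.gz. The function names read_summary and genome_url_from_summary are made up for this example.

```python
import csv
import posixpath
import urllib.request


def read_summary(url):
    """Download an assembly_summary.txt file and return its lines."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8").splitlines()


def genome_url_from_summary(lines, accession, suffix="_genomic.fna.gz"):
    """Return the download URL for `accession`, or None if it is not listed.

    `lines` is any iterable of text lines from an assembly_summary.txt file.
    """
    header = None
    for row in csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE):
        if not row:
            continue
        if row[0].startswith("#"):
            # The column-name line is a comment line that contains 'ftp_path';
            # any other comment line is skipped.
            names = [name.lstrip("# ").strip() for name in row]
            if "ftp_path" in names:
                header = names
            continue
        if header is None:
            raise ValueError("no header line with an 'ftp_path' column found")
        record = dict(zip(header, row))
        if record.get("assembly_accession") != accession:
            continue
        ftp_path = record.get("ftp_path", "").rstrip("/")
        if ftp_path in ("", "na"):
            return None
        # Older summary files list ftp:// paths; the same host serves https.
        if ftp_path.startswith("ftp://"):
            ftp_path = "https://" + ftp_path[len("ftp://"):]
        # Assumption: files inside the assembly directory are prefixed with
        # the directory name, e.g. <dir>/<dir>_genomic.fna.gz
        return f"{ftp_path}/{posixpath.basename(ftp_path)}{suffix}"
    return None
```

Usage would be something like genome_url_from_summary(read_summary(summary_url), "GCF_000195955.2"), where summary_url is the species-level assembly_summary.txt link. Returning None for a missing accession leaves it to the caller to decide whether that is an error.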
Hi Jonathan,
I think I have fixed this now and merged the fix into master. I followed your advice and now extract the FTP paths for RefSeq from the assembly_summary.txt files. The main changes are to the RefSeqGenomesFactory class, here:
https://github.com/JasonAHendry/multiply/blob/master/src/multiply/download/genomes.py
As it stands, this slows down the API a bit because the Genome dataclasses get created every time you run multiply, and for those using RefSeq this may now involve downloading these assembly_summary.txt files. I am pondering the nicest way to fix this, but wanted to merge now so you could play with it.
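One possible way to avoid the repeated download, sketched here only as an idea and not as what multiply does, is to keep a local copy of each summary file and refresh it once it is older than some threshold. The function name cached_summary_path, the cache directory, and the seven-day default are all assumptions made for this example.

```python
import time
import urllib.request
from pathlib import Path


def cached_summary_path(url, cache_dir, max_age_days=7.0,
                        fetch=urllib.request.urlretrieve):
    """Return a local path to the summary file, downloading only when needed.

    `fetch` is called as fetch(url, filename); it can be replaced for testing.
    """
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)

    # Name the cached file after the last two URL components, e.g.
    # Mycobacterium_tuberculosis_assembly_summary.txt
    parts = url.rstrip("/").split("/")
    path = cache_dir / "_".join(parts[-2:])

    if path.exists():
        age_days = (time.time() - path.stat().st_mtime) / 86400
        if age_days <= max_age_days:
            # The cached copy is fresh enough, so skip the download.
            return path

    # Missing or stale: download and overwrite the cached copy.
    fetch(url, str(path))
    return path
```

With something like this, the Genome dataclasses could be built from the cached file on most runs, and the network would only be touched when the cache is missing or stale.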
I added a toy design file for TB, here:
https://github.com/JasonAHendry/multiply/blob/master/designs/tb-amr.ini
This ran OK for me. Let me know if you hit any further issues, and I'd also be happy to jump on a quick call if you'd like any tips / direction on how to run multiply.
HTH, Jason
Hi,
This looks like a really promising tool, and I am excited to use it. Unfortunately, I have stumbled at the first hurdle!
I am looking to create a multiplex PCR design for Mycobacterium tuberculosis, but I was not able to download the reference genome.
My collection.ini file has this appended to it, which seems to be fine.
When running
multiply download -g MycobacteriumTuberculosis
the URL
https://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/Mycobacterium_tuberculosis/all_assembly_versions/GCF_000195955.2/GCF_000195955.2_genomic.fna.gz
is tried, but this leads to a 404. The following README explains that species with more than 1000 uploaded assemblies follow a different path for download:
https://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/Mycobacterium_tuberculosis/all_assembly_versions/README.txt
This leads you to the assembly_summary.txt file. I imagine a solution to this problem would be to fetch this file, look up the assembly name, and download from the path listed there:
https://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/Mycobacterium_tuberculosis/assembly_summary.txt
Jonathan