JasonAHendry / multiply

Multiplex PCR design, in silico
MIT License

Unexpected URL when downloading bacterial genomes #3

Closed: JonathanAbrahams1337 closed this issue 10 months ago

JonathanAbrahams1337 commented 10 months ago

Hi,

This looks like a really promising tool, and I am excited to use it. Unfortunately, I have stumbled at the first hurdle!

I am looking to create a multiplex PCR design for Mycobacterium tuberculosis, but I was not able to download the reference genome.

My collection.ini file has this appended to it, which seems to be fine.

[MycobacteriumTuberculosis]
source = refseq
clade = bacteria
genus = mycobacterium
species = tuberculosis
assembly = GCF_000195955.2

When I run multiply download -g MycobacteriumTuberculosis, it tries the URL https://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/Mycobacterium_tuberculosis/all_assembly_versions/GCF_000195955.2/GCF_000195955.2_genomic.fna.gz, which returns a 404. The following README explains that genomes with more than 1000 uploaded assemblies follow a different download path: https://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/Mycobacterium_tuberculosis/all_assembly_versions/README.txt

The README points to the file below. I imagine a solution to this problem would be to fetch this file, look up the assembly in it, and download from the path it lists.

https://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/Mycobacterium_tuberculosis/assembly_summary.txt
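
A minimal sketch of that lookup in Python, not multiply's actual code. It assumes the summary file is tab-separated, that its last `#` comment line names the columns, and that those columns include `assembly_accession` and `ftp_path`. It also assumes the FASTA sits at `<ftp_path>/<last path component>_genomic.fna.gz`. The function name `find_fasta_url` is hypothetical.

```python
import csv
import io


def find_fasta_url(summary_text, accession):
    """Return the assumed genomic FASTA URL for `accession`, or None if not found."""
    header = None
    reader = csv.reader(io.StringIO(summary_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row:
            continue
        if row[0].startswith("#"):
            # Assumption: the last comment line before the data holds the column names.
            header = [column.lstrip("#").strip() for column in row]
            continue
        if header is None:
            raise ValueError("no '#'-prefixed header line found before the data rows")
        record = dict(zip(header, row))
        if record.get("assembly_accession") != accession:
            continue
        ftp_path = record.get("ftp_path", "na").rstrip("/")
        if ftp_path in ("", "na"):
            # No download location is listed for this assembly.
            return None
        # Assumption: files in the directory are prefixed with the directory's own name.
        basename = ftp_path.rsplit("/", 1)[-1]
        return f"{ftp_path}/{basename}_genomic.fna.gz"
    return None
```

Fetching the summary text itself (for example with `urllib.request.urlopen` on the URL above) is left out so that the lookup can be checked offline against a few pasted rows.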

Jonathan

JasonAHendry commented 10 months ago

Hi Jonathan,

Thank you for bringing this to my attention and for describing the issue so clearly. I am travelling for the next two weeks but will find an evening to dig into it and try to resolve it. If RefSeq had some sort of download API I could wrap, that would probably be more robust than constructing links and/or querying files directly, but I don't think it does. So I think you are right: the solution is via the assembly_summary.txt file.

Thanks again, Jason

JonathanAbrahams1337 commented 10 months ago

That would be great, thanks!

I don't think there is an easier way. Generally, when I am downloading lots of genomes, I just get the URLs from the summary file.

Here is a good tool for downloading genomes, though it follows this exact process: https://github.com/kblin/ncbi-genome-download/tree/master

https://github.com/kblin/ncbi-genome-download/blob/master/ncbi_genome_download/core.py

https://github.com/kblin/ncbi-genome-download/blob/master/ncbi_genome_download/core.py#L402

JasonAHendry commented 10 months ago

Hi Jonathan,

I think I have fixed this now and merged it into master. I followed your advice and now extract the RefSeq FTP paths from the assembly_summary.txt files. The main changes are to the RefSeqGenomesFactory class, here:

https://github.com/JasonAHendry/multiply/blob/master/src/multiply/download/genomes.py

As it stands, this slows down the API a bit because the Genome dataclasses get created every time you run multiply, and for those using RefSeq this may now involve downloading these assembly_summary.txt files. I am pondering the nicest way to fix this, but wanted to merge now so you could try it out.
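
One possible way to avoid the repeated download, sketched in Python under assumptions: keep a local copy of each summary file and only re-fetch it once it is older than some cutoff. This is not how multiply handles it; `cached_summary`, the cache path, and the one-week default are all hypothetical.

```python
import time
import urllib.request
from pathlib import Path


def _download(url):
    """Fetch `url` and return its body as text."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8")


def cached_summary(url, cache_path, max_age_seconds=7 * 24 * 3600, fetch=_download):
    """Return the summary text, reusing a local copy while it is fresh enough."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        age = time.time() - cache_path.stat().st_mtime
        if age < max_age_seconds:
            # Cached copy is recent: skip the network entirely.
            return cache_path.read_text(encoding="utf-8")
    # Missing or stale: fetch again and overwrite the cached copy.
    text = fetch(url)
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    cache_path.write_text(text, encoding="utf-8")
    return text
```

The `fetch` parameter exists only so the caching logic can be exercised without network access. The trade-off is that a newly released assembly would not be visible until the cached copy expires.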

I added a toy design file for TB, here:

https://github.com/JasonAHendry/multiply/blob/master/designs/tb-amr.ini

This ran OK for me. Let me know if you hit any further issues, and I'd also be happy to jump on a quick call if you'd like any tips / direction on how to run multiply.

HTH, Jason