AstrobioMike / bit

Bioinformatics Tools
GNU General Public License v3.0

Batch downloading fasta files from a list of Accession# #3

Closed by kwnamhang 4 years ago

kwnamhang commented 4 years ago

Hi Mike,

Thanks for making this tool - exactly what I'm looking for.

I'm trying to batch download fasta files from a list of Genbank Accessions that I've made as .csv file.

However, when I try to run the command, I get the following:

~/Downloads$ bit-dl-ncbi-assemblies -w list.csv -f fasta -j 10

Targeting 157 genomes in fasta format.

Downloading ncbi assembly summaries to be able to construct ftp links...

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:--  0:00:30 --:--:--     0
curl: (28) Connection timed out after 30000 milliseconds
Warning: Transient problem: timeout Will retry in 1 seconds. 10 retries left.
  0     0    0     0    0     0      0      0 --:--:--  0:00:30 --:--:--     0
curl: (28) Connection timed out after 30000 milliseconds
Warning: Transient problem: timeout Will retry in 2 seconds. 9 retries left.
  0     0    0     0    0     0      0      0 --:--:--  0:00:29 --:--:--     0
curl: (28) Connection timed out after 30000 milliseconds
Warning: Transient problem: timeout Will retry in 4 seconds. 8 retries left.
  0     0    0     0    0     0      0      0 --:--:--  0:00:30 --:--:--     0
curl: (28) Connection timed out after 30000 milliseconds
Warning: Transient problem: timeout Will retry in 8 seconds. 7 retries left.
  0     0    0     0    0     0      0      0 --:--:--  0:00:30 --:--:--     0
curl: (28) Connection timed out after 30000 milliseconds
Warning: Transient problem: timeout Will retry in 16 seconds. 6 retries left.
  0     0    0     0    0     0      0      0 --:--:--  0:00:30 --:--:--     0
curl: (28) Connection timed out after 30000 milliseconds
Warning: Transient problem: timeout Will retry in 32 seconds. 5 retries left.
  0     0    0     0    0     0      0      0 --:--:--  0:00:30 --:--:--     0
curl: (28) Connection timed out after 30000 milliseconds
Warning: Transient problem: timeout Will retry in 64 seconds. 4 retries left.
  0     0    0     0    0     0      0      0 --:--:--  0:00:29 --:--:--     0
curl: (28) Connection timed out after 30000 milliseconds
Warning: Transient problem: timeout Will retry in 128 seconds. 3 retries left.
  0     0    0     0    0     0      0      0 --:--:--  0:00:30 --:--:--     0
curl: (28) Connection timed out after 30001 milliseconds
Warning: Transient problem: timeout Will retry in 256 seconds. 2 retries left.

Is there an easy fix for this? Is it because I'm reading the accessions from a .csv rather than a .txt file? Are there any other parameters I need to set?

Many thanks for your help!

AstrobioMike commented 4 years ago

Hi there! Thanks :)

And sorry you’re having trouble! It won’t matter what your extension is (csv) so long as the file is still a single column of accessions. And the way you’re trying to run it is spot on 👍
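A quick way to confirm the "single column" condition before running the tool is sketched below. The helper name `is_single_column` is hypothetical (it is not part of bit), and it assumes that a comma, semicolon, or tab on any line means the file has more than one column.

```shell
# Hypothetical helper (not part of bit): succeed only if no line of the
# file contains a comma, semicolon, or tab, i.e. it is a single column.
is_single_column() {
    # Refuse unreadable files instead of silently reporting success.
    [ -r "$1" ] || return 2
    # Split each line on the delimiters; more than one field means extra columns.
    awk -F'[,;\t]' 'NF > 1 { bad = 1 } END { exit bad }' "$1"
}
```

For example, `is_single_column list.csv && echo "looks fine"`. If the file does have extra columns and the accessions are in the first one, `cut -d, -f1 list.csv > list.txt` would pull them out.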

It seems the problem you’re having right now is that the initial download of the reference tables isn’t working (we need those to build the links to get the genomes). It might be that your system doesn’t allow ftp transfer, which is currently how it’s trying to do it (and then would also be using that for each genome). I’ve added an option for using http instead for cases like this in another one of my programs, so I’d be happy to add that here too if it will help :)

Can you try this command and see if it works:

curl ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/assembly_summary_refseq.txt -o ncbi_RS_assembly_info.tmp

If it hangs for a while (like 20-30 seconds) and isn’t doing anything, cancel it with ctrl + c.

And then try this one and see if it works?

curl https://ftp.ncbi.nlm.nih.gov/genomes/refseq/assembly_summary_refseq.txt -o ncbi_RS_assembly_info.tmp

And let me know if that one successfully downloads the file :)
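The two manual checks above can be rolled into one small probe. This is only a sketch: `first_reachable_url` is a hypothetical helper, not something bit provides. It assumes that a header-only request (`curl --head`) finishing within 30 seconds is a good enough sign that the protocol is usable on the current network.

```shell
# Hypothetical helper: print the first URL that curl can reach, trying
# the arguments in order; return non-zero if none of them respond.
first_reachable_url() {
    for url in "$@"; do
        # --head      : request metadata only, so the large summary file is not downloaded
        # --fail      : treat HTTP error responses as failures
        # --max-time  : give up after 30 seconds, roughly the timeout in the log above
        if curl --silent --fail --head --max-time 30 --output /dev/null "$url"; then
            printf '%s\n' "$url"
            return 0
        fi
    done
    return 1
}
```

Called with the ftp:// address first and the https:// address second, it prints whichever one the network lets through, or nothing if both are blocked.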

kwnamhang commented 4 years ago

Thanks for your prompt reply!

I can confirm that the first command you gave works - downloading via ftp.

Also, I've now re-run my original command "bit-dl-ncbi-assemblies -w list.csv -f fasta -j 10" and it's now downloading fine via ftp!

As you say, it may be that my work network doesn't allow ftp transfer (based within a hospital laboratory). Seems to work fine over my home network.

I'll try the http-based command when I'm next at work. Ideally, would be great to be able to use your tool at work also.

Thanks again for your help and making this tool 👍

AstrobioMike commented 4 years ago

Oh great :)

I will add the option to download through http tomorrow and let you know when it’s in. Either way the option will be helpful to have, and hopefully it will work on your work network 🤞

kwnamhang commented 4 years ago

Hmmm... perhaps I spoke too soon... I thought it had begun downloading, but when it finished it said none of the accessions were found? I've manually checked the accessions, and they are definitely searchable on NCBI. Not sure what the issue is. Sorry to bug you!

bit-dl-ncbi-assemblies -w list.txt -f fasta -j 10

Targeting 157 genomes in fasta format.

Downloading ncbi assembly summaries to be able to construct ftp links...

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  238M  100  238M    0     0   216k      0  0:18:47  0:18:47 --:--:--  297k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 60.2M  100 60.2M    0     0   200k      0  0:05:07  0:05:07 --:--:--  154k

 ******************************* NOTICE *******************************  
      157 input accessions were not found at NCBI.

      Written to "NCBI-accessions-not-found.txt".
 ********************************************************************** 

  Remaining total targets: 0
kwnamhang commented 4 years ago

Sorry, I'm an idiot, please disregard my post above. I just realized that your tool searches only the Assembly database of NCBI. My accessions are for the Nucleotide database. Do you think it would be easy for me to modify your code to search accessions against the Nucleotide db?

Thanks so much for your help!
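This kind of mismatch can be spotted before a long download, because NCBI Assembly accessions start with `GCA_` (GenBank) or `GCF_` (RefSeq). The sketch below lists every entry that does not fit that pattern. The helper name is hypothetical, and the exact pattern, including the optional version suffix, is an assumption about what counts as an assembly accession, not something taken from bit's code.

```shell
# Hypothetical helper: print the entries of an accession list that do NOT
# look like NCBI Assembly accessions (GCA_/GCF_ + digits + optional version).
non_assembly_accessions() {
    # Strip a trailing carriage return, skip blank lines, then print every
    # line that fails the assumed assembly-accession pattern.
    awk '{ sub(/\r$/, "") } NF && $0 !~ /^GC[AF]_[0-9]+(\.[0-9]+)?$/' "$1"
}
```

If `non_assembly_accessions list.txt` prints anything, those entries belong to some other database (Nucleotide, in this case) and will be reported as not found.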

AstrobioMike commented 4 years ago

Oops, just saw your follow up.

No, unfortunately this won’t be the tool for that :/

But NCBI’s EDirect (Entrez Direct) tools can likely get the job done after some figuring. They’re super powerful, but not very user-friendly. I have a page on my site with some examples from things I’ve figured out before here: https://astrobiomike.github.io/unix/ncbi_eutils

That can hopefully help get you started :)
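As a rough starting point, here is a sketch that skips the EDirect command-line tools and calls the underlying E-utilities `efetch` endpoint directly with curl. Both function names are hypothetical. It assumes the input is a single column of Nucleotide accessions and that one request per second is slow enough to stay under NCBI's rate limits.

```shell
# Hypothetical helper: build the E-utilities efetch URL that returns one
# nucleotide record (db=nuccore) as plain-text FASTA.
efetch_fasta_url() {
    printf 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=%s&rettype=fasta&retmode=text\n' "$1"
}

# Hypothetical helper: download every accession listed in a file (one per
# line) to <accession>.fasta in the current directory.
download_nucleotide_fastas() {
    while IFS= read -r acc || [ -n "$acc" ]; do
        # Drop stray whitespace and carriage returns, skip blank lines.
        acc=$(printf '%s' "$acc" | tr -d '[:space:]')
        [ -n "$acc" ] || continue
        curl --silent --show-error --fail \
             --output "$acc.fasta" "$(efetch_fasta_url "$acc")"
        sleep 1   # assumption: one request per second is a polite pace
    done < "$1"
}
```

Running `download_nucleotide_fastas list.txt` would then write one FASTA file per accession. Any accession that curl fails to fetch gets an error message on stderr.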

AstrobioMike commented 4 years ago

Oh and I forgot about this tool that fortunately I noted at the top of that page I just linked to, maybe this can grab what you’re looking for :)

https://github.com/kblin/ncbi-acc-download