flass / pantagruel

a pipeline for reconciliation of phylogenetic histories within a bacterial pangenome
GNU General Public License v3.0

NCBI manually download genomes #24

Closed mattbawn closed 4 years ago

mattbawn commented 4 years ago

I'm running Pantagruel on a closed cluster. I have manually downloaded the NCBI genomes as indicated in pantagruel_pipeline_00_fetch_data.sh

using txid90371[Organism:exp] AND ("latest refseq"[filter] AND all[filter] NOT anomalous[filter]) AND ("complete genome"[filter])

and have a file R134_Pantagruel/NCBI/Taxonomy_2019-10-17/genome_assemblies.tar

However, after running: pantagruel -i database/environ_pantagruel_database.sh all

I get:

This is Pantagruel pipeline version 76858aaa0b4189a60271b5eb786d924fb8d6441b using source code from repository '/opt/software/pantagruel'
# will run tasks: 0 1 2 3 4 5 6 7 8 9
[2019-10-23 08:59:48] Pantagrel pipeline task 0: fetch public genome data from NCBI sequence databases and annotate private genomes.
Create new task folder '/nbi/Research-Groups/IFR/Rob-Kingsley/R134_Pantagruel/database/00.input_data'
[2019-10-23 08:59:48] did not find the relevant taxonomy flat files in '/nbi/Research-Groups/IFR/Rob-Kingsley/R134_Pantagruel/NCBI/Taxonomy_2019-10-17/'; download the from NCBI Taxonomy FTP

I have also tried to extract the tarball, but it is still not seen.

What am I doing wrong?

Thanks,

Matt

flass commented 4 years ago

Hi Matt,

It is hard to know what is going wrong just from the info above. I would need you to give me the command you used to initiate the database (pantagruel ... init) so that I know where the program expects to find the genome archive. The genome_assemblies.tar archive file should be placed directly in the folder designated by the -A option (see the doc via pantagruel -h). You seem to have placed it in the folder that would normally receive the files downloaded from the NCBI Taxonomy FTP site, so I suspect it's not the right one.

I am surprised you don't get any error despite having no genomes available in any folder. In fact, it seems your run of pantagruel is stuck in the part of task 00 that downloads the NCBI Taxonomy files. It might be that you cannot access the FTP site and that the call just hangs. You may want to test access to the NCBI FTP site using the following command:

lftp -u anonymous,your@email ftp.ncbi.nlm.nih.gov

If that is the problem, you may want to download the Taxonomy files by hand from the NCBI Taxonomy website and provide them to pantagruel through the option -T.
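If lftp is not installed, or you want a check that cannot hang, something like the following may help. It is only a sketch: `port_open` is a hypothetical helper name (not part of pantagruel), and it assumes that bash with /dev/tcp support and the coreutils `timeout` command are available on the cluster. Port 21 is the standard FTP control port.

```shell
# Return 0 if a TCP connection to HOST:PORT succeeds within SECS seconds.
# Assumes bash (for the /dev/tcp pseudo-device) and coreutils `timeout`.
port_open() {
    host=$1
    port=${2:-21}   # default: FTP control port
    secs=${3:-10}   # give up after this many seconds instead of hanging
    # Inside `bash -c`, $0 is the host and $1 is the port.
    timeout "$secs" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}

# Example (not run here):
#   port_open ftp.ncbi.nlm.nih.gov 21 10 && echo "reachable" || echo "not reachable"
```

A non-zero return only says that the connection could not be opened in time; it does not tell you whether the cause is a firewall, a missing route or DNS.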

I hope this helps. Florent

mattbawn commented 4 years ago

Hi Florent,

Thanks for your reply. The cluster I use is not able to connect to the internet which is why I manually downloaded the genomes.

I initiated the database with:

pantagruel -d database -r . -A genomes/ init

I have tried both -A and -a

and then tried:

pantagruel -i database/environ_pantagruel_database.sh all

and

pantagruel -i database/environ_pantagruel_database.sh all -T /nbi/Research-Groups/IFR/Rob-Kingsley/R134_Pantagruel/NCBI/ncbi-genomes-2019-10-23

I also tried using -T at the init command.

I have also tried moving the genome_assemblies.tar file around, but it never seems to be recognised. I have tried unpacking it too, but still get:

did not find the relevant taxonomy flat files in '/nbi/Research-Groups/IFR/Rob-Kingsley/R134_Pantagruel/genomes/ncbi-genomes-2019-10-23/'; download the from NCBI Taxonomy FTP

Thanks,

Matt

flass commented 4 years ago

Hi Matt,

Sorry, I was not clear:

I think the bug you experience has nothing to do with your genome file location.

In that respect, using the option -A genomes/ to indicate that your archive genome_assemblies.tar is in the folder genomes/ is correct. Do not pre-extract the archive; or, if you do, put the individual assembly folders (those located under the extracted folder ncbi-genomes-2019-10-23/) directly into the target folder genomes/. This is all indicated in the help page.
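If the archive has already been extracted, moving the assembly folders up one level could be sketched as below. `flatten_assemblies` is a hypothetical helper name, not something pantagruel provides, and the paths in the usage comment are just the ones from this thread.

```shell
# Move every sub-folder of an extracted NCBI download folder into TARGET.
# Plain files (e.g. README files) are left where they are.
flatten_assemblies() {
    extracted=$1   # e.g. genomes/ncbi-genomes-2019-10-23
    target=$2      # e.g. genomes
    for d in "$extracted"/*/; do
        [ -d "$d" ] || continue      # skip the literal pattern if nothing matched
        mv "${d%/}" "$target"/       # strip the trailing slash before moving
    done
}

# Example (not run here):
#   flatten_assemblies genomes/ncbi-genomes-2019-10-23 genomes
```

Leaving the archive unextracted in genomes/ avoids this step altogether.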

Your problem comes from the fact that your server does not connect to the internet, so it cannot reach the NCBI Taxonomy FTP server and download the taxonomy files. I edited the script so that it no longer waits for the connection forever and instead returns an informative error message if it could not connect (commit 224bb9b). To overcome your lack of connection, you can provide those files by downloading them manually, as the error message would indicate in your case (assuming you do it today, so that the dated folder is named Taxonomy_2019-11-06/):

ERROR: could not download the NCBI Taxonomy files; please check your network connection or download manually the files 'taxcat.tar.gz* taxcat_readme.txt taxdump.tar.gz* taxdump_readme.txt' from the FTP site ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy into folder 'R134_Pantagruel/NCBI/Taxonomy_2019-11-06/' ; exit now
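Before re-running the pipeline, it may be worth checking that the files named in this message are actually in the folder. The sketch below uses a hypothetical helper, `missing_taxonomy_files`, and assumes the patterns 'taxcat.tar.gz*' and 'taxdump.tar.gz*' stand for each archive plus its .md5 file; whether pantagruel strictly needs all six is not confirmed here.

```shell
# Print the name of each expected NCBI Taxonomy file that is absent from DIR.
# Prints nothing when all six files are present.
missing_taxonomy_files() {
    dir=$1
    for f in taxcat.tar.gz taxcat.tar.gz.md5 taxcat_readme.txt \
             taxdump.tar.gz taxdump.tar.gz.md5 taxdump_readme.txt; do
        [ -e "$dir/$f" ] || printf '%s\n' "$f"
    done
}

# Example (not run here):
#   missing_taxonomy_files R134_Pantagruel/NCBI/Taxonomy_2019-11-06
```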

So in summary, I suggest you do the following:

1) place your genome archive genome_assemblies.tar in the folder of your choice, e.g. genomes/
2) from NCBI Taxonomy FTP ftp://ftp.ncbi.nih.gov/pub/taxonomy/, download the files taxcat.tar.gz taxcat.tar.gz.md5 taxcat_readme.txt taxdump.tar.gz taxdump.tar.gz.md5 taxdump_readme.txt into the folder of your choice, e.g. ncbi_taxonomy_2019-11-06/
3) run the init command: pantagruel -d database -r . -A genomes -T ncbi_taxonomy_2019-11-06 -I your.email@ac.uk init
4) and then running the rest should work fine with: pantagruel -i database/environ_pantagruel_database.sh all
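For the download step, the six files have to be fetched on a machine that does have internet access and then copied to the cluster. A minimal sketch, assuming `curl` is available on that machine: `taxonomy_urls` is a hypothetical helper that only prints the URLs, so the download tool is up to you.

```shell
# Print one URL per NCBI Taxonomy file listed above.
# BASE defaults to the FTP location given in this thread.
taxonomy_urls() {
    base=${1:-ftp://ftp.ncbi.nih.gov/pub/taxonomy}
    for f in taxcat.tar.gz taxcat.tar.gz.md5 taxcat_readme.txt \
             taxdump.tar.gz taxdump.tar.gz.md5 taxdump_readme.txt; do
        printf '%s/%s\n' "${base%/}" "$f"
    done
}

# Example, on the connected machine (not run here):
#   mkdir -p ncbi_taxonomy_2019-11-06 && cd ncbi_taxonomy_2019-11-06
#   taxonomy_urls | xargs -n 1 curl -O
# then copy the folder to the cluster (e.g. with scp or rsync) and pass it via -T.
```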