salaheenz closed this issue 6 years ago
They have rearranged the file structure of all the RefSeq genomes on the ftp server. The modified paths to the necessary *.tar.gz files (which you can manually update in download_genomic_library.sh) are:
(modify line 43): $FTP_SERVER/genomes/archive/old_refseq/Bacteria/all.fna.tar.gz
(modify line 59): $FTP_SERVER/genomes/archive/old_refseq/Plasmids/plasmids.all.fna.tar.gz
These two paths will allow you to download the standard bacteria and plasmid databases, but I'm not sure about the proper updated path for downloading the viruses database.
Thanks!!
Hi, after modifying those lines, the bacterial and viral databases were downloaded, but not the plasmid, fungi (which I added separately), or human ones. Instead, it went straight to the build step. Any suggestions?
I'd love to help, but it's difficult without more specific information. If you enter the plasmid ftp path in your browser, does it take you to a list of assemblies? I'd double check that you have the proper path for the plasmids ftp directory.
I would familiarize yourself with the new NCBI ftp directory structure for these archived versions of the assemblies and see if you can locate the proper paths to the archived fungi and human assemblies.
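The browser check suggested above can also be done from the command line. The following is a minimal sketch, assuming wget is installed and that FTP_SERVER has the same value as in download_genomic_library.sh (the default below is the host that appears in the wget output quoted in this thread). The helper names old_refseq_url and check_url are hypothetical, and any path you pass in is only a guess until the check confirms it.

```shell
# Assumption: FTP_SERVER matches the value used in download_genomic_library.sh.
FTP_SERVER="${FTP_SERVER:-ftp://ftp.ncbi.nih.gov}"

# Hypothetical helper: build the archived URL for a directory/file pair,
# following the old_refseq layout given earlier in this thread.
old_refseq_url() {
  printf '%s/genomes/archive/old_refseq/%s/%s\n' "$FTP_SERVER" "$1" "$2"
}

# Hypothetical helper: ask the server whether a URL exists without
# downloading it (wget --spider), and report the result.
check_url() {
  if wget --spider -q "$1"; then
    echo "OK       $1"
  else
    echo "MISSING  $1"
    return 1
  fi
}
```

For example, check_url "$(old_refseq_url Plasmids plasmids.all.fna.tar.gz)" should print OK or MISSING for the plasmid archive. Running the same check for each path you put into the script shows which ones need correcting before you start a long download.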
I modified the script this way and ran it. No errors were reported, but I did not get the plasmid, fungi, or human databases:
case "$1" in
  "bacteria")
    mkdir -p $LIBRARY_DIR/Bacteria
    cd $LIBRARY_DIR/Bacteria
    if [ ! -e "lib.complete" ]
    then
      rm -f all.fna.tar.gz
      wget $FTP_SERVER/genomes/archive/old_refseq/Bacteria/all.fna.tar.gz
      echo -n "Unpacking..."
      tar zxf all.fna.tar.gz
      rm all.fna.tar.gz
      echo " complete."
      touch "lib.complete"
    else
      echo "Skipping download of bacterial genomes, already downloaded here."
    fi
    ;;
  "Fungi")
    mkdir -p $LIBRARY_DIR/Fungi
    cd $LIBRARY_DIR/Fungi
    if [ ! -e "lib.complete" ]
    then
      rm -f all.fna.tar.gz
      wget $FTP_SERVER/genomes/archive/old_refseq/Fungi/all.fna.tar.gz
      echo -n "Unpacking..."
      tar zxf all.fna.tar.gz
      rm all.fna.tar.gz
      echo " complete."
      touch "lib.complete"
    else
      echo "Skipping download of fungal genomes, already downloaded here."
    fi
    ;;
  "plasmids")
    mkdir -p $LIBRARY_DIR/Plasmids
    cd $LIBRARY_DIR/Plasmids
    if [ ! -e "lib.complete" ]
    then
      rm -f plasmids.all.fna.tar.gz
      wget $FTP_SERVER/genomes/archive/old_refseq/Plasmids/plasmids.all.fna.tar.gz
      echo -n "Unpacking..."
      tar zxf plasmids.all.fna.tar.gz
      rm plasmids.all.fna.tar.gz
      echo " complete."
      touch "lib.complete"
    else
      echo "Skipping download of plasmids, already downloaded here."
    fi
    ;;
  "viruses")
    mkdir -p $LIBRARY_DIR/Viruses
    cd $LIBRARY_DIR/Viruses
    if [ ! -e "lib.complete" ]
    then
      rm -f all.fna.tar.gz
      rm -f all.ffn.tar.gz
      wget $FTP_SERVER/genomes/Viruses/all.fna.tar.gz
      wget $FTP_SERVER/genomes/Viruses/all.ffn.tar.gz
      echo -n "Unpacking..."
      tar zxf all.fna.tar.gz
      tar zxf all.ffn.tar.gz
      rm all.fna.tar.gz
      rm all.ffn.tar.gz
      echo " complete."
      touch "lib.complete"
    else
      echo "Skipping download of viral genomes, already downloaded here."
    fi
    ;;
  "human")
    mkdir -p $LIBRARY_DIR/Human
    cd $LIBRARY_DIR/Human
    if [ ! -e "lib.complete" ]
    then
wget --spider --no-remove-listing $FTP_SERVER/genomes/H_sapiens/
directories=$(perl -nle '/^d/ and /(CHR_\w+)\s*$/ and print $1' .listing)
rm .listing
# For each CHR_* directory, get GRCh* fasta gzip file name, d/l, unzip, and add
for directory in $directories
do
wget --spider --no-remove-listing $FTP_SERVER/genomes/H_sapiens/$directory/
file=$(perl -nle '/^-/ and /\b(hs_ref_GRCh\w+\.fa\.gz)\s*$/ and print $1' .listing)
[ -z "$file" ] && exit 1
rm .listing
wget $FTP_SERVER/genomes/H_sapiens/$directory/$file
gunzip "$file"
bhaley@NextSeq-Server:/mnt/data/bhaley/Results/kraken_dir$ sudo ./kraken-build --standard --db /mnt/data/bhaley/Results/kraken_DB
Found jellyfish v1.1.11
Skipping download of bacterial genomes, already downloaded here.
Skipping download of viral genomes, already downloaded here.
Kraken build set to minimize disk writes.
Creating k-mer set (step 1 of 6)...
Found jellyfish v1.1.11
Hash size not specified, using '11634429519'
K-mer set created. [1h3m14.378s]
Skipping step 2, no database reduction requested.
Sorting k-mer set (step 3 of 6)...
K-mer set sorted. [4h41m27.198s]
Creating GI number to seqID map (step 4 of 6)...
GI number to seqID map created. [2m41.191s]
Creating seqID to taxID map (step 5 of 6)...
214486 sequences mapped to taxa. [40.859s]
Setting LCAs in database (step 6 of 6)...
Finished processing 214798 sequences
Database LCAs set. [2h46m41.733s]
Database construction complete. [Total: 8h34m45.604s]
bhaley@NextSeq-Server:/mnt/data/bhaley/Results/kraken_dir$
Since I'm only a Kraken user and not a developer, I can't diagnose your exact issues here. But I'm guessing that the standard installation is not going to recognize the extra "fungi" download that you specified. You'll need to double check your changes to the download_genomic_library.sh script, but after you're sure that they are ok, I would try manually specifying the libraries you want to download, as follows:
./kraken-build --download-library bacteria --db kraken_DB
./kraken-build --download-library plasmids --db kraken_DB
./kraken-build --download-library fungi --db kraken_DB
./kraken-build --download-library viruses --db kraken_DB
./kraken-build --download-library human --db kraken_DB
No guarantees that those will all work, but it's worth a try. Once those are all there, do:
kraken-build --download-taxonomy --db kraken_DB
kraken-build --build --db kraken_DB
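Since some of these downloads may fail, it can help to run them in a loop that records which libraries did not come down before you start the long build. This is only a rough sketch: KRAKEN_BUILD and DB are placeholders for your own paths, download_libraries is a hypothetical helper name, and it assumes kraken-build exits with a non-zero status when a download fails.

```shell
# Assumptions: adjust these two placeholders to your own installation.
KRAKEN_BUILD="${KRAKEN_BUILD:-./kraken-build}"
DB="${DB:-kraken_DB}"

# Hypothetical helper: try each named library in turn and list the failures.
download_libraries() {
  failed=""
  for lib in "$@"; do
    if "$KRAKEN_BUILD" --download-library "$lib" --db "$DB"; then
      echo "downloaded: $lib"
    else
      echo "FAILED:     $lib" >&2
      failed="$failed $lib"
    fi
  done
  # Return non-zero if anything failed, so the build is not started by accident.
  if [ -n "$failed" ]; then
    echo "Libraries not downloaded:$failed" >&2
    return 1
  fi
}
```

You could then call download_libraries bacteria plasmids viruses human and only go on to the --download-taxonomy and --build steps if it returns successfully.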
It doesn't support fungi. It worked for plasmids but not for human. That will do for the time being. Thanks a lot!!
This post from Mick Watson might help?
http://www.opiniomics.org/building-a-kraken-database-with-new-ftp-structure-and-no-gi-numbers/
I just tried to install Kraken and this still seems to be a problem. Is Kraken dead software that is no longer maintained?
Sorry for the late response. We are working on updating the download scripts so that they allow downloading of mouse and other RefSeq genomes. In the meantime, I would download the genomes using wget or rsync and add them using the kraken-build --add-to-library option, which is described in the Kraken manual.
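As a rough illustration of that workaround, the sketch below adds every FASTA file found under a directory of manually downloaded genomes. KRAKEN_BUILD and DB are placeholders, add_fasta_dir is a hypothetical helper name, and the .fna/.fa extensions are an assumption about how your files are named. The Kraken manual also places requirements on the FASTA sequence headers, so check those before adding files.

```shell
# Assumptions: adjust these two placeholders to your own installation.
KRAKEN_BUILD="${KRAKEN_BUILD:-./kraken-build}"
DB="${DB:-kraken_DB}"

# Hypothetical helper: add each FASTA file under a directory to the library,
# one kraken-build --add-to-library call per file.
add_fasta_dir() {
  find "$1" -type f \( -name '*.fna' -o -name '*.fa' \) |
  while IFS= read -r fasta; do
    "$KRAKEN_BUILD" --add-to-library "$fasta" --db "$DB" ||
      echo "could not add $fasta" >&2
  done
}
```

After fetching and unpacking the genomes with wget or rsync, something like add_fasta_dir /path/to/unpacked/genomes would add them, and the database can then be built with the usual --build step.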
Hi Derrick,
Can you please help with the following issue:
bhaley@NextSeq-Server:/$ ./kraken-build --standard --db //kraken_DB
Found jellyfish v1.1.11
--2016-03-08 17:40:58-- ftp://ftp.ncbi.nih.gov/genomes/Bacteria/all.fna.tar.gz
=> ‘all.fna.tar.gz’
Resolving ftp.ncbi.nih.gov (ftp.ncbi.nih.gov)... 130.14.250.11, 2607:f220:41e:250::10
Connecting to ftp.ncbi.nih.gov (ftp.ncbi.nih.gov)|130.14.250.11|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done. ==> PWD ... done.
==> TYPE I ... done. ==> CWD (1) /genomes/Bacteria ...
No such directory ‘genomes/Bacteria’.