DerrickWood / kraken

Kraken taxonomic sequence classification system
http://ccb.jhu.edu/software/kraken/
GNU General Public License v3.0

No such directory ‘genomes/Bacteria’. #40

Closed salaheenz closed 6 years ago

salaheenz commented 8 years ago

Hi Derrick,

Can you please help on the following issue:

    bhaley@NextSeq-Server:/$ ./kraken-build --standard --db //kraken_DB
    Found jellyfish v1.1.11
    --2016-03-08 17:40:58-- ftp://ftp.ncbi.nih.gov/genomes/Bacteria/all.fna.tar.gz
    => ‘all.fna.tar.gz’
    Resolving ftp.ncbi.nih.gov (ftp.ncbi.nih.gov)... 130.14.250.11, 2607:f220:41e:250::10
    Connecting to ftp.ncbi.nih.gov (ftp.ncbi.nih.gov)|130.14.250.11|:21... connected.
    Logging in as anonymous ... Logged in!
    ==> SYST ... done.
    ==> PWD ... done.
    ==> TYPE I ... done.
    ==> CWD (1) /genomes/Bacteria ...
    No such directory ‘genomes/Bacteria’.

jbeaulaurier commented 8 years ago

They have rearranged the file structure of all the RefSeq genomes on the ftp server. The modified paths to the necessary *.tar.gz files (which you can manually update in download_genomic_library.sh) are:

    (modify line 43): $FTP_SERVER/genomes/archive/old_refseq/Bacteria/all.fna.tar.gz
    (modify line 59): $FTP_SERVER/genomes/archive/old_refseq/Plasmids/plasmids.all.fna.tar.gz

These two paths will allow you to download the standard bacteria and plasmid databases, but I'm not sure about the proper updated path for downloading the viruses database.
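The same two path changes could be applied with sed instead of by hand. This is only a sketch: `patch_refseq_paths` is a hypothetical helper, not part of Kraken, and it assumes the unmodified script contains the old paths literally as `genomes/Bacteria/` and `genomes/Plasmids/`. It matches on the path rather than on line numbers 43 and 59, since those may differ between Kraken versions.

```shell
#!/bin/sh
# Hypothetical helper (not part of Kraken): rewrite the old RefSeq paths in
# download_genomic_library.sh to the archived locations given above.
patch_refseq_paths() {
  script="$1"
  # Keep a backup copy next to the script so the change is easy to undo.
  cp "$script" "$script.bak" || return 1
  # Assumption: the old paths appear literally as genomes/Bacteria/ and
  # genomes/Plasmids/. Already-patched paths no longer match these patterns,
  # so running the function twice does not double-patch the file.
  sed -e 's|genomes/Bacteria/|genomes/archive/old_refseq/Bacteria/|' \
      -e 's|genomes/Plasmids/|genomes/archive/old_refseq/Plasmids/|' \
      "$script.bak" > "$script"
}
# Usage:
#   patch_refseq_paths ./download_genomic_library.sh
```

Note that a second run overwrites the `.bak` file with the already-patched version, so copy the original elsewhere if you want to keep it.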

salaheenz commented 8 years ago

Thanks!!

salaheenz commented 8 years ago

Hi, after modifying those lines, the bacterial and viral databases were downloaded, but not the plasmid, fungi (which I added separately), or human ones. Instead, it goes straight to the build step. Any suggestions?

jbeaulaurier commented 8 years ago

I'd love to help, but it's difficult without more specific information. If you enter the plasmid ftp path in your browser, does it take you to a list of assemblies? I'd double check that you have the proper path for the plasmids ftp directory.

I'd suggest familiarizing yourself with the new NCBI ftp directory structure for these archived versions of the assemblies, and seeing whether you can locate the proper paths to the archived fungi and human assemblies.
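One way to check a path without a browser is wget's `--spider` mode, which fetches nothing and only reports whether the target is there. A minimal sketch follows; both function names are hypothetical, and the server address is simply the one shown in the wget log at the top of this thread.

```shell
#!/bin/sh
# Server address as it appears in the wget log earlier in this thread.
FTP_SERVER="ftp://ftp.ncbi.nih.gov"

# Hypothetical helper: build the archived-RefSeq URL for a group and file name.
old_refseq_url() {
  printf '%s/genomes/archive/old_refseq/%s/%s\n' "$FTP_SERVER" "$1" "$2"
}

# Hypothetical helper: report whether a URL is reachable, without downloading it.
check_url() {
  if wget --spider -q "$1"; then
    echo "OK       $1"
  else
    echo "MISSING  $1"
  fi
}
# Usage (needs network access):
#   check_url "$(old_refseq_url Bacteria all.fna.tar.gz)"
#   check_url "$(old_refseq_url Plasmids plasmids.all.fna.tar.gz)"
```

Only the Bacteria and Plasmids paths come from this thread; whether other groups exist under `old_refseq` is exactly what a check like this is meant to find out.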

salaheenz commented 8 years ago

I modified the script this way and ran it. There were no errors, but I did not get the plasmid, fungi, or human databases:

    case "$1" in
      "bacteria")
        mkdir -p $LIBRARY_DIR/Bacteria
        cd $LIBRARY_DIR/Bacteria
        if [ ! -e "lib.complete" ]
        then
          rm -f all.fna.tar.gz
          wget $FTP_SERVER/genomes/archive/old_refseq/Bacteria/all.fna.tar.gz
          echo -n "Unpacking..."
          tar zxf all.fna.tar.gz
          rm all.fna.tar.gz
          echo " complete."
          touch "lib.complete"
        else
          echo "Skipping download of bacterial genomes, already downloaded here."
        fi
        ;;

      "Fungi")
        mkdir -p $LIBRARY_DIR/Fungi
        cd $LIBRARY_DIR/Fungi
        if [ ! -e "lib.complete" ]
        then
          rm -f all.fna.tar.gz
          wget $FTP_SERVER/genomes/archive/old_refseq/Fungi/all.fna.tar.gz
          echo -n "Unpacking..."
          tar zxf all.fna.tar.gz
          rm all.fna.tar.gz
          echo " complete."
          touch "lib.complete"
        else
          echo "Skipping download of fungal genomes, already downloaded here."
        fi
        ;;

      "plasmids")
        mkdir -p $LIBRARY_DIR/Plasmids
        cd $LIBRARY_DIR/Plasmids
        if [ ! -e "lib.complete" ]
        then
          rm -f plasmids.all.fna.tar.gz
          wget $FTP_SERVER/genomes/archive/old_refseq/Plasmids/plasmids.all.fna.tar.gz
          echo -n "Unpacking..."
          tar zxf plasmids.all.fna.tar.gz
          rm plasmids.all.fna.tar.gz
          echo " complete."
          touch "lib.complete"
        else
          echo "Skipping download of plasmids, already downloaded here."
        fi
        ;;

      "viruses")
        mkdir -p $LIBRARY_DIR/Viruses
        cd $LIBRARY_DIR/Viruses
        if [ ! -e "lib.complete" ]
        then
          rm -f all.fna.tar.gz
          rm -f all.ffn.tar.gz
          wget $FTP_SERVER/genomes/Viruses/all.fna.tar.gz
          wget $FTP_SERVER/genomes/Viruses/all.ffn.tar.gz
          echo -n "Unpacking..."
          tar zxf all.fna.tar.gz
          tar zxf all.ffn.tar.gz
          rm all.fna.tar.gz
          rm all.ffn.tar.gz
          echo " complete."
          touch "lib.complete"
        else
          echo "Skipping download of viral genomes, already downloaded here."
        fi
        ;;

      "human")
        mkdir -p $LIBRARY_DIR/Human
        cd $LIBRARY_DIR/Human
        if [ ! -e "lib.complete" ]
        then
          # get list of CHR_* directories
          wget --spider --no-remove-listing $FTP_SERVER/genomes/H_sapiens/
          directories=$(perl -nle '/^d/ and /(CHR_\w+)\s*$/ and print $1' .listing)
          rm .listing
          # For each CHR_* directory, get GRCh* fasta gzip file name, d/l, unzip, and add
          for directory in $directories
          do
            wget --spider --no-remove-listing $FTP_SERVER/genomes/H_sapiens/$directory/
            file=$(perl -nle '/^-/ and /\b(hs_ref_GRCh\w+\.fa\.gz)\s*$/ and print $1' .listing)
            [ -z "$file" ] && exit 1
            rm .listing
            wget $FTP_SERVER/genomes/H_sapiens/$directory/$file
            gunzip "$file"

    bhaley@NextSeq-Server:/mnt/data/bhaley/Results/kraken_dir$ sudo ./kraken-build --standard --db /mnt/data/bhaley/Results/kraken_DB
    Found jellyfish v1.1.11
    Skipping download of bacterial genomes, already downloaded here.
    Skipping download of viral genomes, already downloaded here.
    Kraken build set to minimize disk writes.
    Creating k-mer set (step 1 of 6)...
    Found jellyfish v1.1.11
    Hash size not specified, using '11634429519'
    K-mer set created. [1h3m14.378s]
    Skipping step 2, no database reduction requested.
    Sorting k-mer set (step 3 of 6)...
    K-mer set sorted. [4h41m27.198s]
    Creating GI number to seqID map (step 4 of 6)...
    GI number to seqID map created. [2m41.191s]
    Creating seqID to taxID map (step 5 of 6)...
    214486 sequences mapped to taxa. [40.859s]
    Setting LCAs in database (step 6 of 6)...
    Finished processing 214798 sequences
    Database LCAs set. [2h46m41.733s]
    Database construction complete. [Total: 8h34m45.604s]
    bhaley@NextSeq-Server:/mnt/data/bhaley/Results/kraken_dir$
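The "Skipping download" lines in the log come from the `lib.complete` marker that the script touches after a successful unpack, so listing those markers shows which libraries were actually fetched. A sketch only: `library_status` is a hypothetical name, and it assumes `$LIBRARY_DIR` is the `library` subdirectory of the database directory, which I have not confirmed for every Kraken version.

```shell
#!/bin/sh
# Hypothetical helper: list each library subdirectory and whether it holds
# the lib.complete marker written by download_genomic_library.sh.
library_status() {
  library_dir="$1"   # assumed layout: <database dir>/library
  for dir in "$library_dir"/*/; do
    [ -d "$dir" ] || continue          # the glob matched nothing
    name=$(basename "$dir")
    if [ -e "$dir/lib.complete" ]; then
      echo "$name complete"
    else
      echo "$name incomplete"
    fi
  done
}
# Usage:
#   library_status /mnt/data/bhaley/Results/kraken_DB/library
```

A library with no subdirectory at all was never attempted, which is a different situation from one whose download started and failed.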

jbeaulaurier commented 8 years ago

Since I'm only a Kraken user and not a developer, I can't diagnose your exact issues here. But I'm guessing that the standard installation is not going to recognize the extra "fungi" download that you specified. You'll need to double check your changes to the download_genomic_library.sh script, but after you're sure that they are ok, I would try manually specifying the libraries you want to download, as follows:

    ./kraken-build --download-library bacteria --db kraken_DB
    ./kraken-build --download-library plasmids --db kraken_DB
    ./kraken-build --download-library fungi --db kraken_DB
    ./kraken-build --download-library viruses --db kraken_DB
    ./kraken-build --download-library human --db kraken_DB

No guarantees that those will all work, but it's worth a try. Once those are all there, do:

    kraken-build --download-taxonomy --db kraken_DB
    kraken-build --build --db kraken_DB
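Since some of these downloads may fail, it can help to run them in a loop that keeps going and reports which ones did not work. A rough sketch, using only the `--download-library` and `--db` options shown above; the function name and the `KRAKEN_BUILD` override variable are hypothetical conveniences, not Kraken features.

```shell
#!/bin/sh
# Sketch: try each library in turn and carry on when one fails, so a single
# broken FTP path does not hide the others.
# KRAKEN_BUILD is a hypothetical override for the path to kraken-build.
download_libraries() {
  db="$1"; shift
  failed=""
  for lib in "$@"; do
    if "${KRAKEN_BUILD:-./kraken-build}" --download-library "$lib" --db "$db"; then
      echo "downloaded: $lib"
    else
      echo "FAILED: $lib" >&2
      failed="$failed $lib"
    fi
  done
  # Return success only if every library downloaded.
  [ -z "$failed" ]
}
# Usage:
#   download_libraries kraken_DB bacteria plasmids viruses human
```

If the function returns non-zero, fix or skip the libraries it lists before running the taxonomy download and build steps.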

salaheenz commented 8 years ago

It doesn't support fungi. It worked for plasmids but not for human, which will do for the time being. Thanks a lot!!
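One possible reason for the fungi failure, judging only from the script excerpt posted earlier: the added branch is labelled "Fungi", while the other branches and the name passed on the command line are lowercase. Shell `case` patterns are case-sensitive, so `fungi` does not match `Fungi`. The toy function below is not Kraken code (`pick_library` is a hypothetical name); it only demonstrates the matching behaviour. kraken-build may also reject unknown library names before the script ever runs, which I have not checked.

```shell
#!/bin/sh
# Toy dispatcher mirroring the shape of the case statement in the excerpt.
pick_library() {
  case "$1" in
    "bacteria") echo "bacteria branch" ;;
    "Fungi")    echo "fungi branch" ;;
    *)          echo "unsupported: $1" ;;
  esac
}

pick_library fungi   # prints "unsupported: fungi" (default branch)
pick_library Fungi   # prints "fungi branch" (matches the capitalised label)
```

If this is the cause, renaming the label to lowercase "fungi" would make it consistent with the other branches.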

flashton2003 commented 7 years ago

This post from Mick Watson might help?

http://www.opiniomics.org/building-a-kraken-database-with-new-ftp-structure-and-no-gi-numbers/

xapple commented 7 years ago

I just tried to install Kraken and this still seems to be a problem. Is Kraken dead software that is no longer maintained?

jenniferlu717 commented 6 years ago

Sorry for the late response. We are working on updating the download scripts so that they allow downloading of mouse and other RefSeq genomes. In the meantime, I would download the genomes using wget or rsync and add them using the kraken-build --add-to-library option, which is described in the Kraken manual.
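As a rough sketch of that workaround: once the FASTA files are on disk, each one can be passed to `kraken-build --add-to-library` in a loop. The function name, the `KRAKEN_BUILD` override variable, and the `.fa`/`.fna` extensions are assumptions for illustration. As far as I recall, the manual also requires each sequence header to carry a GI number or a `kraken:taxid` tag, so please check that before relying on this.

```shell
#!/bin/sh
# Sketch: add every FASTA file in a directory to a Kraken database.
# KRAKEN_BUILD is a hypothetical override for the path to kraken-build.
add_fasta_dir() {
  fasta_dir="$1"
  db="$2"
  for fa in "$fasta_dir"/*.fa "$fasta_dir"/*.fna; do
    [ -e "$fa" ] || continue   # skip a pattern that matched nothing
    # Stop at the first failure so a bad file is noticed immediately.
    "${KRAKEN_BUILD:-kraken-build}" --add-to-library "$fa" --db "$db" || return 1
  done
}
# Usage, after fetching the genomes with wget or rsync:
#   add_fasta_dir ./my_genomes kraken_DB
#   kraken-build --build --db kraken_DB
```

Files that are still gzip-compressed would need to be unpacked first, as the download script does with gunzip.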