bioinformatics-centre / kaiju

Fast taxonomic classification of metagenomic sequencing reads using a protein reference database
http://kaiju.binf.ku.dk
GNU General Public License v3.0

makeDB does not create kaiju_db.fmi #8

Closed hudenise closed 8 years ago

hudenise commented 8 years ago

Hi, I'm new to Kaiju and would like to test it for our purposes as an alternative to Qiime. The installation was straightforward (thanks!). However, when downloading the NCBI database, the kaiju_db.fmi file is not created, which means I'm unable to run kaiju. My call was:

$ ../bin/makeDB.sh -v

and the screen output looks OK:

Extracting protein sequences from downloaded files...
Creating Borrows-Wheeler transform...

infilename= kaiju_db.faa

outfilename= kaiju_db

Alphabet= ACDEFGHIKLMNPQRSTVWY

nThreads= 5

length= 0.000000

checkpoint= 3

caseSens=OFF

revComp=OFF

term= *

revsort=OFF

help=OFF

Sequences read time = 86.800000s
SLEN 5926785889
NSEQ 18269638
ALPH *ACDEFGHIKLMNPQRSTVWY
SA NCHECK=0
Sorting done, time = 5752.120000s

$ ls
assembly_summary.archaea.txt assembly_summary.bacteria.txt downloadlist.txt genomes kaiju_db.bwt kaiju_db.faa kaiju_db.sa names.dmp nodes.dmp taxdump.tar.gz

pmenzel commented 8 years ago

I just tried it here and makeDB.sh went through smoothly. It seems that in your case, the last step (mkfmi) was skipped.

The full output should look like this:

$ makeDB.sh -v
Downloading taxonomy files from NCBI  
2016-08-08 11:59:12 URL: ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz [2751] -> ".listing" [1]
2016-08-08 12:00:58 URL: ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz [36832801] -> "taxdump.tar.gz" [1]
Extracting nodes.dmp and names.dmp files
Creating directory genomes/
Downloading file list for full genomes...
2016-08-08 12:01:02 URL: ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/archaea/assembly_summary.txt [154723] -> "assembly_summary.archaea.txt" [1]
2016-08-08 12:01:52 URL: ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/assembly_summary.txt [18037645] -> "assembly_summary.bacteria.txt" [1]
Downloading 5336 genome files from GenBank FTP server. This may take a while...
2016-08-08 12:01:55 URL: ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF_000007725.1_ASM772v1/GCF_000007725.1_ASM772v1_genomic.gbff.gz [410498] -> "genomes/GCF_000007725.1_ASM772v1_genomic.gbff.gz" [1]
...
2016-08-08 14:43:28 URL: ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF_000827835.1_ASM82783v1/GCF_000827835.1_ASM82783v1_genomic.gbff.gz [2769976] -> "genomes/GCF_000827835.1_ASM82783v1_genomic.gbff.gz" [1]
Downloading virus genomes from GenBank FTP server...
2016-08-08 14:50:12 URL: ftp://ftp.ncbi.nlm.nih.gov/refseq/release/viral/viral.1.genomic.gbff.gz [160469055] -> "genomes/viral.1.genomic.gbff.gz" [1]
2016-08-08 14:51:09 URL: ftp://ftp.ncbi.nlm.nih.gov/refseq/release/viral/viral.2.genomic.gbff.gz [18433279] -> "genomes/viral.2.genomic.gbff.gz" [1]
Extracting protein sequences from downloaded files...
Creating Borrows-Wheeler transform...
# infilename= kaiju_db.faa
# outfilename= kaiju_db
# Alphabet= ACDEFGHIKLMNPQRSTVWY
# nThreads= 5
# length= 0.000000
# checkpoint= 3
# caseSens=OFF
# revComp=OFF
# term= *
# revsort=OFF
# help=OFF
Sequences read time = 111.470166s
SLEN 5926791275
NSEQ 18269649
ALPH *ACDEFGHIKLMNPQRSTVWY
SA NCHECK=1
Sorting done,  time = 8727.987023s
Creating FM-Index...
# filenm= kaiju_db
# removecmd= NULL (null)
# help=OFF
Reading BWT from file kaiju_db.bwt ... DONE
BWT of length 5762364424 has been read with 18269649 sequencs, alphabet=*ACDEFGHIKLMNPQRSTVWY
Reading suffix array from file kaiju_db.sa ... DONE
Writing BWT header and SA to file  kaiju_db.fmi ... DONE
Constructing FM index
10% ... 20% ... 30% ... 40% ... 50% ... 60% ... 70% ... 80% ... 90% ... 100% ... index2 done ...
DONE
Writing FM index to file ... DONE

  !!  You can now delete files kaiju_db.bwt and kaiju_db.sa  !!

Done!
You can delete the folder genomes/ as well as the files taxdump.tar.gz, kaiju_db.faa, kaiju_db.bwt, and kaiju_db.sa
Kaiju only needs the files kaiju_db.fmi, nodes.dmp, and names.dmp.

You could try to just repeat the last step with $../bin/mkfmi kaiju_db
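Before repeating a step by hand, it can help to see which stage the build actually reached. The following is only a minimal sketch, assuming the file naming visible in the log above (kaiju_db.faa, then kaiju_db.bwt and kaiju_db.sa, then kaiju_db.fmi); the function name `next_build_step` and its return values are hypothetical and not part of Kaiju.

```python
from pathlib import Path


def next_build_step(prefix):
    """Guess which build step still has to run for a database file prefix.

    Assumes the naming seen in the makeDB.sh log: <prefix>.faa is turned into
    <prefix>.bwt and <prefix>.sa, which are then turned into <prefix>.fmi.
    The returned labels are illustrative only.
    """
    fmi = Path(prefix + ".fmi")
    bwt = Path(prefix + ".bwt")
    sa = Path(prefix + ".sa")
    faa = Path(prefix + ".faa")

    # A non-empty .fmi exists. This does not prove the file is complete.
    if fmi.is_file() and fmi.stat().st_size > 0:
        return "done"
    # Both intermediate files exist, so only the FM index step is missing.
    if bwt.is_file() and sa.is_file():
        return "mkfmi"
    # Only the protein FASTA file exists, so the BWT step has to run first.
    if faa.is_file():
        return "mkbwt"
    return "start over"


if __name__ == "__main__":
    print(next_build_step("kaiju_db"))
```

For the directory listing in the first comment (kaiju_db.faa, kaiju_db.bwt and kaiju_db.sa present, no kaiju_db.fmi), this sketch would report `mkfmi`, which matches the suggestion to repeat only the last step.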

hudenise commented 8 years ago

Thanks, indeed my first output was looking quite short compared to yours. However, I just tested the command you sent me: still no failure message, but files are still missing:

$ ../bin/mkfmi kaiju_db

filenm= kaiju_db

removecmd= NULL (null)

help=OFF

Reading BWT from file kaiju_db.bwt ... DONE
BWT of length 5762359137 has been read with 18269638 sequencs, alphabet=*ACDEFGHIKLMNPQRSTVWY
Reading suffix array from file kaiju_db.sa ...

[hudenise@ebi6-209 kaijudb]$ ls -ltr
total 16199688
-rw-r--r-- 1 hudenise metagen 102157973 Aug 4 15:20 nodes.dmp
-rw-r--r-- 1 hudenise metagen 130600015 Aug 4 15:20 names.dmp
-rw-rw-r-- 1 hudenise metagen 5979563210 Aug 4 16:52 kaiju_db.faa
-rw-rw-r-- 1 hudenise metagen 4653666088 Aug 4 17:14 kaiju_db.sa
-rw-rw-r-- 1 hudenise metagen 5665780084 Aug 4 17:14 kaiju_db.bwt
-r--r--r-- 1 hudenise metagen 36813213 Aug 5 09:20 taxdump.tar.gz
-rw-rw-r-- 1 hudenise metagen 154723 Aug 5 11:12 assembly_summary.archaea.txt
-rw-rw-r-- 1 hudenise metagen 582300 Aug 5 11:12 downloadlist.txt
-rw-rw-r-- 1 hudenise metagen 18037645 Aug 5 11:12 assembly_summary.bacteria.txt
drwxrwxr-x 2 hudenise metagen 1093632 Aug 5 11:28 genomes

Hubert

Dr Hubert DENISE

Metagenomics European Bioinformatics Institute (EMBL-EBI) European Molecular Biology Laboratory Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, United Kingdom Tel : (+44)01223 494102

pmenzel commented 8 years ago

Hm, so it means that mkfmi stops on

Reading suffix array from file kaiju_db.sa ...

without an explicit error message? And it doesn't reach the point

Reading suffix array from file kaiju_db.sa ... DONE

?

I don't see what could go wrong there. I assume you have enough RAM, at least 16 GB, since mkbwt completed successfully? Otherwise, maybe you could try a different machine.

In the meantime, you can use the index files that are used by the web server http://kaiju.binf.ku.dk/database/kaiju_index.tgz, which contains the kaiju_db.fmi and *.dmp files.

hudenise commented 8 years ago

OK, I increased the memory to 16G and the fmi file is now being created. Thanks, Hubert

pmenzel commented 8 years ago

Good to hear!

hudenise commented 8 years ago

Sorry, I've got another question. Running '../bin/kaiju -t nodes.dmp -f kaiju_db.fmi -i ../testsequence.fasta' gave me a list of classified and unclassified sequences, with a taxId associated with each classified one. Fine. However, I should be able to save this output by adding the '-o output.txt' option to my call. Unfortunately, output.txt is not created and no data are saved. Would you be able to advise, please? Thanks, Hubert

pmenzel commented 8 years ago

Hi, that is again quite strange. You have enough disk space free and permission to create the file? You could try the option -v for a bit more verbose output, which should look like this:

./bin/kaiju -t nodes.dmp -f kaiju_db.fmi -i test.fa -v -o output.txt
10:21:26 Reading database
 Reading taxonomic tree from file nodes.dmp
 Reading index from file kaiju_db.fmi
Output file: output.txt
10:21:45 Start classification using 1 threads.
10:21:45 Finished.

hudenise commented 8 years ago

It seems that I've got it working by giving it 16G of memory instead of the 8G I used by default. It would be nice to get an error message for this kind of issue. Could you let us know whether the memory requirement increases with the size of the query set, or does it stay more or less constant? Thanks, Hubert
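Until the tools report such failures themselves, one way to make a silent stop visible is to launch the program from a small wrapper and inspect its exit status. This is a minimal sketch under the assumption, not confirmed anywhere in this thread, that the process was terminated from outside (for example by a memory limit) rather than exiting on its own; `describe_exit` and `run_and_report` are hypothetical helper names.

```python
import signal
import subprocess
import sys


def describe_exit(returncode):
    """Turn a subprocess return code into a human-readable description."""
    if returncode == 0:
        return "exited normally"
    if returncode < 0:
        # On POSIX, subprocess reports -N when the child was killed by signal N.
        signum = -returncode
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = "signal %d" % signum
        return "killed by %s" % name
    return "exited with status %d" % returncode


def run_and_report(cmd):
    """Run cmd (a list of arguments) and print how it ended to stderr."""
    result = subprocess.run(cmd)
    print("%s: %s" % (cmd[0], describe_exit(result.returncode)), file=sys.stderr)
    return result.returncode


if __name__ == "__main__":
    # Example: python wrapper.py ../bin/mkfmi kaiju_db
    if len(sys.argv) > 1:
        run_and_report(sys.argv[1:])
```

A report such as `killed by SIGKILL` would be consistent with an external memory limit, although it does not prove it. On a cluster, the scheduler's own job accounting is usually the more reliable place to look.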

pmenzel commented 8 years ago

Good that it works now. In principle, Kaiju needs a bit more memory than the size of the fmi file (which gets bigger due to the increasing number of genomes in GenBank). The size of your input fasta/q file does not affect the memory requirements.
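That rule of thumb can be turned into a quick pre-flight check. The sketch below is illustrative only: "a bit more" is not quantified in this thread, so the 10% `headroom` default is an arbitrary assumption, and `physical_memory_bytes` relies on `os.sysconf` names that are available on Linux. Both function names are hypothetical.

```python
import os


def physical_memory_bytes():
    """Total physical RAM in bytes (assumes a Linux-like system)."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")


def index_fits_in_memory(index_size, memory, headroom=0.1):
    """Return True if index_size plus headroom fits into memory.

    index_size and memory are in bytes. headroom is the assumed extra
    fraction on top of the .fmi file size; 0.1 is a guess, not a Kaiju figure.
    """
    return index_size * (1 + headroom) <= memory


if __name__ == "__main__":
    fmi = "kaiju_db.fmi"
    if os.path.isfile(fmi):
        size = os.path.getsize(fmi)
        print(index_fits_in_memory(size, physical_memory_bytes()))
```

When the job runs under a scheduler with its own memory limit, as the 8G and 16G settings above suggest, that limit should be passed as `memory` instead of the machine's physical RAM.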

hudenise commented 8 years ago

Great, thanks for your timely responses. Much appreciated, cheers Hubert

tbazilegith commented 2 years ago

Hi everyone, I was creating the Kaiju database, but the .fmi was not created. My kaijudb directory looks like this:

merged.dmp names.dmp nodes.dmp refseq taxdump.tar.gz

and the refseq subdirectory contains:

assembly_summary.archaea.txt assembly_summary.bacteria.txt downloadlist.txt kaiju_db_refseq.faa source

Should I carry on from here to create the .fmi? I'm unsure whether I should start over. I tried kaiju-mkfmi kaijudb, but I got

filenm= kaijudb/

removecmd= NULL (null)

help=OFF

ERROR: File kaijudb/.bwt containing BWT could not be opened for reading

Thanks, TJ
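The path in that error message suggests that the argument is treated as a file name prefix to which ".bwt" is appended, so a directory name such as kaijudb/ becomes kaijudb/.bwt. The following is a rough sketch for locating a usable prefix, assuming that reading of the message is right; `expected_bwt_path` and `find_prefixes` are hypothetical helpers, not Kaiju functions.

```python
from pathlib import Path


def expected_bwt_path(argument):
    """File name implied by the error message: the argument plus '.bwt'."""
    return argument + ".bwt"


def find_prefixes(directory):
    """List prefixes below directory that have both a .bwt and a .sa file."""
    prefixes = []
    for bwt in sorted(Path(directory).rglob("*.bwt")):
        # A prefix is only usable if the matching suffix array file exists.
        if bwt.with_suffix(".sa").is_file():
            prefixes.append(str(bwt.with_suffix("")))
    return prefixes


if __name__ == "__main__":
    print(expected_bwt_path("kaijudb/"))
    print(find_prefixes("kaijudb"))
```

For the listing above, which contains kaiju_db_refseq.faa but no .bwt or .sa file, this would return an empty list. That would suggest the BWT step has not completed yet, so the index step has nothing to read.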

PeterCx commented 2 years ago

I am also experiencing the same issue. What was the fix? Many thanks
