bioinformatics-centre / kaiju

Fast taxonomic classification of metagenomic sequencing reads using a protein reference database
http://kaiju.binf.ku.dk
GNU General Public License v3.0

can kaiju be used for assembled contigs? #73

Closed chloelulu closed 6 years ago

chloelulu commented 6 years ago

Hi,

Is Kaiju suitable for taxonomic classification of contigs assembled with MEGAHIT? Thanks in advance.

pmenzel commented 6 years ago

Hi, yes, you can also give contigs as input to Kaiju instead of sequencing reads. The taxon assignment will be based on the best matching subsequence.

chloelulu commented 6 years ago

I see. Thanks so much for the answer.

chloelulu commented 6 years ago

Hi Peter, I have a follow-up question. If I want to predict the prokaryotic taxonomy of long assembled contigs, which database is better? I tried Kaiju in KBase on my raw, unassembled reads with the different databases, and the results differ slightly. It also only gives me a summary of the taxonomy, with no per-read assignments. If I run it on the command line, will I get each contig and its corresponding taxonomy? Thanks so much.

pmenzel commented 6 years ago

Hi again,

yes, Kaiju will output a tab-delimited file that contains the read name (or contig name in your case) and the NCBI taxon ID of the taxon that is assigned to that read/contig.
If you also need the actual names, you can use the program addTaxonNames, which will append the names to each line in the output file.
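
A minimal sketch of those two steps on the command line. The wrapper name classify_contigs and all file names are placeholders, and it assumes nodes.dmp and names.dmp from the NCBI taxdump are in the current directory. In some newer Kaiju releases the second program is installed as kaiju-addTaxonNames, so check which name your installation provides.

```shell
# Sketch only: classify_contigs is a hypothetical wrapper, not part of Kaiju.
# $1: nucleotide FASTA with the assembled contigs
# $2: Kaiju index file, e.g. kaiju_db.fmi
# $3: prefix for the output files
classify_contigs() {
    contigs=$1
    db=$2
    out=$3
    # One line per contig: C/U flag, contig name, NCBI taxon ID
    kaiju -t nodes.dmp -f "$db" -i "$contigs" -o "$out.out" -z 16
    # Append the taxon name to each line of the classification file
    addTaxonNames -t nodes.dmp -n names.dmp -i "$out.out" -o "$out.names.out"
}
```

For example, `classify_contigs MEGAHIT_C1_1.contigs.fa kaiju_db.fmi C1_1` would write C1_1.out and C1_1.names.out.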

chloelulu commented 6 years ago

Hi, thanks for the explanation. I am still confused by the result. I used a custom database, which contains 9314 virus genomes. I built the database following https://github.com/bioinformatics-centre/kaiju#custom-database and then ran kaijux -f /home/virus/RefGenome/kaiju_db/proteins.faa.fmi -i ../MEGAHIT_C1_1.contigs.fa -o kaijux_MEGAHIT_C1_1_greedy -z 16 -a greedy -e 0. The resulting table is attached. Do you have any suggestions on how to deal with this? Many thanks. kaijux_MEGAHIT_C1_1_greedy_virus.txt

pmenzel commented 6 years ago

Ok, so here you are running kaijux instead of kaiju. Kaijux is used for querying a database without doing taxonomy classification and therefore it returns the database IDs for the best match(es), which are your accession numbers in the fourth column of the output file.

If you want to do taxonomy classification, you either need to modify your custom database and include the taxon id of each sequence in the sequence names in the database, or you need to post-process the kaijux output file and find the taxon id for the returned accession numbers, e.g., using one of the files on the NCBI FTP server.
For example, kaiju uses the file prot.accession2taxid when creating a database from the BLAST NR file in order to map protein accession numbers to taxa.
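
A rough sketch of the post-processing route, assuming the kaijux output is tab-separated with the accession numbers in the fourth column (possibly comma-separated), and that the lookup table has the NCBI accession2taxid layout (a header line, then accession, accession.version, taxid, gi). The function name map_accessions is made up for illustration.

```shell
# Sketch only: map_accessions is a hypothetical helper, not part of Kaiju.
# $1: accession2taxid table (tab-separated, with a header line)
# $2: kaijux output file
# Prints: contig name, accession, taxon ID (NA if the accession is not in the table)
map_accessions() {
    awk -F'\t' '
        # First file: build the accession.version -> taxid lookup, skipping the header
        NR == FNR { if (FNR > 1) taxid[$2] = $3; next }
        # Second file: only lines flagged as classified carry accessions
        $1 == "C" {
            n = split($4, acc, ",")
            for (i = 1; i <= n; i++)
                if (acc[i] != "")
                    print $2 "\t" acc[i] "\t" ((acc[i] in taxid) ? taxid[acc[i]] : "NA")
        }
    ' "$1" "$2"
}
```

The full prot.accession2taxid file is very large, so in practice it is probably better to first reduce it to the accessions that occur in your database before loading it into awk like this.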

chloelulu commented 6 years ago

Hi, I see. That makes more sense to me. I want to do taxonomy classification on my assembled reads with my custom virus reference genome database, so I misused kaijux. Could you explain more about "modify your custom database and include the taxon id of each sequence in the sequence names in the database"? How can I achieve that? In the same manner as in NCBI BLAST? Also, I found that my reference virus genomes are all in nucleotide format. Do I need to change them?

pmenzel commented 6 years ago

Well, Kaiju uses protein databases, so yes, you need to have protein sequences in your custom database :-)

Regarding the taxon IDs, the sequence names in the database must end with an underscore followed by the taxon ID, for example:

>WP_003131952.1_1358
MAQQRRGGFKRRKKVDFIAANKIEVVDYKDTELLKRFISERGKILPRRVTGTSAKNQRKVVNAIKRARVMALLPFVAEDQN
>XP_642131.1_44689
MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ
...
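
If your protein FASTA file does not follow this naming scheme yet, the headers can be rewritten along these lines. This is only a sketch: it assumes you have prepared a two-column, tab-separated table (sequence ID, NCBI taxon ID) yourself, and the function name add_taxid_to_headers is hypothetical. Sequences whose ID is missing from the table are dropped.

```shell
# Sketch only: add_taxid_to_headers is a hypothetical helper, not part of Kaiju.
# $1: tab-separated table: sequence ID, NCBI taxon ID
# $2: protein FASTA; the first word of each header is taken as the sequence ID
add_taxid_to_headers() {
    awk -F'\t' '
        # First file: build the sequence ID -> taxon ID lookup
        NR == FNR { taxid[$1] = $2; next }
        /^>/ {
            id = substr($0, 2)
            sub(/[ \t].*/, "", id)      # keep only the first word of the header
            keep = (id in taxid)        # sequences without a taxon ID are skipped
            if (keep) print ">" id "_" taxid[id]
            next
        }
        keep { print }                  # sequence lines of retained records
    ' "$1" "$2"
}
```

The output can be redirected to a new .faa file and then indexed with mkbwt and mkfmi.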

If you just want to have viruses in the database and can live with the ones that are in NCBI RefSeq, you can do it like this:

wget -N -nv ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz
tar xf taxdump.tar.gz nodes.dmp names.dmp
mkdir -p genomes
wget -N -nv -P genomes ftp://ftp.ncbi.nlm.nih.gov/refseq/release/viral/viral.1.genomic.gbff.gz
wget -N -nv -P genomes ftp://ftp.ncbi.nlm.nih.gov/refseq/release/viral/viral.2.genomic.gbff.gz
/path/to/kaiju/bin/gbk2faa.pl genomes/viral.1.genomic.gbff.gz genomes/viral.1.genomic.faa
/path/to/kaiju/bin/gbk2faa.pl genomes/viral.2.genomic.gbff.gz genomes/viral.2.genomic.faa
cat genomes/viral.1.genomic.faa  genomes/viral.2.genomic.faa >kaiju_db.faa
/path/to/kaiju/bin/mkbwt -n 5 -e 5 -a ACDEFGHIKLMNPQRSTVWY -o kaiju_db kaiju_db.faa
/path/to/kaiju/bin/mkfmi kaiju_db

which gives you the file kaiju_db.fmi and you can use it together with nodes.dmp in standard kaiju.

chloelulu commented 6 years ago

Millions of thanks. I will try it. Too detailed! Awesome!

pmenzel commented 6 years ago

I just modified makeDB.sh so that it is possible to run makeDB.sh -v without any other parameters. This will only download the virus sequences from RefSeq, which will then be in kaiju_db.fmi.

chloelulu commented 6 years ago

Awesome! Thanks so much. :-)

lalalagartija commented 5 years ago

Hi, I have the same question. I want to submit my contigs to Kaiju. But what do you mean by "The taxon assignment will be based on the best matching subsequence"? Would it be better to submit my contigs, or to submit the ORFs and then filter manually for the right taxon? Thanks!

pmenzel commented 5 years ago

You can submit the individual ORFs as protein sequences and then assign the taxonomy to the contig depending on the most prevalent assigned taxon from all ORFs in the contig.

If you were to submit the entire contig nucleotide sequence, then Kaiju would output only one taxon for that sequence, namely the one with the best database match. That would use only one protein out of all the proteins in the contig, and this best match might not agree with the majority of the matches from the other proteins. You can try both approaches and check how often they agree.
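
A possible sketch of the ORF-based approach, assuming the ORFs were classified with Kaiju (tab-separated output: C/U flag, ORF name, taxon ID) and that ORF names are the contig name plus an underscore and a running number, as Prodigal writes them (e.g. k141_12_3 for the third ORF on contig k141_12). The function name contig_majority_taxon is invented for this example.

```shell
# Sketch only: contig_majority_taxon is a hypothetical helper, not part of Kaiju.
# $1: Kaiju output for the ORFs
# Prints: contig name, most frequent taxon ID, number of ORFs supporting it
contig_majority_taxon() {
    awk -F'\t' '
        $1 == "C" {
            contig = $2
            sub(/_[0-9]+$/, "", contig)     # strip the ORF index from the name
            count[contig SUBSEP $3]++       # votes per (contig, taxon) pair
        }
        END {
            for (key in count) {
                split(key, part, SUBSEP)
                c = part[1]; t = part[2]
                if (count[key] > best[c]) { best[c] = count[key]; taxon[c] = t }
            }
            for (c in taxon) print c "\t" taxon[c] "\t" best[c]
        }
    ' "$1" | sort
}
```

Unclassified ORFs are ignored, and ties between taxa are broken arbitrarily here, so contigs with a narrow majority deserve a closer look. Comparing this table with the per-contig output from running Kaiju on the nucleotide contigs shows how often the two approaches agree.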