bioinformatics-centre / kaiju

Fast taxonomic classification of metagenomic sequencing reads using a protein reference database
http://kaiju.binf.ku.dk
GNU General Public License v3.0
260 stars 68 forks

Contigs classification #28

Closed davidvilanova closed 7 years ago

davidvilanova commented 7 years ago

Hi, I have used a set of contigs (corresponding to one specific bin) to assign taxonomy. The report file can be misleading if your input data are contigs or bins...

I have attached a screenshot of my xls file. I have computed the summary output file with Kaiju. Kaiju says that 13.29% of reads correspond to B. subtilis. The problem is that it counts all contigs equally, meaning that the fact that one contig is longer than another is not taken into account. If you instead add up the lengths of the contigs (column "sum" in the attached file), you can see that B. subtilis represents 84.55% of the total length. In this case it's a synthetic mock that contains B. subtilis, so 84% is closer to reality. I think that when running contigs or other FASTA sequences rather than reads, the summary file should report the sum of lengths rather than the number of reads classified, or emit a warning, since the current output could be misleading.
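To make the difference between the two ways of counting concrete, here is a minimal sketch in Python. It is not part of Kaiju; the function name `taxon_fractions` and the toy numbers are made up for illustration and are not the values from the attached spreadsheet. It assumes you already have one (taxon, contig length) pair per classified contig.

```python
from collections import defaultdict


def taxon_fractions(contigs):
    """Compare count-based and length-based fractions per taxon.

    contigs: iterable of (taxon, length) pairs, one per classified contig.
    Returns (by_count, by_length), two dicts mapping taxon -> fraction.
    """
    counts = defaultdict(int)
    lengths = defaultdict(int)
    for taxon, length in contigs:
        counts[taxon] += 1          # every contig counts once
        lengths[taxon] += length    # every contig counts by its length
    n_contigs = sum(counts.values())
    total_length = sum(lengths.values())
    by_count = {t: c / n_contigs for t, c in counts.items()}
    by_length = {t: l / total_length for t, l in lengths.items()}
    return by_count, by_length


# Hypothetical toy bin: one long contig and three short ones.
toy_bin = [
    ("Bacillus subtilis", 90000),
    ("Other A", 5000),
    ("Other B", 3000),
    ("Other C", 2000),
]
by_count, by_length = taxon_fractions(toy_bin)
print(by_count["Bacillus subtilis"], by_length["Bacillus subtilis"])
```

In this toy bin, B. subtilis holds one of four contigs but most of the assembled length, which is the same kind of gap as between the two percentages above.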

Thanks

image

pmenzel commented 7 years ago

Hi David,

Yes, when examining a sample using HTS data, the contained taxa should not be quantified by contig abundance but rather by read abundance, which is what Kaiju is used for.

I don't think that using contig lengths for quantification is precise in practice, because the length of a contig after metagenomic assembly says little about the number of reads that were used to assemble it, and thus about the abundance of that species' DNA in the sample.

So the best way would be to map the reads back to contigs and then use that number of mapped reads per contig for quantification. Of course the problem then is, as you have observed, that longer contigs tend to be assigned to higher-level nodes in the tree, as it becomes more likely that the best matching protein sequence in the contig can be found in more than one species in the reference database.

Actually the main reason for using programs like Kaiju is to avoid the assembly in the first place and derive species abundances directly from the reads.

davidvilanova commented 7 years ago

Thanks for the update, and I understand the philosophy behind Kaiju. However, classifying a short read rather than a whole contig would probably result in lower taxonomic resolution, and species identification might become more difficult. That's why I prefer to do taxonomy after assembly (I use MEGAHIT for assembly, by the way). I would use the reads mapped to the contigs (using bwa) to get the coverage (abundance) of each contig, but I prefer to rely on the contig for taxonomy assignment.
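A rough sketch of that combination in Python, taking taxonomy from the contig and abundance from the reads mapped to it. The function name `read_abundance_by_taxon`, the contig names, and the counts are hypothetical. It assumes the per-contig read counts have already been extracted from the alignment (for example with something like `samtools idxstats`) and that the contig-to-taxon assignments have been parsed from the classifier output.

```python
from collections import defaultdict


def read_abundance_by_taxon(contig_taxon, contig_reads):
    """Sum mapped reads per taxon and return relative abundances.

    contig_taxon: dict mapping contig name -> taxon assigned to that contig.
    contig_reads: dict mapping contig name -> number of reads mapped to it.
    Contigs without an assignment are pooled under "unclassified".
    """
    reads = defaultdict(int)
    for contig, n_reads in contig_reads.items():
        taxon = contig_taxon.get(contig, "unclassified")
        reads[taxon] += n_reads
    total = sum(reads.values())
    if total == 0:
        return {}  # no mapped reads, nothing to quantify
    return {taxon: n / total for taxon, n in reads.items()}


# Hypothetical toy input.
contig_taxon = {
    "k1": "Bacillus subtilis",
    "k2": "Bacillus subtilis",
    "k3": "Escherichia coli",
}
contig_reads = {"k1": 700, "k2": 100, "k3": 150, "k4": 50}

abundance = read_abundance_by_taxon(contig_taxon, contig_reads)
print(abundance)
```

Here the weight of each contig comes from how many reads support it, not from its length or from the mere fact that it exists.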

pmenzel commented 7 years ago

Yes, counting the reads per contig is far better than relying on contig lengths alone.