bioinformatics-centre / kaiju

Fast taxonomic classification of metagenomic sequencing reads using a protein reference database
http://kaiju.binf.ku.dk
GNU General Public License v3.0
261 stars 68 forks source link

kaiju sensitivity/specificity merged vs. unmerged paired reads #59

Closed slhogle closed 6 years ago

slhogle commented 6 years ago

Hi Peter, Thanks for the great software.

Will kaiju be more specific/sensitive for merged read pairs than for unmerged read pairs?

From my understanding kaiju translates paired reads individually, puts them in the same sorted list, and then processes that list using a backwards search on the BWT using either the MEM or greedy criterion. The classification result comes from the read with the best match to the searched database.

Are the results for each read somehow weighted if both have a best hit to the same protein? For example if read 1 has equally good matches to protein A and B and read 2 has a best match only to protein B but of a lower MEM/greedy score than read 1 would the read pair be classified as the LCA of proteins A and B or just protein B?

Cheers!

pmenzel commented 6 years ago

Hi Shane,

Will kaiju be more specific/sensitive for merged read pairs than for unmerged read pairs?

I never measured it, but I would assume that merging the mates could increase sensitivity a bit, because the read length would be increased so that an previously split protein sequence could be found after merging.

From my understanding kaiju translates paired reads individually, puts them in the same sorted list, and then processes that list using a backwards search on the BWT using either the MEM or greedy criterion. The classification result comes from the read with the best match to the searched database.

Yes that is how it works. To be more precise: Both reads are translated in all six reading frames and the resulting "amino acid fragments" are split at stop codons into smaller fragments, which are then sorted by length (MEM mode) or their BLOSUM62 score (Greedy mode). Next, the first fragment in the sorted list is searched against the BWT and the highest matching (again, determined either by length or score) subsequence of the fragment is recorded. Then the second fragment in the list is searched and so on. Eventually, the classification is based on the highest matching fragment, no matter from which mate it originates. However, amino acid fragments down in the list are not searched anymore, if they cannot yield a better match than the current best match. So it is actually possible, that only the first entry in the sorted list is searched and all others are discarded.

There is no weighting and the winner takes it all, unless two database sequences have match of same length/score, then their LCA is calculated.

all the best!

slhogle commented 6 years ago

makes sense. thanks!