DaehwanKimLab / centrifuge

Classifier for metagenomic sequences
GNU General Public License v3.0
237 stars 73 forks source link

Behavior of --exclude-taxids #150

Closed kweitemier closed 5 years ago

kweitemier commented 5 years ago

Hello,

I've been working with Centrifuge using a custom database and have been using --exclude-taxids to disregard several taxa in the database. My understanding was that --exclude-taxids would essentially behave as though the listed taxa were not in the database. That is, if a read hits taxon A with a high score and hits taxon B with a slightly lower score, and k=1, then taxon A would be reported. But, if taxon A was excluded then the only hit would be to taxon B and it would be reported.

I have a read that I'm testing that, when k is large, produces the following classification output.

readID  seqID   taxID   score   2ndBestScore    hitLength       queryLength     numMatches
M01380:82:000000000-BJ5K8:1:1101:18601:4608     NC_004743.1     7904    71824   71824   283     283     10
M01380:82:000000000-BJ5K8:1:1101:18601:4608     NC_021757.1     111304  71824   71824   283     283     10
M01380:82:000000000-BJ5K8:1:1101:18601:4608     KC820796.1      111304  71824   71824   283     283     10
M01380:82:000000000-BJ5K8:1:1101:18601:4608     KX276660.1      111304  71824   71824   283     283     10
M01380:82:000000000-BJ5K8:1:1101:18601:4608     KF668288.1      1421439 71824   71824   283     283     10
M01380:82:000000000-BJ5K8:1:1101:18601:4608     KX276659.1      111304  71824   71824   283     283     10
M01380:82:000000000-BJ5K8:1:1101:18601:4608     KF150287.1      111304  71824   71824   283     283     10
M01380:82:000000000-BJ5K8:1:1101:18601:4608     KC905169.1      111304  71824   71824   283     283     10
M01380:82:000000000-BJ5K8:1:1101:18601:4608     AF404840.1      111304  71824   71824   283     283     10
M01380:82:000000000-BJ5K8:1:1101:18601:4608     AB042837.1      7904    71824   71824   283     283     10

But, when I try the same command excluding taxon 1421439 I get the following:

readID  seqID   taxID   score   2ndBestScore    hitLength       queryLength     numMatches
M01380:82:000000000-BJ5K8:1:1101:18601:4608     KC905169.1      111304  71824   71824   283     283     3
M01380:82:000000000-BJ5K8:1:1101:18601:4608     AF404840.1      111304  71824   71824   283     283     3
M01380:82:000000000-BJ5K8:1:1101:18601:4608     NC_021757.1     111304  71824   71824   283     283     3

And when I try excluding only taxon 111304:

readID  seqID   taxID   score   2ndBestScore    hitLength       queryLength     numMatches
M01380:82:000000000-BJ5K8:1:1101:18601:4608     unclassified    0       0       0       0       283     1

Is this demonstrating expected behavior?

This is with 1.0.4 beta, 64-bit Compiler: gcc version 4.8.5 20150623 (Red Hat 4.8.5-4) (GCC) Options: -O3 -m64 -msse2 -funroll-loops -g3 -std=c++11 -DPOPCNT_CAPABILITY Sizeof {int, long, long long, void*, size_t, off_t}: {4, 8, 8, 8, 8, 8}

I'd also like to ask if a description of the fields in the metrics file is available somewhere.

Thanks for any insights you can provide.

mourisl commented 5 years ago

In your case of 111304, the hits are the full reads, in other words, there is no hit from Centrifuge onto other taxonomy ids. So this case is slightly different from the "A,B" example, where Centrifuge finds two different hits. In Centrifuge, the searching hits stage does not exclude any genome even with --exclude-taxids options. It then filters the hits that matched in the node or children of the tax ids from --exclude-taxids. Does this help?

kweitemier commented 5 years ago

Thanks for the quick response! It looks as though there are full read hits to taxa 7904, 111304, and 1421439 that all receive the same score. If I exclude one of these taxa, shouldn't all the hits to the others still be reported?

For example, when I exclude 1421439 I should still see seven hits to 111304 and two hits to 7904. Similarly, when I exclude 111304 I should still see the hits to 1421439 and 7904, correct?

(Note that these three taxa are at the same level in the taxonomy tree, the parent node for each of them is taxon 7901.)

mourisl commented 5 years ago

You are right. What you found is indeed a bug in Centrifuge! I've uploaded the new file in github which should fix this issue. Thanks for letting us know about this.