DaehwanKimLab / centrifuge

Classifier for metagenomic sequences
GNU General Public License v3.0
250 stars 73 forks source link

Centrifuge incorrect scoring? #187

Open kweitemier opened 4 years ago

kweitemier commented 4 years ago

Hello,

I've been looking through some classifications and am finding some that I think are incorrect, or at least I'm not understanding the scoring correctly.

I'm trying to classify the following read:

@M03023:329:000000000-C94WG:1:1102:15671:1966 1:N:0:MA_R2_1_5:AllSEMussels_16S_200+201 TAGAGTCTGT|0|AAAAAAAAAA|0 200|0|20|
GAACTCCCCGAACTTTTACCCCTATACACAGAATATTTAGGCTACCAGGTTGGTTGGGGCAACCCTGGAACAACTTCAACTTCCAACTTACAAAGTAAAACATGAAAAACTTGATACGGACCACAATTGCAGTTACCCCGGGGATAACAGCGTAATCCAACTTAAGAGTACGC
+
;0FAFFFFFFB:AFGHHBDFCFF/F1F2FGD>1D1DGHB21BFBFFBFBGGG?EG111??<1FGEHGB?1B???GHGB<FGHHGB2FGHG1BB12FBBBBFAGBBB222CGHGFFBHGFF>B2FFGGGB=22FGBGFGGB/2D1FD1C@1DACBAEFFBACFGD2FB0FFBBB

Centrifuge produces the following (with -k 10):

M03023:329:000000000-C94WG:1:1102:15671:1966    AY579085.1      52361   784     784     43      173     7
M03023:329:000000000-C94WG:1:1102:15671:1966    species 47531   784     784     43      173     7
M03023:329:000000000-C94WG:1:1102:15671:1966    HM856634.1      52361   784     784     43      173     7
M03023:329:000000000-C94WG:1:1102:15671:1966    AY579084.1      52361   784     784     43      173     7
M03023:329:000000000-C94WG:1:1102:15671:1966    species 288132  784     784     43      173     7
M03023:329:000000000-C94WG:1:1102:15671:1966    U72545.1        52361   784     784     43      173     7
M03023:329:000000000-C94WG:1:1102:15671:1966    NC_015476.1     52361   784     784     43      173     7

These are all hits to the genus Margaritifera. However, when BLASTing this sequence, the top hit in NCBI is accession FJ809752.1 to the organism Venustaconcha ellipsiformis (which, incidentally, makes more sense for this sample).

BLAST gives the following for the Venustaconcha ellipsiformis accession FJ809752.1:

Query  1     GAACTCCCCGAACTTTTACCCCTATACACAGAATATTTAGGCTA-CCAGGTTGGTTGGGG  59
             |||||||| |||| |||||||||||||||||| |||| | |||| |||||||||||||||
Sbjct  6791  GAACTCCCTGAACCTTTACCCCTATACACAGA-TATTAAAGCTACCCAGGTTGGTTGGGG  6849

Query  60    CAACCCTGGAACAACTTCAACTTCCAACTTACAAAGTAAAACATGAAAAACTTGATACGG  119
             ||||||||||||||||||||||||||||||| | | ||||| ||||||||||||||| ||
Sbjct  6850  CAACCCTGGAACAACTTCAACTTCCAACTTATACAATAAAATATGAAAAACTTGATAAGG  6909

Query  120   ACCACAATTGCAGTTACCCCGGGGATAACAGCGTAATCCAACTTAAGAGTACG  172
             |||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  6910  ACCACAATTGCAGTTACCCCGGGGATAACAGCGTAATCCAACTTAAGAGTACG  6962

And the following for a hit to HM856634.1, which appears in the Margaritifera list above:

Query  47    AGGTTGGTTGGGGCAACCCTGGAACAACTTCAACTTCCA-ACTT-ACAAAGTAAAACATG  104
             || |||||||||||||||||||||| | |  | ||||||   || || ||| || |||| 
Sbjct  6778  AGTTTGGTTGGGGCAACCCTGGAACCAATAAAGCTTCCAGTATTAACTAAGAAACACATC  6837

Query  105   AAAAACTTGATACGGACCACAATTGCAGTTACCCCGGGGATAACAGCGTAATCCAACTTA  164
             |||||| |  | ||||| | ||  | ||||||||||||||||||||||||||||||||||
Sbjct  6838  AAAAACAT-CTTCGGACTA-AAAAGAAGTTACCCCGGGGATAACAGCGTAATCCAACTTA  6895

Query  165   AGAGTACGC  173
             |||||||||
Sbjct  6896  AGAGTACGC  6904

For the exact matches to accession HM856634.1 there is a 22 bp piece and a 43 bp piece. My understanding is that this should be calculated as ((22-15)^2) + ((43-15)^2) = 833, but Centrifuge reports the score as 784 (which equals (43-15)^2).

For the exact matches to accession FJ809752.1 there is a 46 bp piece and a 55 bp piece (there are also 18 and 15 bp pieces, but my understanding is the default for --min-hitlen drops anything under 22bp). So, the scoring for this should be ((46-15)^2) + ((55-15)^2) = 2561.

I'm able to exclude the Margaritifera taxids and recover a hit to Venustaconcha ellipsiformis, but Centrifuge reports the score as only 529 (I can't determine where this comes from).

M03023:329:000000000-C94WG:1:1102:15671:1966    NC_013659.1     301928  529     529     38      173     2
M03023:329:000000000-C94WG:1:1102:15671:1966    FJ809752.1      301928  529     529     38      173     2

Am I misunderstanding how the Centrifuge scoring works?

Thanks for the help!

I'm using Centrifuge 1.0.4-beta, 64-bit:

Sun Oct  7 21:18:18 PDT 2018
Compiler: gcc version 4.8.5 20150623 (Red Hat 4.8.5-4) (GCC)
Options: -O3 -m64 -msse2 -funroll-loops -g3 -std=c++11 -DPOPCNT_CAPABILITY
Sizeof {int, long, long long, void*, size_t, off_t}: {4, 8, 8, 8, 8, 8}
mourisl commented 4 years ago

Centrifuge looks for the pieces greedily. After getting a piece, Centrifuge will look for the next non-overlap pieces. As a result, for some shorter true seed, Centrifuge might miss that because the previous piece might extend further and contain part of the seed. So this could explain why "HM856634.1" gets score from one piece.

This happens to the "FJ809752.1 " as well, where the last piece(may be) is partly hit with 38bp in Centrifuge.

kweitemier commented 4 years ago

Thanks for the response! Is there any setting I can change to mitigate this, either when running centrifuge or building the index?

sanderdebacker commented 3 years ago

Has there been any progress in mitigating this when building index or running centrifuge?