Open kweitemier opened 4 years ago
Centrifuge searches for pieces greedily. After finding a piece, it looks for the next non-overlapping piece. As a result, Centrifuge can miss a shorter true seed when the previous piece extends far enough to cover part of that seed. This could explain why "HM856634.1" gets its score from only one piece.
The same thing happens with "FJ809752.1", where the last piece appears to be only partly hit, as a 38 bp match, in Centrifuge.
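To make the greedy behaviour described above concrete, here is a toy sketch in Python. It is not Centrifuge's actual implementation: the function name `trim_to_nonoverlap` and the read coordinates are hypothetical, and the model simply assumes that each piece loses whatever part overlaps the previously accepted piece and is dropped if the remainder is shorter than the minimum hit length (22 bp, the default mentioned in this thread).

```python
def trim_to_nonoverlap(pieces, min_hitlen=22):
    """Toy model of greedy, non-overlapping piece selection.

    pieces: (start, end) half-open intervals on the read.
    Each piece is trimmed so it starts no earlier than the end of the
    previously accepted piece; remainders shorter than min_hitlen are dropped.
    """
    kept = []
    cursor = 0  # end of the last accepted piece
    for start, end in sorted(pieces):
        start = max(start, cursor)  # discard the part already covered
        if end - start >= min_hitlen:
            kept.append((start, end))
            cursor = end
    return kept


# Hypothetical coordinates: a 55 bp seed that overlaps the previous
# piece by 17 bp survives only as a 38 bp piece.
print(trim_to_nonoverlap([(0, 40), (23, 78)]))

# Hypothetical coordinates: a 22 bp seed that overlaps the previous
# piece by 5 bp falls below the minimum length and is lost entirely.
print(trim_to_nonoverlap([(0, 40), (35, 57)]))
```

Under these assumptions, a seed that is exactly 22 bp long disappears as soon as any part of it is covered by an earlier piece, while a longer seed is merely shortened.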
Thanks for the response! Is there any setting I can change to mitigate this, either when running centrifuge or building the index?
Has there been any progress in mitigating this when building index or running centrifuge?
Hello,
I've been looking through some classifications and am finding some that I think are incorrect, or at least I'm not understanding the scoring correctly.
I'm trying to classify the following read:
Centrifuge produces the following (with -k 10):
These are all hits to the genus Margaritifera. However, when BLASTing this sequence, the top hit in NCBI is accession FJ809752.1 to the organism Venustaconcha ellipsiformis (which, incidentally, makes more sense for this sample).
BLAST gives the following for the Venustaconcha ellipsiformis accession FJ809752.1:
And the following for a hit to HM856634.1, which appears in the Margaritifera list above:
For the exact matches to accession HM856634.1 there is a 22 bp piece and a 43 bp piece. My understanding is that this should be calculated as ((22-15)^2) + ((43-15)^2) = 833, but Centrifuge reports the score as 784 (which equals (43-15)^2).
For the exact matches to accession FJ809752.1 there is a 46 bp piece and a 55 bp piece (there are also 18 and 15 bp pieces, but my understanding is the default for --min-hitlen drops anything under 22 bp). So, the scoring for this should be ((46-15)^2) + ((55-15)^2) = 2561.
I'm able to exclude the Margaritifera taxids and recover a hit to Venustaconcha ellipsiformis, but Centrifuge reports the score as only 529 (I can't determine where this comes from).
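The arithmetic above can be checked with a short Python sketch. It assumes the scoring rule as described in this thread, namely the sum of (length - 15)^2 over pieces of at least 22 bp; the function name `hit_score` is hypothetical and this is not Centrifuge's own code.

```python
def hit_score(piece_lengths, min_hitlen=22, offset=15):
    """Sum of (length - offset)^2 over pieces at least min_hitlen long.

    This follows the formula as described in this thread; the parameter
    names are illustrative assumptions.
    """
    return sum((n - offset) ** 2 for n in piece_lengths if n >= min_hitlen)


print(hit_score([22, 43]))          # expected for HM856634.1: 833
print(hit_score([43]))              # the 43 bp piece alone: 784
print(hit_score([46, 55, 18, 15]))  # expected for FJ809752.1: 2561
print(hit_score([38]))              # a single 38 bp piece: 529
```

The last line shows that a single 38 bp piece gives (38-15)^2 = 529, which matches the score reported for FJ809752.1 and the 38 bp partial hit mentioned in the reply.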
Am I misunderstanding how the Centrifuge scoring works?
Thanks for the help!
I'm using Centrifuge 1.0.4-beta, 64-bit: