Reads excluded due to too may hits?

kweitemier commented 5 years ago

Hello,

Could you give a little more detail into the behavior of reads being dropped as described in issue #94?

I have the following read:

@M01380:82:000000000-BJ5K8:1:1101:23618:1981
ATGAAGAACGCAGCGAAATGCGATACGTAATGTGAATTGCAAACCTTTGTGAATCATTAAATCTTTGAACGCACATTGCACTCGTAAAAGAGTATACATGTTTGAGAATTATAAAAATACATTGTC
+
GGGGGGGGGGDFGGGGGGGGGGGGGGGGGGFGGGGGFGGGFCFGGFGGGGGGGGFEFGGGFFFGGFFEFFGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFFGGGGFGGGGGGGGGGGGGGGGGG

This read is a perfect match to several reference sequences in the custom database I'm using, though it only matches one species (taxid 109871) and a strain within that species (taxid 403673).

When running this sequence with -k 1 the read appears in the classification file as unclassified, and does not appear in the report.

readID  seqID   taxID   score   2ndBestScore    hitLength       queryLength     numMatches
M01380:82:000000000-BJ5K8:1:1101:23618:1981     unclassified    0       0       0       0       126     1

When I set k to be greater than (I think) the number of matching sequences in the database, -k 250, the classifications appear in the classification file, but the report file is still empty.

readID  seqID   taxID   score   2ndBestScore    hitLength       queryLength     numMatches
M01380:82:000000000-BJ5K8:1:1101:23618:1981     FJ232011.1      109871  12321   12321   126     126     215
M01380:82:000000000-BJ5K8:1:1101:23618:1981     JN870755.1      109871  12321   12321   126     126     215
M01380:82:000000000-BJ5K8:1:1101:23618:1981     AATT01000344.1  403673  12321   12321   126     126     215
M01380:82:000000000-BJ5K8:1:1101:23618:1981     JN870760.1      109871  12321   12321   126     126     215
M01380:82:000000000-BJ5K8:1:1101:23618:1981     JQ582934.1      109871  12321   12321   126     126     215
.
.
.
(About 200 lines, but with only those two taxa included)

Is this likely a case of a read hitting "too many places" as described in #94? Is this because the read is hitting too many database sequences, too many taxa within the database, too many regions within a single reference sequence, or something else? At which step is Centrifuge flagging a sequence as being over represented, during database construction or during classification? Does the value of k determine whether the read is excluded? I have other reads that hit multiple species that are properly retained and classified to the most recent common ancestor even when k=1.

Is there any way to control or alter this behavior? Ideally, I'd like to have a read classified to the most recent common ancestor of all the hits for that read, even if there are many hits.

Thank you for the help!

mourisl commented 5 years ago

Yes, it seems this is due to the read hit too many places. In Centrifuge, a hit is part of the read that Centrifuge found which is exact matched to some sequence in the database. And the sum of squared adjusted hit length on a tax id is the score for that tax id. If a hit hits too many places (proportional to k), it would consume too many computational resource to find which tax id those hits matched to. In other words, even if a hit might be excluded due to the value of k, the read might still be classified because it can have other cleaner hits. In your case, the whole read is a hit, so it get excluded. In your case, maybe you can run Centrifuge with large k, and write your script to promote them to lowest ancestor. If you are only interested in the species found not the reads, you can use centrifuge-kreport, that count the reads matched to the taxonomy nodes by converting the assignment to lowest common ancestor. We will work on a script for the promoting the multi-assignment to lowest ancestor from the classification result in near future. Does this help?

kweitemier commented 5 years ago

Yes, thanks for the help!

mourisl commented 5 years ago

I just updated the centrifuge-promote script in the repository, so it can merge the assignments into lowest common ancestor.

kweitemier commented 5 years ago

Thanks for the work! I'm a little confused about what is being merged now that wasn't before. I thought it was already true that if k < (# of taxa hit) then there would be merging. Is this a change to the classification file or the report-file?

mourisl commented 5 years ago

centrifuge-promote is a post-processing script for classification file, which is independent and not used in the main centrifuge program. The new option in centrifuge-promote will report 1 assignment (to lca) per read.

kweitemier commented 5 years ago

Ah, I see, great. Is there a way to create a report file from a classification file?

mourisl commented 5 years ago

There is on a centrifuge-kreport to convert the classification file to kraken-style report. entrifuge's report file has the abundance information inferred by EM algorithm, and it is more difficult to compute from post-processing since we lost some internal information when running Centrifuge.

DaehwanKimLab / centrifuge

Reads excluded due to too may hits? #149