kweitemier closed this issue 5 years ago
Yes, it seems this is because the read hits too many places. In Centrifuge, a "hit" is a part of the read that exactly matches some sequence in the database, and the sum of the squared adjusted hit lengths for a tax id is the score for that tax id. If a hit matches too many places (more than roughly proportional to k), it would consume too many computational resources to determine which tax ids it belongs to, so it is excluded. Even when a hit is excluded because of the value of k, the read can still be classified if it has other, cleaner hits. In your case, the whole read is one hit, so when it is excluded the read becomes unclassified. As a workaround, you can run Centrifuge with a large k and write your own script to promote the multiple assignments to their lowest common ancestor. If you are only interested in the species found, not the individual reads, you can use centrifuge-kreport, which counts the reads matched to each taxonomy node by converting the assignments to the lowest common ancestor. We will add a script for promoting multi-assignments to the lowest common ancestor to the classification result in the near future. Does this help?
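The promotion the comment above suggests could be scripted along these lines. This is a minimal sketch, not Centrifuge's own method: it assumes only a child-to-parent taxonomy map (which you could parse from NCBI's nodes.dmp, for example), not any particular classification file format.

```python
# Hypothetical sketch: promote a read's multiple taxonomy assignments
# to their lowest common ancestor (LCA), given a child -> parent map.

def ancestors(taxid, parent):
    """Return the path from taxid up to the root, inclusive."""
    path = [taxid]
    while taxid in parent and parent[taxid] != taxid:
        taxid = parent[taxid]
        path.append(taxid)
    return path

def lca(taxids, parent):
    """Lowest common ancestor of a collection of tax ids."""
    paths = [ancestors(t, parent) for t in taxids]
    common = set(paths[0])
    for p in paths[1:]:
        common &= set(p)
    # The first ancestor on any path that all paths share is the LCA.
    for t in paths[0]:
        if t in common:
            return t
    return 1  # fall back to the root

# Toy taxonomy: 403673 (strain) -> 109871 (species) -> 1 (root)
parent = {403673: 109871, 109871: 1, 1: 1}
print(lca([109871, 403673], parent))  # -> 109871
```

For a read assigned to both the species and a strain within it, the promoted assignment is the species, which matches the behavior described in this thread.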
Yes, thanks for the help!
I just updated the centrifuge-promote script in the repository, so it can merge the assignments into the lowest common ancestor.
Thanks for the work! I'm a little confused about what is being merged now that wasn't before. I thought it was already true that if k < (# of taxa hit), then there would be merging. Is this a change to the classification file or to the report file?
centrifuge-promote is a post-processing script for the classification file; it is independent of, and not used by, the main centrifuge program. The new option in centrifuge-promote reports one assignment (the LCA) per read.
Ah, I see, great. Is there a way to create a report file from a classification file?
There is: centrifuge-kreport converts the classification file into a Kraken-style report. Centrifuge's own report file has abundance information inferred by an EM algorithm, and that is more difficult to compute in post-processing, since some internal information is lost once Centrifuge finishes running.
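A rough read-count summary can also be tallied directly from a classification file. This is a simplified sketch, not what centrifuge-kreport does: the column names (readID, seqID, taxID) are assumed from the file's header row, and where centrifuge-kreport first collapses a read's multiple assignments to their LCA, this sketch just counts each read once under its first listed tax id.

```python
# Hypothetical sketch of a Kraken-style tally: count classified reads
# per tax id from a tab-separated classification file. Column names
# (readID, seqID, taxID) are assumed, not guaranteed.
import csv
import io
from collections import Counter

def count_reads_per_taxid(fh):
    counts = Counter()
    seen = set()
    for row in csv.DictReader(fh, delimiter="\t"):
        if row["readID"] in seen:  # count each read only once
            continue
        seen.add(row["readID"])
        counts[row["taxID"]] += 1
    return counts

# Tiny made-up example: read2 has two assignments but is counted once.
sample = io.StringIO(
    "readID\tseqID\ttaxID\tscore\n"
    "read1\tseqA\t109871\t4225\n"
    "read2\tseqB\t403673\t3600\n"
    "read2\tseqC\t109871\t3600\n"
)
print(count_reads_per_taxid(sample))
```

This gives per-taxon read counts only; the EM-based abundance estimates in Centrifuge's report file cannot be recovered this way.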
Hello,
Could you give a little more detail into the behavior of reads being dropped as described in issue #94?
I have the following read:
This read is a perfect match to several reference sequences in the custom database I'm using, though it only matches one species (taxid 109871) and a strain within that species (taxid 403673).
When running this sequence with `-k 1`, the read appears in the classification file as unclassified and does not appear in the report. When I set `k` to be greater than (I think) the number of matching sequences in the database, `-k 250`, the classifications appear in the classification file, but the report file is still empty.

Is this likely a case of a read hitting "too many places" as described in #94? Is this because the read hits too many database sequences, too many taxa within the database, too many regions within a single reference sequence, or something else? At which step does Centrifuge flag a sequence as being over-represented: during database construction or during classification? Does the value of `k` determine whether the read is excluded? I have other reads that hit multiple species and are properly retained and classified to the most recent common ancestor even when `k = 1`.

Is there any way to control or alter this behavior? Ideally, I'd like a read to be classified to the most recent common ancestor of all its hits, even if there are many hits.
Thank you for the help!