DaehwanKimLab / centrifuge

Classifier for metagenomic sequences
GNU General Public License v3.0
235 stars, 73 forks

cutoff of score in result file #207

Open fancyge opened 3 years ago

fancyge commented 3 years ago

Hi, I notice in my results that the range of scores is enormous, from 0 to 30000. I see that this score (the 4th column in the result file) is the weighted sum of hits, but I still don't quite understand it. I see that an assignment is unclassified when the score is 0, but what about other values, and how can I trust them? What is a reasonable score cutoff above which an assignment can be trusted? Thanks a lot!

mourisl commented 3 years ago

What is your read length? If the reads are long, the score could reach 30000. For example, for a 200bp read, if everything hits perfectly, the score could be (200-15)^2=34225. In our own experiments, we found 128 or 256 to be good cutoffs.
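To make the arithmetic and the cutoff concrete, here is a minimal Python sketch. It assumes the result file is tab-delimited with one header line and the score in the 4th column, as described above; the function names, the sample rows, and the taxID values are made up for illustration and are not part of Centrifuge.

```python
import csv
import io


def max_score(read_length, offset=15):
    """Score of a single perfect hit, using the (length - 15)^2 formula quoted above."""
    return max(0, read_length - offset) ** 2


def filter_by_score(lines, cutoff, score_column=3):
    """Keep rows whose score (4th column, index 3) is at least `cutoff`.

    Assumes tab-delimited input whose first line is a header.
    """
    reader = csv.reader(lines, delimiter="\t")
    header = next(reader)
    kept = [row for row in reader if int(row[score_column]) >= cutoff]
    return header, kept


# Invented sample rows; only the position of the score column comes from the thread.
sample = (
    "readID\tseqID\ttaxID\tscore\n"
    "r1\tseqA\t562\t18225\n"
    "r2\tseqB\t1280\t64\n"
    "r3\tunclassified\t0\t0\n"
)

header, kept = filter_by_score(io.StringIO(sample), cutoff=128)
print(max_score(200))            # 34225
print([row[0] for row in kept])  # ['r1']
```

With a cutoff of 128, only the read with a near-perfect hit survives; 256 would simply be a stricter setting of the same filter.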

fancyge commented 3 years ago

Thanks a lot for the information. My read length is 150bp. I actually found much larger scores, definitely larger than (150-15)^2. Is that normal? Sorry, my file is too big for a thorough scan at the moment. So, as I understand it, the score takes into account both the matching rate and the matching length, and assignments with lower values cannot be trusted.

Can you also give me some insight into the "abundance" in the summary report? I'm curious about some entries that have the value 0.0. What does that mean? Thanks a lot.

mourisl commented 3 years ago

Larger scores can be achieved with mate pairs; the maximal score should be 36450, i.e. 2 × (150-15)^2 for a pair of 150bp reads.
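The 36450 figure is consistent with both mates of a 150bp pair hitting perfectly and their scores being added. A quick check in Python, where `max_pair_score` is a hypothetical helper name and the summing of mate scores is an assumption inferred from the number itself:

```python
def max_pair_score(read_length, offset=15, mates=2):
    """Upper bound on the score when every mate hits perfectly.

    Assumes each mate contributes (read_length - offset)^2 and the
    contributions are summed.
    """
    return mates * (read_length - offset) ** 2


print(max_pair_score(150, mates=1))  # 18225, single 150bp read
print(max_pair_score(150))           # 36450, pair of 150bp reads
```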

Since some reads can be assigned to multiple species equally well, Centrifuge uses an EM algorithm to compute the abundance of each species. If you find a value of 0, it may just be that the reads assigned to this species all come from multiple hits.
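The effect can be seen in a toy EM loop. This is only an illustrative sketch, not Centrifuge's actual implementation (which may, for example, account for genome length); the function name `em_abundance` and the read data are invented. Species B receives reads only through hits it shares with species A, while A also has uniquely assigned reads.

```python
def em_abundance(read_hits, n_iter=200):
    """Toy EM: split each multi-hit read among its candidate species
    in proportion to the current abundance estimates, then renormalize."""
    species = sorted({s for hits in read_hits for s in hits})
    abundance = {s: 1.0 / len(species) for s in species}
    for _ in range(n_iter):
        counts = {s: 0.0 for s in species}
        for hits in read_hits:
            total = sum(abundance[s] for s in hits)
            if total == 0:
                continue  # no candidate has any remaining abundance
            for s in hits:
                # E-step: fractional assignment of this read
                counts[s] += abundance[s] / total
        # M-step: abundances are the normalized fractional read counts
        n = sum(counts.values())
        abundance = {s: c / n for s, c in counts.items()}
    return abundance


# 8 reads hit only species A; 2 reads hit A and B equally well.
reads = [("A",)] * 8 + [("A", "B")] * 2
result = em_abundance(reads)
print({s: round(a, 4) for s, a in result.items()})
```

In this toy setup B's estimate shrinks on every iteration and ends up indistinguishable from 0, even though two reads list B as a candidate.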

fancyge commented 3 years ago

Thank you! So if a species shows abundance 0, how confident can one be in the presence of that species?

Is it proper to conclude from the summary table that all the species listed are present, based on those read hits? Or is there a confidence level for such a conclusion? Thanks.

