Guidance on confidence score

Midnighter commented 4 years ago

In the manual section on the confidence score, it is stated that

we have developed a simple scoring scheme that has yielded good results for us

but neither in that section nor in the publication do you discuss your experience. Given that you probably have the most experience with this score, are you able to provide some general guidance? I know that advice like that is a double-edged sword but I believe your insight is still helpful.

This also relates to #77 where @ktmeaton asked for the output of the score so she can form her own opinion.

AGalanis97 commented 4 years ago

Hi @Midnighter ,

for my current project I used simulations to decide on which classification I should choose (shotgun metagenomics). I used the nt database (May 8th 2020).

Turns out that kraken2 was quite good! Then, I created simulated datasets containing 65 species (genomic DNA, mitochondrial, plastid, etc; so that it resembles a shotgun metagenome related to my project) and began an investigation on the confidence scores. My goal was to get species-level resolution (and not genus or family). I evaluated all confidence scores and found out that if no confidence score is used, in my simulated dataset that contains 65 species, kraken2 identifies over 2000. At the confidence score of 0.5 I would get about 80 species and only at above 0.7 the number of species would approximate 65. However, as you increase your confidence score the sensitivity decreases and that's why I decided to use a confidence score of 0.5. The situation was similar at the genus level, but there, if I remember well, a confidence level of 0.3 was already good enough.

However, I'd like to note that at the confidence score of 0 approximately 75% of reads would be classified, while this number dropped to about 20-25% at the 0.5 score.

I would strongly advise you do something similar to identify the best confidence score for your data (and database you are going to use). If you need any help i'd be more than happy to help you out!

Kind regards, Anastasios

Midnighter commented 4 years ago

Thank you for your thoughts Anastasios! I am actually in a similar situation where I'm evaluating tools for the estimation of abundances for human gut (stool) metagenomic data. Since kraken and bracken are quite fast, I suppose it is feasible to explore this parameter range.

AGalanis97 commented 4 years ago

Yes, it's easy to evaluate them. I did the evaluation from confidence score 0 to 1 (increments of 0.1). Also make sure you evaluate the results with and without Bracken. In my specific case, Bracken was able to bring the mean % error (or difference from expected abundance) in abundance estimation closer to 0%; however, this seemed to be attributed to Bracken reporting higher abundance for many species in my simulated samples. I calculated the % error as: % error abundance = ([expected abundance] - [observed abundance])/[expected abundance]. For this reason, I didn't include Bracken in my ''pipeline''. In my case, it took approximately 1-2 hours per simulated sample to get the results from all confidence scores.

Ivarz commented 4 years ago

I'm developing a utility tool that reports confidence scores from kraken2's standard output and taxo.k2d database file. Maybe you will find it useful. It works only on Linux and FreeBSD though.

AndrewJWallace commented 4 years ago

My research group has found confidence settings below 0.5 to be problematic.

Below 0.5 we were seeing cases where the read was a poor match to anything, and where different parts of the read matched to various different types of organism, and Kraken would nonetheless classify the read according to the best-matching organism. Manual inspection of these reads tended to consistently show they were not matches, and we upped our confidence settings to 0.5 as a result, which at least has the benefit of requiring a majority-match.

In addition to this 0.5 setting we are tending toward using two additional filtering techniques, though we have not finalised these yet:

Our current standard practice is to ignore entities with 2 reads or less, and be suspicious of entities with 3-9 reads (our samples are typically ~10 million reads in size). Sensible settings for this however are probably heavily dependent on sample size and accuracy of the sequencing platform.
We are considering culling short-reads. We use 150bp Illumina reads, but sometimes seem to get ~40-50bp reads. It seems to be a clear red light when we see an organism appear with reads that are only short reads.

AGalanis97 commented 4 years ago

That's useful @AndrewJWallace! But do you remove even genera or families that have a low number of reads? Or are you referring to species level? Have you also evaluated kraken2+bracken to see whether these "weird" results dissappear?

AndrewJWallace commented 4 years ago

@AGalanis97 my group is usually only interested in the species, subspecies, and strain levels (we are primarily looking for pathogens, and usually identification of a pathogen requires resolving the identification to the species level or better because not all members of a genus or even of a species will be pathogenic), so we haven't done much work at the genus level and none at the family level.

I tried Bracken on our data, but after having high hopes was forced to discard it as unhelpful due to the relatively low number of reads we do for our samples which means we don't always get a species-level-classified read on a species that is in the sample. Bracken then forces all the family and genus level reads down onto the wrong species. Getting 0 reads at the species-level vs 1 read on a species-level can thus make a huge difference to bracken output, as this can drastically change which species Bracken forces all the reads down onto. Perhaps an NFS (edit: New Feature Suggestion) for bracken would be to add 1 to all species counts before running its algorithm, and then subtract them at the end of it, because the difference between 1 read and 0 reads is infinite when it comes to its determining which species to push genus reads down onto. Or it could implement the following algorithm which I've been manually using on some of our data to more correctly determine the probabilities:

The following is the work-around I have been looking at for us as we deal with one problematic organism for which very little of its genome is unique and where most of its reads go to the genus level rather than the species level. A genus G has two species S1 and S2 (with S2 being of particular interest, i.e. a nasty pathogen, and S1 an innocent bystander). The genome of S2 is only just over half the size of S1, but most of it is very similar to parts of S1. When Kraken builds its DB, it puts 90% of the k-mers for S2 into the G classification, as they match k-mers for both S1 and S2, and only 10% of the k-mers for S2 into an S2 classification level. Whereas for S1, half its k-mers are classified at G, and half at S1.

Given this situation, lets say we then run a sample through classifier, and the classifier reports 20 reads at G, 1 at S1, and 0 at S2. Was S2 present? Kraken says 0 reads on S2. Bracken, when invoked, piles the 20 reads from G down to S1 and says 21 reads on S1, 0 on S2. But actually, we know a sample that was only S1, would have half its reads classified S1 and half classified G. So our sample, with a 20:1 ratio of G:S1 is highly unlikely to be S1 alone, as we would expect something closer to 10:10. The sample's actually far more likely to be S2 alone, as 20:0 reads for G:S2 is highly likely given 90% of reads on S2 are classified G. Looking at the ratios of reads on G compared to S1 and S2, and knowing that 90% of S2 reads get classified G whereas only 50% of S1's reads do, suggests the most likely situation is some of both are present. If Bracken was taking these probabilities into account, the correct output of it pulling reads down to the species level would be something like: S1 3 reads, S2 18 reads. But it doesn't do that because it wrongly assumes from the zero reads on S2, that S2 absolutely isn't in the sample, and it doesn't use its knowledge of probabilities properly to estimate the likelihood that S2 was actually in the sample and had all its reads classified at other levels. Maybe this should be a NFS for Bracken.

Midnighter commented 4 years ago

Excuse me @AndrewJWallace, what is NFS in this context?

AndrewJWallace commented 4 years ago

NFS=New Feature Suggestion

jenniferlu717 commented 4 years ago

@AndrewJWallace can you email me with your suggestion? (Jennifer.lu717@gmail.com). It doesn't sound like something that would be difficult to incorporate but the reason we only stick to species that ARE identified by Kraken is because in our own samples, we penalize false positives more than false negatives if that make sense.

But regardless, it does sound like something I could make a flag for, but I would also (if you're amenable) want to work with you to see how it affects the results (negatively or positively in your use case) . Also, this probably isnt the best place to be discussing Bracken improvements (lol)

@Midnighter and @AGalanis97 We don't provide a specific confidence score default as it will vary depending on the sample and on the database. I personally default to 0.1 or 0.05 depending on the sample but that again, is for my own personal use cases. I do agree that doing a quick parameter sweep would be useful as it should not take too much more time but this is something I am constantly thinking about in terms of the best options.

We have been focusing on some other additions/improvements to Kraken2 so setting a confidence score has not been the focus but I hope all of the above discussion has been useful?

I'm going to leave this issue open until I get a chance to study confidence scores more and maybe add a FAQ section about it.

Midnighter commented 4 years ago

I've looked into this some more and based on below plots I would strongly argue for a default confidence value of 0.1 rather than 0.0. Default values get used a lot and in this case 0.0 seems rather harmful. This is for two times two technical replicates of whole-genome shotgun metagenomic sequencing of human feces.

kraken_richness_vs_confidence

Figure 1: The unique number of taxa at a specific rank (here called 'richness') assigned reads by kraken is shown as a function of the confidence parameter. We can see that there is a big drop in richness between confidence 0.0 and 0.1.

bracken_richness_vs_confidence

Figure 2: The unique number of taxa at a specific rank (here called 'richness') assigned reads by bracken is shown as a function of the confidence parameter. Please note that the vertical axis is logarithmic. We can see that there is an even more pronounced drop in richness between confidence 0.0 and 0.1 than it was the case for kraken.

jorondo1 commented 2 years ago

After about this discussion, I tested the 0.1 confidence score on saliva and gut metagenomes. Although the average % of sequenced reads barely dropped (79 to 77% in high depth samples, 92 to 84% in low depth ones), the number of identified taxa drastically dropped (5500 to 606 in low depth, 7300 to 650 in high depth samples).

Is it safe to assume that the greater part of the difference could be attributed to low abundance false positives species?

I'd love to have your input on this parameter and how to interpret its value and justify the choice. Note that I have also used Bracken re-estimations at the species level.

Looking forward to hear your thoughts, cheers !

Midnighter commented 2 years ago

@jorondo1 I mostly use --confidence 0.51 now. My argument being that I'm thus forcing the read to be assigned to the majority winner of assigned k-mers. See https://github.com/DerrickWood/kraken2/blob/master/docs/MANUAL.markdown#confidence-scoring for the details (I'm sure you've read them).

jorondo1 commented 2 years ago

@Midnighter I had read it but honestly did not realise that's what it meant. This is great. Thanks a lot !

AGalanis97 commented 2 years ago

Hi @jorondo1 ,

regarding your question about false positives, in my experience yes it's low abundance false positives or closely related species. If you want you can check supplementary figures 4, 5, and 6 (https://onlinelibrary.wiley.com/doi/full/10.1111/1755-0998.13626) with some simulations I had done for a previous project. That's why I recommended people do the same with their expected species (if you know what it's there...). If you need help with making some simulated samples, I'm happy to give tips and advice :)

Kind regards, Anastasios

fconstancias commented 1 year ago

Excellent paper @AGalanis97

svillegasv commented 1 year ago

I just found this paper which gives some practical advise on the matter. I hope you find it useful. https://www.microbiologyresearch.org/content/journal/mgen/10.1099/mgen.0.000949?crawler=true

jorondo1 commented 1 year ago

I appreciate thanks! Has anyone seen a comparison with Sourmash ??

hermidalc commented 2 months ago

If you see in the Kraken2 FAQ https://github.com/DerrickWood/kraken2/wiki/FAQ#confidence-score

The authors of Kraken2 use the default confidence score of 0 for our own projects, preferring to use the minimizer score for elimination of potential contamination.

Wondering how do they use minimizer score to filter false positives instead of the confidence score?

DerrickWood / kraken2

Guidance on confidence score #265