66% threshold rationale

Ge0rges commented 8 months ago

After looking over the preprint, I was left wondering how the 66% threshold was picked? My understanding is that modkit calculates and picks a suitable threshold on it's own, so I'm curious to know why this decision was made.

alexcritschristoph commented 8 months ago

This is a good question without a perfect answer. Note that by default the threshold for calling a site as modified is 66%, but the default for passing that site to STREME for motif calling is 90%. This allows us to identify motifs on just the highest quality sites (which should have a higher signal:noise), and then extend those detected motifs to all sites with any reasonable methylation confidence (66%).

I have no formal benchmarks of this, or for the number of reads. I just tested out a bunch and landed on something that worked well across multiple strains.

I wanted to use set thresholds for each strain to avoid the modkit threshold selection differing from strain to strain in my data. If you wanted, you could actually run modkit and then pass the chosen threshold to either percent_methylation_cutoff and/or percent_cutoff_streme.

Ge0rges commented 8 months ago

It would be a useful supplement to present a benchmark with different thresholds, or a perhaps some logic where you let modkit calculate the threshold for different samples and pick the highest one?

alexcritschristoph commented 8 months ago

Yeah, I was thinking a bit more about implementing something like this today but, I think I have decided against it. My thinking:

Fortunately, I have found that motif calling is generally robust to parameters chosen within the 50%-90% range, which modkit generally chooses. So I don't expect this to have an enormous impact.
While I'm pretty sure the modkit automatic cutoff choices are principled, I'm not sure how they interact with edge cases (either weird microbial genomes, low coverage, or something else). I wanted to avoid this possible sample-to-sample source of uncertainty. When running it I did encounter some cases where it seemed to choose a weirdly high cutoff for an individual genome, and it wasn't quite clear to me why....
Benchmarking this is tricky. Changing these cutoffs can sometimes have an effect on the motif sequence called - but how do we compare motifs? Edit distance? Something else? It's hard to think of a metric that completely captures how "right" a motif is. And my dataset is fairly limited to then test it at scale.
In general my advice here to everyone is to try out a few cutoffs and see what happens, comparing the motif output, the intermediate output from STREME, etc. For large systematic analyses of many genomes. Very curious though about cases where modkit's automatic thresholds are very different and strongly (and correctly) change the biological conclusions - I'm not sure yet what these cases may be.

Ge0rges commented 5 months ago

Could you make it an option such that a parameter can be passed to go with mod kit's estimate?

cultivarium / MicrobeMod

66% threshold rationale #3