Automatic threshold determination

ckuenne commented 4 years ago

Hi,

is there a systematic way to find meaningful cutoffs for -s/-w? How do you decide on these?

Best, Carsten

VJalili commented 4 years ago

That's a good question!

The choice of stringent (-s) and weak (-w) thresholds are subject to your experiment setup and peak calling pipeline.

You may consider this example for choosing thresholds: Based on your experiment setup, you may decide to use, e.g., 1e-8 as the p-value threshold on MACS so to minimize the number of false-positives. In order to increase the number of true-positives without the penalty of significantly increasing the number of false-positives, we suggest you use a more permissive threshold for peak calling (e.g., 1e-4), and then apply mspc to rescue weak but reproducible regions. Specifically, using 1e-4 on MACS will increase the number of true-positives with the penalty of more false-positives, then you can run MSPC on the replicates with -w 1e-4 and -s 1e-8, which will rescue the reproducible weak binding sites (i.e., increase the number of true-positives without increasing the number of false-positives). Additionally, in this setup, you may choose -g 1e-6 for a more permissive process; by default -g (combined stringency) is set to be equal to -s.

You may need to choose different thresholds if you use a different peak caller or different experiment setup.

We are interested in developing a method that can suggest values for these thresholds based on given input. Do you have any particular method in mind?

ping @marziacremona @fernandoPalluzzi

ckuenne commented 4 years ago

Not really. I'm working with multiple setups including ChIP-Seq (histones, tf), ATAC-Seq, RIP-Seq, and so on, which result in different peak sizes, distributions, and signal-to-noise ratio. Peak calling is done with MACS2, EPIC2, and MUSIC. If you plot these peaks after sorting for ascending -log10(fdr) you get something like this (example from using MACS2 with q-value 1e-4 on ATAC-Seq resulting in ~80k peaks).

A generalized algorithm would have to dynamically find lower/upper bounds here, which I have no good idea about. You could look for the slope, i.e. where it changes from linear to exponential. Or area under the curve, but that basically amounts to percentiles, which seems kinda crude.

BTW: in my experience the "extreme" peaks are often artefacts that result from erroneously calling multiple peaks as one, or are located in weird regions that should really be blacklisted. I know you can't fix issues with the peak calling itself, this should be done before your software is run. Just saying that the super peaks are often fishy, despite being highly significant and reproducible.

Edit: I just included your two suggested thresholds.

VJalili commented 4 years ago

This is a good suggestion, we can give it a try. We can work on it together if you're interested.

Regarding the second point: Yes, agree. We discussed excluding black listed regions (https://github.com/Genometric/MSPC/issues/85), probably we can prioritize it.

marziacremona commented 4 years ago

The automatic thresholding would be indeed a cool improvement for MSPC, but it’s very hard to do since it depends a lot on the particular experiment and on the peak caller used. It’s no surprise that peak callers don’t do that, but they just propose a default fixed threshold… so I don’t think this should be a priority.

Regarding the extreme peaks, excluding ENCODE blacklisted regions as a first MSPC step would be certainty useful, and not complex to do.

ckuenne commented 4 years ago

I agree that it's probably impossible to automatically find the perfect threshold for every experimental variation automatically. But you might be able to identify certain types of distributions that behave similarly. And an imperfect auto threshold option would already be a great help. Just something to start off from a not totally arbitrary point.

Considering the ENCODE blacklisted regions: these were already removed from my example peaks. I guess that most people do this as part of the normal peak calling. But it sure can't hurt to have the option here.

VJalili commented 4 years ago

@ckuenne I agree suggesting a value for these thresholds is important. I think it would be a very useful addition; would you be interested leading its method development and implementation? I can help you as much needed.

Some questions probably we can start with:

Can a value be suggested for these thresholds based on the info in the input BED files (e.g., p-value)? I guess we'd prefer doing our best to avoid asking users for any additional files/information.
I guess we want to suggest values for the weak (-w) and stringent (-s) thresholds; it may also worth suggesting a value for the optional combined stringency (-g) threshold .

ckuenne commented 4 years ago

Well, if I had a viable idea for this I would probably just have written a small wrapper and that's that. Sadly, I don't have any. And this is not really something i can devote myself to right now. Sorry.

VJalili commented 4 years ago

Thanks for the suggestion anyway :) I'd keep issue open for a while in case anyone comes up a method.

Genometric / MSPC

Automatic threshold determination #126