FredHutch / SEACR

SEACR: Sparse Enrichment Analysis for CUT&RUN
GNU General Public License v2.0

Sensitive parameter #11

Open abhijitcbio opened 5 years ago

abhijitcbio commented 5 years ago

Hi, I am trying to call peaks using SEACR using the following parameter settings

./SEACR_1.1.sh treatment.bdg control.bdg norm relaxed out

but I am getting very few peaks for some comparisons. I was trying to debug the reason and looked at your SEACR_1.1.R script file. I found that if I change the value from 0.9 (3 peaks) to 1 (~8,000 peaks) in the following code, then I get peaks:

ctrlvalue<-sort(ctrlvec)[as.integer(0.9*length(ctrlvec))] ## Added 7/15/19 to improve memory performance
expvalue<-sort(expvec)[as.integer(0.9*length(expvec))] ## Added 7/15/19 to improve memory performance

I am wondering if it would be fine to use 1 rather than 0.9 here?

mpmeers commented 5 years ago

Hi Abhijit,

The purpose of that particular line is to remove outliers before plotting the density distribution of signal blocks that SEACR uses to normalize control bedgraph values to the target. Specifically, using 0.9 throws out the top 10% of signal block values, since those outlier values at the high end tend to be an order of magnitude or more larger than the core distribution of values. Including the high-end outliers makes the density calculation very coarse (R evaluates the density at a fixed number of points spaced evenly across the range of the data, so a stretched range means fewer points over the core distribution), and the resulting normalization is therefore less accurate.
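To see why the trimming matters, here is a small sketch in Python (SEACR itself is written in R and shell; the data below are simulated and the function name is hypothetical) comparing the spacing of an evenly spaced density-evaluation grid with and without the top 10% of values:

```python
import random

random.seed(0)
# Simulated signal-block totals: a core distribution plus a minority of
# extreme outliers an order of magnitude or more larger (assumed data).
core = [random.gammavariate(2.0, 10.0) for _ in range(9000)]
outliers = [random.gammavariate(2.0, 10.0) * 50 for _ in range(1000)]
values = core + outliers

def density_grid_spacing(vec, frac, n_points=512):
    """Mimic the quantile cutoff: keep values up to the `frac` quantile,
    then measure how coarse an evenly spaced density grid becomes
    (R's density() evaluates the estimate at n evenly spaced points)."""
    cutoff = sorted(vec)[int(frac * len(vec)) - 1]
    kept = [v for v in vec if v <= cutoff]
    return (max(kept) - min(kept)) / (n_points - 1)

# Keeping the outliers stretches the evaluation grid over a far wider
# range, so each grid point covers a much coarser slice of the data.
print("trimmed (0.9):", density_grid_spacing(values, 0.9))
print("full (1.0):  ", density_grid_spacing(values, 1.0))
```

With the outliers retained, the grid spacing is many times wider, which is the coarseness described above.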

To illustrate, in Figure 2B of the SEACR paper there are only 9 Sox2 peaks called in DE cells where Sox2 shouldn't be expressed. When I make the change you suggested above and re-test that data, it results in thousands of peaks being spuriously called for Sox2, likely because the normalization isn't properly calibrated.

Because of what I outlined above, I generally think it isn't a good idea to fully relax the maximum value used for the density calculation. However, it may be possible to change that value within a reasonable range and still improve your results. For instance, when I ran the same Sox2 test after changing 0.9 to 0.99, it still called only a single-digit number of peaks, suggesting that proper normalization still occurs. Maybe you could try changing the value to 0.99 and see if it works for your data? If it does, I can run some more internal tests to determine whether it's worth changing that value in the core code.

Thanks for your note, and let me know how it goes.

Mike

mpmeers commented 5 years ago

Hi Abhijit,

In the most recent release (v1.2), I have changed the thresholding described above so that it automatically selects a maximum threshold that still succeeds in filtering out outlier values, which should satisfy your needs. I'd be interested to see if you recover more peaks using v1.2 over v1.1, so if you manage to test your data please let me know how it goes.

Mike

YichaoOU commented 3 years ago

Hello,

Using v1.3, I'm still getting only around 10 peaks (treatment.bdg control.bdg norm relaxed), while MACS2 with default parameters calls 2,000+ peaks.

Wondering if there is a parameter I can change?

Thanks, Yichao

mpmeers commented 3 years ago

Hi Yichao,

Apologies for the delay in responding. Currently I can't recommend anything, unfortunately, unless your experimental and control bedgraphs are scaled to some measure of externally added spike-in DNA (such as E. coli, as described here), in which case you could try "non" mode instead of "norm" mode, which may call peaks with more success. I'm actively working on the internal normalization utility, though, so you might try again in a subsequent version to see if things have improved.
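The spike-in scaling mentioned above can be sketched as follows. This is a hypothetical illustration, not part of SEACR: the function names and the scaling constant (10,000) are arbitrary choices, and the bedgraph line format is the standard chrom/start/end/value layout that SEACR consumes.

```python
# Hypothetical sketch: derive a scale factor from the number of reads
# aligned to the spike-in genome (e.g. E. coli), then apply it to each
# bedgraph interval before running SEACR in "non" mode.

def spikein_scale_factor(spikein_read_count, constant=10_000):
    """Arbitrary constant divided by spike-in reads; samples with fewer
    spike-in reads (i.e. more target material) get scaled up more."""
    return constant / spikein_read_count

def scale_bedgraph_line(line, factor):
    """Multiply the coverage column of one bedgraph line by `factor`."""
    chrom, start, end, value = line.split()
    return f"{chrom}\t{start}\t{end}\t{float(value) * factor:g}"

factor = spikein_scale_factor(2_500)  # e.g. 2,500 E. coli reads
print(scale_bedgraph_line("chr1\t100\t200\t4", factor))
```

In practice this per-line scaling is typically done when generating the bedgraph (for example via a coverage tool's scaling option) rather than as a post-processing step, but the arithmetic is the same.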

Mike

YichaoOU commented 3 years ago

Thanks! Looking forward to the new version release.