FredHutch / SEACR

SEACR: Sparse Enrichment Analysis for CUT&RUN
GNU General Public License v2.0
102 stars 45 forks source link

Unsure of numerical threshold #70

Open hsriniva11 opened 3 years ago

hsriniva11 commented 3 years ago

Hi, I have CUT&RUN samples but no IgG. I'm not quite sure what numerical threshold to provide. My understanding is if I give the threshold as 0.6, it reports peaks in the top 60% correct? Keeping this in mind, these are the number of peaks I get for varying thresholds. I'm not able to decide on the optimum value. How do you suggest I proceed from here?

   289457 peaks_seacr_dedup_0.1.stringent.bed
   585941 peaks_seacr_dedup_0.2.stringent.bed
   878976 peaks_seacr_dedup_0.3.stringent.bed
  1167514 peaks_seacr_dedup_0.4.stringent.bed
  1456809 peaks_seacr_dedup_0.5.stringent.bed
  1751576 peaks_seacr_dedup_0.6.stringent.bed
  2041184 peaks_seacr_dedup_0.7.stringent.bed
  2311826 peaks_seacr_dedup_0.8.stringent.bed
  2630596 peaks_seacr_dedup_0.9.stringent.bed

Also, I get the same number of peaks in the relaxed cutoff regardless of the numerical threshold

   2928987 peaks_seacr_dedup_0.1.relaxed.bed
   2928987 peaks_seacr_dedup_0.2.relaxed.bed
   2928987 peaks_seacr_dedup_0.3.relaxed.bed
   2928987 peaks_seacr_dedup_0.4.relaxed.bed
   2928987 peaks_seacr_dedup_0.5.relaxed.bed
   2928987 peaks_seacr_dedup_0.6.relaxed.bed
   2928987 peaks_seacr_dedup_0.7.relaxed.bed
   2928987 peaks_seacr_dedup_0.8.relaxed.bed
   2928987 peaks_seacr_dedup_0.9.relaxed.bed

Thanks!

mpmeers commented 3 years ago

Hi,

Apologies for the delay in responding here. You're correct that 0.6 corresponds to the top 60% of "signal blocks", i.e. regions of contiguous signal. In that sense the numeric threshold is a very simplistic approach that really just has to be adjusted based on your intuition of whether the peaks "look" real or not in a genome browser or by some other metric, just as you might do quality spot checking for any peak calling approach. One thing you might try is to plot the fraction of reads in peaks (FRiP, i.e. the percentage of reads mapping within peaks out of total reads in the sample) across different thresholds to observe whether there is an inflection point or "knee" in the curve at which you're simultaneously maximizing the FRiP while minimizing the percentile.

As for the relaxed mode issue, can you specify the exact command you're running? It appears to be a bug but currently I can't replicate the issue in SEACR v1.4 with my test files.

Mike

Naveen-Ahuja commented 1 year ago

Hi,

I am having the same issue when it comes to the relaxed mode where I am getting the same number of peaks no matter how I change the threshold values.

Thanks