FredHutch / SEACR

SEACR: Sparse Enrichment Analysis for CUT&RUN
GNU General Public License v2.0
102 stars 45 forks source link

SEACR failing with IgG control #43

Open akmorrow13 opened 4 years ago

akmorrow13 commented 4 years ago

Hello,

I am trying to run SEACR with an IgG control. I get the error:

++ awk '{s+=$3-$2; t++}END{print s/(t*10)}' M5aetxQ1x2qc7.auc.threshold.bed
awk: cmd. line:1: fatal: division by zero attempted

M5aetxQ1x2qc7.auc.threshold.bed seems to be empty, which is why the error is thrown.

My command is run as follows: bash SEACR_1.3.sh BamReadyForAnalysis.bdg IgG_BamReadyForAnalysis.bdg non stringent output/SEACR

Is there any ideas for why the file is empty? I ran the command without the IgG control and cutoff of 0.01, and this worked fine.

Any help would be appreciated!

mpmeers commented 4 years ago

Hi,

The "auc.threshold.bed" file results from applying the global threshold assigned by SEACR to the concatenated bedgraph (with suffix "auc.bed"). If the threshold is set high enough that no peaks meet it, it's possible that an empty file would be returned. This may be the result of using "non" mode, in which the IgG bedgraph file is not normalized to the target bedgraph file; if the IgG signal is uniformly high enough it would cause a very high threshold to be returned, and perhaps result in an empty "auc.threshold.bed" file. Have you considered using "norm" mode, wherein the IgG should be scaled to the target and therefore an empty file should not occur?

Mike

akmorrow13 commented 4 years ago

Thank you for your response, @mpmeers! The problem was actually related to my bam file not being sorted, thus bamtobed was not returning the right bed file, and there were no peaks found (thus the failure). Upon sorting the bam, it does not fail. It may be good to throw some error when this occurs stating that no peaks were found.

mpmeers commented 4 years ago

Hi,

Ah that makes sense--thanks for looking into that, and you're right that a flag there would be a good idea; I'll implement it in my next update. Thank you for the feedback.

Mike

akmorrow13 commented 4 years ago

Just a follow up, upon running with an IgG control, many of my samples have 0 peaks (for some samples, we go from 60,000 peaks with a 0.01 threshold without an IgG control to 0 relaxed peaks using a control). Is this expected? I would still expect to see at least some peaks with the control.

mpmeers commented 4 years ago

Hi,

This is possible since using a numeric threshold simply takes the top user-defined percentile of signal blocks in the target file regardless of the likelihood that it is background or not, whereas when IgG info is incorporated many of those called "peaks" may be considered below the background threshold. SEACR is quite stringent in this regard even in "relaxed" mode, so in particular if the IgG file exhibits a lot of background it may be difficult to get lots of peaks called.

It sounds like there are some obvious peaks or expected "true positives" that are not being called? It's possible that there's an issue with using "non" mode that I alluded to in the first response, so trying "norm" mode could help if the IgG signal level is high (although in my experience the opposite is often the case). Alternatively, testing out some more stringent numeric thresholds (e.g. 0.005-0.0001) might capture likely true positives while filtering out most of what is likely to be background. Finally, if the background issue is intractable, it might be worth testing a peak caller such as Macs2 that uses local background calibration, and therefore is less stringent and liable to call more peaks.

Mike

akmorrow13 commented 4 years ago

Thanks for the explanation @mpmeers. I tried running with both norm and non mode with the IgG control, and got the same number of peaks for each run. Below is an example figure showing the results for MACS2 and SEACR with multiple configurations. I might expect the peak on the left in the red box to be called with the background, as it does look significant. However, when using the IgG control this peak is not called.

Screen Shot 2020-09-30 at 12 13 43 PM

In summary, I find the following when comparing MACS2 and SEACR:

  1. SEACR can better identify the exact location of histone CUT&RUN peaks. MACS2 does not do a good job identifying the entire peak, even in broad mode.
  2. The SEACR background model may be too simplistic to identify peaks in which there is subtle IgG peaks.
  3. Without a control, SEACR identifies many spurious peaks and leaves it up to the user to define a threshold that must be tuned for each sample.

It may be that point (2) is a product of not having good samples, however. Are the bam files you used in the SEACR publication available anywhere so that I can re-run this analysis to compare results to my current samples?

mpmeers commented 4 years ago

Hi,

Most of the data we used in the SEACR paper are available here. I've updated SEACR a few times since those analyses were run, however, so there may be some minor differences.

I think your summary is generally accurate. On point 2, it's unclear to me whether the peak on the left is not called because it overlaps a small IgG peak or rather because it doesn't meet the threshold. The former is a decision I made for the default behavior when designing SEACR, in the interest of being as selective as possible with the peaks that are returned, so if that's the case it unfortunately can't be remedied unless I rewrite some of the code to avoid throwing out those peaks. If it's the latter instead, it could be an issue of data quality, but I have also found that there are a few edge cases where the peak calling threshold gets miscalibrated even with decent data, particularly when the IgG and target read depths are very different. If there are clearly hundreds of peaks visible that don't get called, I'd be happy to investigate if you wanted to send me a sample of your data privately.

Mike