FredHutch / SEACR

SEACR: Sparse Enrichment Analysis for CUT&RUN
GNU General Public License v2.0
99 stars 43 forks source link

The huge difference between macs2 and SEACR #44

Open Jonyyqn opened 3 years ago

Jonyyqn commented 3 years ago

Hi, I did a CUT&TAG experiment on transcription factors, which included two biological replicates and an Igg control sample. I first use the default parameters of SEACR for peak calling (stringent mode, with Igg as control). In addition, I use macs2 for peak calling. The parameters are roughly as follows: macs2 callpeak -t target.bed -c Igg.bed -f BED -p 1e-10 --keep-dup all --nomodel --shift 0. The peak calling results of the two tools are very different: The peak numbers obtained from SEACR are 73 and 1. The number of peaks obtained from MACS2 is 244663 and 41345. I would like to ask why there is such a big difference above? Because I read the original article of CUT&TAG used macs2 for peak calling. In addition, is CUT&TAG suitable for experiments on transcription factor target sites? Can you help me resolve my doubts?

Jonyyqn commented 3 years ago

This is my alignment summary image

mpmeers commented 3 years ago

Hi,

In the SEACR paper we used MACS2 with the BEDPE parameter and bed files that represented full paired end fragments, in order to be directly comparable with SEACR that uses bedgraphs derived from the same. If you are using BED as a peak calling parameter and inputting bed files that only represent single-end reads rather than paired fragments, this could have an effect on how comparable it is with SEACR. Also it's unclear to me whether you're using "norm" or "non" mode--I assume "norm", but if not then it's possible there could be an issue with the background depth that prevents peaks from being called. Finally, the statistics you've shown here suggest to me that your samples are vastly oversequenced--your high rate of duplicates may influence global background estimation in a way that suppresses peak calls by SEACR, since duplicated background may be "overcounted".

Independent of those, however, in general SEACR is expected to be much more stringent since it uses a global background estimation rather than the local enrichment approach employed by MACS2. As described in the paper, this results in SEACR being more accurate at the expense of total peaks called, whereas MACS2 is less accurate but calls more peaks. You seem to be getting the extreme ends of both here, and I suspect the high duplicate rate has something to do with it, since that is likely to inflate apparent peaks with a local background approach (MACS2) and eliminate peaks using a global background approach (SEACR). You might try subsampling your fragments to see if your results align more closely. If you send a screenshot of your data and loci that you expect t be called as peaks I might be able to get a sense of how they are performing as well.

Mike

Jonyyqn commented 3 years ago

Hi,

In the SEACR paper we used MACS2 with the BEDPE parameter and bed files that represented full paired end fragments, in order to be directly comparable with SEACR that uses bedgraphs derived from the same. If you are using BED as a peak calling parameter and inputting bed files that only represent single-end reads rather than paired fragments, this could have an effect on how comparable it is with SEACR. Also it's unclear to me whether you're using "norm" or "non" mode--I assume "norm", but if not then it's possible there could be an issue with the background depth that prevents peaks from being called. Finally, the statistics you've shown here suggest to me that your samples are vastly oversequenced--your high rate of duplicates may influence global background estimation in a way that suppresses peak calls by SEACR, since duplicated background may be "overcounted".

Independent of those, however, in general SEACR is expected to be much more stringent since it uses a global background estimation rather than the local enrichment approach employed by MACS2. As described in the paper, this results in SEACR being more accurate at the expense of total peaks called, whereas MACS2 is less accurate but calls more peaks. You seem to be getting the extreme ends of both here, and I suspect the high duplicate rate has something to do with it, since that is likely to inflate apparent peaks with a local background approach (MACS2) and eliminate peaks using a global background approach (SEACR). You might try subsampling your fragments to see if your results align more closely. If you send a screenshot of your data and loci that you expect t be called as peaks I might be able to get a sense of how they are performing as well.

Mike

Thank you for your tips and suggestions. Let me list the parts you mentioned one by one:

  1. When I use SEACR, I added the parameter norm.
  2. As you said, when I use MACS2 callpeak, the input is a single-ended bed file. But I also tried to use the Sam file as input for peak calling, and the result was the same as mentioned above.
  3. You mentioned that there is a problem of high duplicates in my data. We have tested a lot for the sake of saturation. Wouldn't it reduce the amount of data if we sample the data?
  4. The following is the track display diagram of our data.

image

mpmeers commented 3 years ago

Hi,

Based on these tracks it seems like you're getting a lot of background, and therefore I think SEACR is appropriately being quite conservative with its calls--you can see that the only places peaks are called (for Target R1 I presume?) are sites where a high peak is accompanied by an extended signal block. It's important to note that, as described in both Figure 2 of the original CUT&Tag paper and Figure 2 of the updated Nature Protocols paper, we show that only 8 million fragments, and as few as 1 million fragments, are sufficient to profile the genome at high signal-to-noise ratio for histone modifications, and this should be relatively consistent for other DNA binding proteins for which high-fidelity antibodies are used. Based on your duplication rate, I'm afraid you may be combining a relatively low-specificity antibody with PCR duplication of shotgun noise from the background. Can you verify that there are binding targets that you "expect" to see in your sequencing results? In any case, I would recommend troubleshooting the CUT&Tag protocol, or perhaps trying a control antibody such as H3K27me3 that we use routinely to ensure that you're getting specificity. There's a discussion community centered on the CUT&Tag protocols.io page that might be helpful.

Mike