SEACR missing lots of obvious peaks

Lillianwu0314 commented 2 years ago

Hi all,

I ran CUT&RUN experiments with 200.000 cells per experiment and H3K27me3 (CST #9733 antibody) being my positive control and IgG being my negative control.
These were all sequenced at 150x150bp with 10-20 million reads each.
I aligned the reads to the new CHM13-T2T genome with Bowtie2, checked the bam duplication rate and they were all <50% so I did not remove duplicates.
I generated bedgraphs from the bam files with bedtools genomecov (I did not normalise with my spike-in as it's not required for my set of experiments).
I then called peaks using SEACR with IgG as the background and in 'norm' mode.
Looking at the peaks called, it seems like SEACR fails to call peaks in many sites where there is obvious enrichment in H3K27me3.

IGV H3K27me3 HOXC cluster Here's an IGV screenshot of the HOXC gene cluster. Blue: H3K27me3 (top) and IgG (bottom) bedgraphs Green: Peaks called with MACS2 using different p and q-value cut-offs (I used my bam files as input and -f BAMPE) Red: Peaks called with SEACR on relaxed (top) and stringent (bottom) mode

What could possibly be causing this? I'm double checking all my scripts in case I've made a silly mistake.

Thanks for taking the time to read this. Lillian

Lillianwu0314 commented 2 years ago

Hi all,

Adding to this, I also did the same analysis with reads aligned to the hg38 assembly. The peaks called with SEACR make a lot more sense. Here's a screenshot of the HOXC gene cluster with peaks called by SEACR relaxed (top) and stringent (bottom) mode. IGV H3K27me3 HOXC cluster hg38

Thanks, Lillian

mpmeers commented 2 years ago

Hello,

It's possible that the T2T assembly is going to cause there to be some differences in the global background estimation, since large pileups at difficult-to-map repeated regions might be differently represented now that there are presumably many more available reference sites at which ambiguous reads could map. My understanding is that the bowtie2 default (in contrast with -k or -a mode) is to assign the read semi-randomly to a single map location when there are multiple possible map locations with the same quality score (described in more detail here). Are you using bowtie2 here, and if so is it doing the default map position reporting? Does removing duplicates make a difference? This is something I will probably need to take a closer look at since it could be a generalized behavior with the T2T builds, which are unlike anything we've used before.

Mike

Lillianwu0314 commented 2 years ago

Hi Mike,

Thanks for the quick reply and explanation. I did try removing duplicates from the IgG (duplication rate ~50%) but not H3K27me3 (duplication 13%) and then running SEACR with that. It doesn't seem to improve peak calling unfortunately.

I did use Bowtie2 with the default setting. I'll have a look into -k and -a and see if that's appropriate for what I'm looking into. I've also considered using --non-deterministic so that identical reads don't get aligned to the same places if there are multiple equally good alignments.

Thanks, Lillian

clabanillas commented 6 months ago

Hello,

I've been having the same issue for a while now. I also run on T2T aligned reads. Many peaks are missed and peak calling is not uniform between replicates:

Has there been an update on this? Thanks a lot, Claudia

FredHutch / SEACR

SEACR missing lots of obvious peaks #87