FredHutch / SEACR

SEACR: Sparse Enrichment Analysis for CUT&RUN
GNU General Public License v2.0
99 stars 43 forks source link

big difference in number of stringent and relaxed peaks between without trimming reads and after trimming reads #38

Open pdubey2018 opened 4 years ago

pdubey2018 commented 4 years ago

Hi,

I am new to data analysis, we have been doing CnT for histone modifications and transcription factor. First I used the following alignment parameters: --local --very-sensitive-local --no-unal --no-mixed --no-discordant

then used the following script to generate fragment bedgraph: output

samtools sort -n treatment_sorted.bam -o treatment_sq_sorted.bam
samtools fixmate treatment_sq_sorted.bam treatment_matefixed.bam
samtools view -bf 0x2 treatment_matefixed.bam | bedtools bamtobed -i stdin -bedpe > treatment_properpair.bed
bedtools sort -i treatment_properpair.bed > treatment_properpair_chr_sorted.bed
awk '$6-$2 < 1000 {print $1"\t"$2"\t"$6 }' treatment_properpair_chr_sorted.bed
bedtools genomecov -bg -i ${output}_final_start_end.bed -g chrlen > treatment.fragment.bedgraph

and finally

bash SEACR_1.3.sh treatment.fragment.bedgraph IgG.fragment.bedgraph norm stringent treatment_prefix

With stringent mode, I got 96,431 peaks, relaxed I got 271,594 peaks. I repeated the analysis after trimming my reads -

fastp -h treatment.html -i treatment_1.fq.gz -I treatment_2.fq.gz -o treatment_trim_1.fq.gz -O treatment_trim_2.fq.gz

and aligned with

--end-to-end --very-sensitive --no-unal --no-mixed --no-discordant

. Now I get 1 peak with stringent mode and 0 peak with relaxed.

for some histone modifications, the number of peaks was reduced by 90%.

I am trying to understand what is going on here. Any help is highly appreciated.

Thank you

pdubey2018 commented 4 years ago

Some more details- After trimming my treatment1 file has 24 million paired reads while IgG has 12 million. I am not sure what would explain such a drastic decrease in the number of peaks by SEACR after trimming. I would be happy to provide more details. Any help is highly appreciated. Thank you.

mpmeers commented 4 years ago

Hello,

Am I interpreting this correctly in assuming you are trimming your reads for adapter sequence? If so, what are the length of your reads? I would typically advise against this in all cases in which your reads are 50 bp or shorter since the reads are highly unlikely to read into the adapter sequence. For reference, we find that consistently 1-4% of our paired end fragments map to a size shorter than 50 bp, meaning that there is a high likelihood that "overtrimming" will occur in which reads that don't actually contain adapter sequence are spuriously trimmed and subsequently omitted during mapping due to ambiguity introduced by the trimming. This then has a compound effect on paired-end mapping since the mate pairs of those trimmed reads that would otherwise map uniquely can no longer be mapped into coherent fragments for the purposes of SEACR (where unaligned pairs are thrown out during preprocessing). It's possible that this downstream loss of library complexity due to overtrimming could have led to your result.

I'm not familiar with fastp, but it seems like in addition to adapter trimming it accomplishes quality trimming, which will further reduce the number of read pairs that map concordantly--this would be consistent with your result that IgG has fewer reads remaining since IgG libraries are naturally less complex than target libraries and might be liable to contain more low quality fastq scores. Since SEACR uses a global threshold derived from aggregated bedgraph signal blocks, it's quite sensitive to library complexity, and therefore will tend to be very restrictive in peak calling when the target and IgG libraries are equally non-complex due to either poor experimental quality or significant removal of reads.

How many paired end fragments were omitted in the trimmed vs. untrimmed cases? If a large proportion were thrown out, I would review the criteria for trimming to determine whether validly mapped pairs are being inappropriately thrown out. In general, if we do quality thresholding, we do it at the level of paired-end fragments post-mapping (hence the requirement for uniquely mapping read pairs) rather than individual reads pre-mapping. Let me know if this is helpful at all.

Mike

pdubey2018 commented 4 years ago

Hi Mike,

Thank you for your detailed response. It is really helpful to understand the process. Yes, I wanted to remove adapter sequence, hence did the end to end trimming. Also, my colleague argues that end to end alignment forces stricter control to alignment. Hence wanted to try and see for any improvement in my peaks. To help you better understand the case, see below various metrics before and after trimming:

image

Below is the description of the score column of the final bedgraph file of Igg and treatment, before and after trimming. There is an increase in proper pair aligned reads after trimming - about 1.5 million in Igg and 3 million in Brd4. And overall a little decrease in mean score in bedgraph. Looks like this decrease is enough to so dramatically reduce number of peaks in relaxed and stringent? or something else?

image

thank you again,

Pankaj

mpmeers commented 4 years ago

Hi Pankaj,

I see that your reads are quite long, so trimming is indeed appropriate. For reference, we generally only sequence to 25 bp paired end reads, and unless you are interested in short repeat regions I would argue that that's sufficient, and avoids any issues with adapter trimming.

It also appears that your mapping is actually improving (there's an increase in the proper pairs after trimming), so I was likely incorrect in thinking that many concordant pairs are being thrown out. Does "Fragment number in bedgraph" reflect the length of the bedgraph file? This could indicate that more background regions are being introduced, meaning it's possible that you're seeing the opposite effect I expected, i.e. that background was not properly accounted for in the untrimmed case and now there is background introduced that tips the scales towards a restrictive global threshold for peak calling. I can't be sure of this without seeing the data though--if you examine your data in a genome browser, do the untrimmed bedgraphs look "cleaner" (i.e. fewer background peaks) than the untrimmed ones?

Independent of the reason for the trimmed/untrimmed discrepancy, SEACR can sometimes assign background improperly if the read depth of the IgG is significantly different than the target, which is the case here (2:1 ratio). One remedy to this is to use a spike-in calibration to manually scale the signal between target and IgG and then use SEACR in "non" rather than "norm" mode, so if you added an exogenous DNA spike-in to your CUT&RUN or used a pA-MNase or pAG-MNase that has residual E. coli DNA (i.e. self-purified or obtained from a non-commercial source), you could simply multiply bedgraph signal by a constant divided by the number of reads mapping to the spike-in genome, and try the SEACR non mode.

Let me know if this is helpful at all.

Mike

pdubey2018 commented 4 years ago

Hi Mike, thanks for helping out. Yes, the fragment number is the length of bedgraph file. My Suprise even with a bit increase in noise in the genome, lead to such a drastic reduction in the number of peaks being called. Our IgG is not as clean as we would like, as shown by your lab.

However, I tried normalization with the Ecoli-properly paired reads. Basically divided the 4th columns by no. of reads mapped to E.coli for both control and treatment file. Ecoli_normalization of treatment and control and then calling peaks in stringent mode gives me 3peaks. while prior normalization on bedgraphs from trimmed files give me 2 peaks.

I guess It is the case of high background in IgG. Is it expected that ecoli_reads normalized bedgraph with non setting is drastically different from using IgG for normalization.

Let me know your thoughts.

Have a good weekend.

Pankaj

mpmeers commented 4 years ago

Hi Pankaj,

It sounds like using spike normalization didn't make much of a difference. It is striking that there's such a large discrepancy in peaks called, and unfortunately I think it's because SEACR is highly sensitive to background. I have found that there are some edge cases in which SEACR improperly omits lots of peaks particularly when background is high, and if you'd like to send me a sample of your data (e.g. chr1 only), I can try to look more carefully into whether those specific issues might be occurring in your case. Alternatively, standard ChIP-seq peak callers (e.g. Macs2) are somewhat better suited to call local sites of enrichment from data in which background is unavoidably high, so that might be worth trying. Finally, if there are clear sites of enrichment, it may be worth trying to use SEACR with a user-assigned threshold rather than using the IgG control file. You can see how to call peaks that way in the "examples" section of the README.

Mike

RanjanaJambu9 commented 3 years ago

Hi Mike, I was trying to validate a SEACR pipeline using data from https://www.protocols.io/view/cut-amp-tag-data-processing-and-analysis-tutorial-bjk2kkye?comment_id=101471. I see a huge difference in the number of peaks called when the IgG_rep1 control is trimmed versus not trimmed. The peak numbers for Hs_K27m3_rep1, Hs_K4m3_rep1, Hs_K4m3_rep2 match pretty well and look ok, whether you trim the low quality reads or not. But this one case Hs_K27m3_rep2 shows 73956 peaks when not trimmed, but when trimmed it shoots up to 243649 peaks. I am running SEACR on non and stringent mode with Ecoli K12 spikein. If I look at the IgG reads before and after trimming, just a few 1000 reads are eliminated. The bedgraphs look slightly different but the way SEACR calls peaks is hugely different for only this sample. The other samples are not sensitive to IgG trimming. Would you know why this is happening?

Ranjana