False Negative Peaks - Githubissues

FredHutch / SEACR

SEACR: Sparse Enrichment Analysis for CUT&RUN

GNU General Public License v2.0

104 stars 45 forks source link

False Negative Peaks #36

Open DKaukonen opened 4 years ago

DKaukonen commented 4 years ago

Hi,

I have several CUT&RN experiments looking at Histone modifications H3K4me3 (narrow peak) and H3K27me3 (broad peaks) in treated vs untreated cells, and seem to have several instances where the bedgraph file shows a peak, but there is none called. The images below are from IGV, where green is H3K4me3 and red is H3K27me3, and blue is the IgG track.

The two tracks below the bedgraph pileups are the peaks called using stringent and then relaxed.

K27_issue1

In this first one, which is a treated sample, the .bed file has 0 bytes, but the bedgraph file clearly shows what should be peaks.

K27_issue2

This is the same region for the untreated sample. It calls the peaks here, and if I just take the bed file at face value, it would seem that there is a difference between treatment and untreated, when there is not.

K27_Issue4

Above is a more prominent example.

K4_Issue3

K4_Issue4

K27_issue_1

It also seems to occur in the K4 samples too.

K4_Issue2

Then there is the opposite issue, where it does not look like there are peaks there, but it still calls them.

I ran the pipeline using Bowtie 2 with the recommendations from your publication, although I also marked duplicates with Picard.

Then I ran the following ($fn is the sample name in this instance)

`for fn in $filelines do

# Sort by read name
samtools sort -n  "$fn"_aligned.sorted.mkdup.bam | bedtools bamtobed -bedpe -i stdin > "$fn".bed
awk '$1==$4 && $6-$2 < 1000 {print $0}' "$fn".bed > "$fn".clean.bed

cut -f 1,2,6 "$fn".clean.bed | sort -k1,1 -k2,2n -k3,3n > "$fn".fragments.bed bedtools genomecov -bg -i "$fn".fragments.bed -g /home/OnBoardStorage/dcote/software/genomes/Hg38/UCSC/hg38.chrom.sizes > "$fn".fragments.bedgraph

done

while IFS=, read experiment IgG output do

bash SEACR_1.3.sh ${experiment}.fragments.bedgraph ${IgG}.fragments.bedgraph norm stringent ./peakfiles_stringent/${output}
bash SEACR_1.3.sh ${experiment}.fragments.bedgraph ${IgG}.fragments.bedgraph norm relaxed ./peakfiles_relaxed/${output}

done <SEACRconfig.txt `

Is there anything that I am doing wrong?

mpmeers commented 4 years ago

Hi,

It doesn't appear as though you've done anything wrong in the bam to bedgraph processing. You mentioned that you marked duplicates--did you remove them at all? Typically as long as the percentage of duplicates is in the single-digit range that we expect, we don't remove them, so it's possible that duplicate removal biased the target vs. IgG background (since IgG is likely to have more duplicates) and threw off the threshold calculation.

Even if you kept duplicates, this gets to a broader point that the quality of the IgG file can influence how the peak calling goes, since SEACR relies on an accurate estimation of short background "peaks" that are present in every CUT&RUN/Tag dataset. It's hard for me to tell exactly from your screen shots, but it looks like the IgG background is pretty consistently covering the genome, which could mean that there are large contiguous blocks of signal that have the "appearance" of peaks to the algorithm, since it doesn't use a peak height threshold or local background estimation the way other peak callers do. This could particularly be the case if the datasets are very deeply sequenced. What sort of read depth were these samples sequenced to? We typically don't sequence any deeper than 10M paired-end fragments for CUT&RUN from human genomes, so if you're significantly above that it may be worth subsampling fragments and giving it another try. Let me know if either of those ideas help at all.

Mike

DKaukonen commented 4 years ago

Hi,

I had tried remove duplicates as H3K4me3 and H3K27me3 has about 10-25% duplication rate, but another similar target had about 70-85% duplication rate because they are a rare occurrence. Our sequencing depth was pretty high because we had 72 samples. One lane wasn't enough but two lanes was a bit of overkill (~15-20 million reads per sample).

I have tried now down sampling to 8 million reads and not removing duplicates and it did not seem to help, and in some cases I think it made what was previously an okay sample look worse. Note, each sample (and each paired read) was split across two lanes, so I down sampled each lane to 4 million reads each before aligning (IE as a FASTQ file). Then I aligned, then I merged using samtools. I did it this way because merging before alignment seemed to create quite a few discordant reads for some reason.

SKBR3_K4_1

SKBR3_K4_2

As before, Green is K4me3 and red is K27me3. As you can see, the green clearly peaks higher than the IgG (Blue), but they are not called.

SKBR3_1

Some are called and some aren't

SKBR3_2

And again we have differences between groups, which would be falsely reported. As you can also see, the peaks do line up with the promoter regions, so am inclined to not doubt their validity.

BT474_Issue_All_1

One thing I did notice, if that is an issue, is that there are several samples where there is a high number of reads on Chromosome 3, mostly occurring in our BT474 lines, but some of the SKBR3's have it too. I see a higher number of reads around Chr17, but that is to be expected given that these are HER2 amplified cell lines. I do not see anything to suggest Chr3 has similar levels of amplification.

BT474_Issue2

Looking closer, it appears to be occurring on the centromere. Could this be causing an issue? It also appears in the IgG samples.

-Damien

DKaukonen commented 4 years ago

I should add that this time there was only one sample with an empty bedfile, so it seems we are on the right track

mpmeers commented 4 years ago

Hi,

The repeated region you're looking at on Chr3 is a common occurrence in CUT&RUN data since repeated regions are more likely to be represented in the background--that's why it's enriched in the IgG as well. I've found that SEACR functions normally with these sorts of regions though, so I don't think it's the source of your issues.

SEACR is naturally conservative in calling peaks, and the peak calls that are returned should generally be interpreted as the those that are highly unlikely to be false positives (SEACR returns as empirical FDR that reflects the proportion of IgG vs. target peaks above the threshold to give an idea of how likely a false positive is). That means that there can be some peaks that are plausible true positives that might be missed. For that reason, in comparing differential peak occupancy, I recommend against comparing peak intersections between two conditions, but rather quantifying the amount of signal in every peak called from either condition (i.e. the union set of peaks), similar to what I described in issue #19 regarding replicates. For a more comprehensive set of peaks that could plausibly be positive but are more likely to be false, you could use SEACR without an IgG control and instead provide a percentile threshold in the second input field (e.g. 0.01, 0.001, etc.), and test a range of thresholds to see which one captures the set of peaks that best matches your desired analysis strategy. Alternatively, peak callers such a Macs2 are much more permissive at the expense of including more false positives, so that might return something closer to what you're looking for.

All that being said, the screenshots you showed me are out of the ordinary for the sort of peaks that SEACR is usually "missing", so if you're interested in pursuing this further and would be willing to send me data privately, I'd be happy to go through it to try to better pinpoint your specific issue, and perhaps try to use it to improve SEACR for the next version. Let me know what you think.

Mike

DKaukonen commented 3 years ago

Hi Mike,

Apologies for the late response but we have been waiting for the data to be published to share it (we are also trying a new lab method with it).

I ended up using HMCan, which, while designed for ChIP-seq, also takes into consideration copy number variation and GC-content bias for amplification. It has worked quite well and thought I would share in case those are two metrics you would like to consider in SEACR.

I have now been on another project where we have used CUT&Tag and the peaks and IgG have been far more "clean". I decided to use SEACR again and we have a similar issue. Below is one region where we have the H3K4me3 and H3K27me3 BigWig (converted from the Bedgraph using UCSC tools) with their bed files from the relaxed run of SEACR 1.3. Below them are the IgGs

SEACR_example_1

SEACR_example_2JPG

As you can see the IgGs have small peaks, and the peaks between both replicates are identical for each histone mark, but for some reason the peak is only called in one and not the other. This is quite frequent throughout the samples, even switching which replicate has the peak called, despite having similar looking peaks. I am hesitant to use HMCan here as the run looks more textbook for how CUT&Tag should look compared to my previous CUT&RUN samples above and I am not sure it will handle the sparse reads.

Do you have any comments or suggestions about this?

-Damien

mpmeers commented 3 years ago

Hi Damien,

It does look like some of your samples are undercalling peaks, particularly in this region. Is this an issue genome wide, i.e. are there many more peaks in some samples vs. others, or are there just a few select instances of peaks calls missing as shown here? I just want to get an idea of the degree to which the thresholding is off. Also, if these data are published or will be soon, can I access them myself? It might be helpful for me to test a few internal parameters to try to figure out what's going on.

In the mean time, if it's only a few peaks that are missing, one potential workaround for analysis would be to merge peaks from all your conditions and re-map overlapping reads from every sample to the merged peak set. In this way, even when peaks are missing from one dataset, they get recovered from being called in the other datasets and you can still make comparisons across them. Since SEACR is rather stringent in general, this is sometimes a preferred strategy to simply intersecting peak lists.

I'm not familiar with HMCan, but it appears that it's reasonably designed and if it works for your data it's certainly worth trying. The only thing I would consider is whether you can bypass the read extension step, since discrete fragment sizes are available from CUT&Tag paired end data.

Let me know if this is helpful, and/or whether you can point me towards somewhere I could look at your data. Good luck.

Mike

DKaukonen commented 3 years ago

Hi Mike,

We have about 533 deferentially marked peaks, and every one we have checked (we haven't checked them all, but at least 15 or so) come up with the same issue. There does not appear to be a pattern to it, as they are not located in any particular region (chromosome, centromere etc), nor does it seem to be tied to a specific sample or replicate (although Control 1 H3K27me3 appears to be missing the peaks more often, other replicates and samples still have missing peaks).

How would you go about merging them? I have done merging before using bedtools, but I do not follow your instructions. Could you provide an example so I can make sure I get it right and post the results?

I will be sure to send you the link when either data set is made public.

Thank you again,

-Damien

mpmeers commented 3 years ago

Hi Damien,

See my first response in the thread for Issue #19 for general recommendations on merging. From a code perspective, you could either cat all your files and pipe to bedtools merge, e.g. cat *relaxed.bed | bedtools merge -i - > merged.bed, or you could use bedops -m (documentation here). To remap to the merged bed, I typically use bedtools intersect with the -c flag, where the -a file is merged.bed and the -b file is a bed file containing all fragments from the sample of interest. It's a little unwieldy when you have a lot of samples to remap, so bedtools multicov that can handle multiple BAM files at once is an alternative, although in that case you'll be mapping only the ends of the fragments (i.e. the reads listed in the BAM file) rather than the full fragments. The end product is a matrix with each row corresponding to a bed region from the merged bed file and each column representing reads/fragments mapped from a particular sample. Not sure if that's specific enough, but I'd be happy to correspond over email for a more detailed description of code considerations. Let me know if that's helpful.

Mike

DKaukonen commented 3 years ago

Hi Mike,

I follow to the point of creating a new reference to map to (the cat *relaxed.bed | betools merge). With remapping, would I loop over every relaxed.bed for -b? For example if I have 9 relaxed.bed files, I would do that 9 times?

I am also not sure about the final format and how I would expect to feed it downstream analysis (such as visualizing in IGV or doing differential peak analysis). I think I get the idea that I would have to do the analysis from that file on a column by column basis, is that correct? If not, we can correspond via email.

mpmeers commented 3 years ago

Hi Damien,

For remapping, the -b files would be bed files containing all the fragments in a particular sample, not the "*relaxed.bed" file--sorry that wasn't clear. So in this case if you have four samples for a particular histone mark, there would be four separate calls to bedtools intersect -c using the same merged.bed file as the -a. Again, a little unwieldy, so an alternative would be to use bedtools multicov with merged.bed as the bed and a list of BAM files for each sample as the -bams. Let me know if that makes sense.

Mike

DKaukonen commented 3 years ago

Hi Mike,

Just two things to confirm. For the merged.bed file, there is one for each experimental condition and histone mark, correct? For example Control_H3K4me3.merged.bed and Control_H3K27me3.merged.bed, correct? So with 4 conditions, 2 histone marks, and 2 replicates, I would have 8 merged files. Or is it one large merged file? My experience says it would be the former, but your instructions imply the latter.

Lastly, the bed file from the remapping, that would be the final file for downstream analysis, correct? Or would I need to call peaks on that file?

Apologies for the questions, but I am more of a visual learner. I am trying to make a diagram of this pipeline and these were the two things I was not following. I have also been unable to locate your email.

-Damien

mpmeers commented 3 years ago

Hi Damein,

Apologies for the delay in responding. Merged files would represent each individual histone mark, so the former is correct. And the final file would be the merged bed file with your different datasets remapped onto it, so that there is a column representing each dataset--in your case, three bed entries (chr, start, end) plus 8 more columns for 4 conditions x 2 replicates.

You can email me at mmeers at fredhutch dot org with more specific questions if you'd like.

Mike

DKaukonen commented 3 years ago

Hi Mike,

Thank you for the response. I am responding here with general questions in case anyone is following along. I ran the following code so far

cat *relaxed.bed | sort -k1,1 -k2,2n | bedtools merge -i - > K27_merged.bed as I noticed bedtools wanted a sorted input for merge, and

for file in *fragments.bed; do bedtools intersect -c -a K27_merged.bed -b $file > "$file".intersect.bed ; done to create the intersected bed files.

There are 2 questions that I was interested in your input on. 1) there are a lot of areas with fragments numbering in the tens to low hundreds. I imagine we want to have a cutoff value for removing these peaks before annotation? Do you have any value range you use?

2) Some of the regions spread pretty far, where the majority of the peak is in the middle, but there are short fragments that are overlapping other genes, despite not being that high of a pileup (no peak on the bedgraph/bigwig file). How do you handle these regions? Will all genes be annotated as having a peak, despite the majority of the reads occurring on one gene? Or peaks on multiple but not all of the genes in the region?

-Damien

mpmeers commented 3 years ago

Hi Damien,

Usually I don't implement cutoffs unless there is a clear outlier set, e.g. if peak read counts are clearly bimodally distributed around 10s-100 and 1000s. Also, since each peak is of a different length, the number of reads mapped is partly a reflection of that, so low counts in some peaks relative to others might be expected. If there exist called peaks for which there are very few reads in any condition, these might be candidates for filtering, but I'd tailor it to your data distribution rather than using a one-size-fits-all approach.
I'm not quite sure I understand this question, since the purpose of generating the matrix is understanding how peak usage differs between different conditions, and so the continuous read count value at each peak is the readout rather than its status as being annotated as having or lacking a peak. It sounds like you're in part referring to potential outlier regions (sites without a clear peak by visual examination) which was addressed in 1 above, but I'm not entirely clear on the structure of peaks you're describing. I can address that privately if you'd like.

Since these questions are no longer directly relevant to SEACR, I'd like to point out that there is a protocols.io page specifically focused on data analysis that you can reference and pose questions for troubleshooting help where some of your issues may have already been addressed by other users. I'm also happy to answer additional questions privately as I mentioned. I'll leave this issue posted for users interested in the issues pointed out with SEACR earlier in the thread.

Mike

qudrat-nii commented 1 year ago

Can you please tell me how to use SEACR to call peaks where spike-in DNA has been added to samples for normalization.