FredHutch / SEACR

SEACR: Sparse Enrichment Analysis for CUT&RUN
GNU General Public License v2.0

biological replicates and spike-in #19

Open olechnwin opened 4 years ago

olechnwin commented 4 years ago

Hi, can you please give a suggestion on how to use SEACR with experiments that have spike-ins and replicates? What do you recommend for spike-in normalization, and how should replicates be handled?

Thank you.

mpmeers commented 4 years ago

Hi,

We typically use the number of reads mapping to the spike-in genome as a direct scaling factor for our bedgraphs, in which the bedgraph signal (column 4) is multiplied by a constant divided by the # of spike-in reads. If you are confident in your spike-in, you can use spike-in-calibrated bedgraphs for target and control in SEACR with the "non" option in order to forgo internal normalization of the bedgraph values and retain the pre-calibrated ones. I'd recommend trying both "non" and "norm" to see what they turn up.
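The scaling described above (multiplying bedgraph signal by a constant divided by the spike-in read count) can be sketched in a few lines. This is a minimal illustration, not the authors' actual pipeline; the function name and the constant of 10,000 are arbitrary choices for the example.

```python
# Hypothetical sketch of the spike-in scaling described above:
# multiply bedgraph signal (column 4) by a constant divided by
# the number of reads mapping to the spike-in genome.
def scale_bedgraph(lines, spike_in_reads, constant=10000):
    """Yield bedgraph lines with column 4 scaled by constant / spike_in_reads."""
    factor = constant / spike_in_reads
    for line in lines:
        chrom, start, end, signal = line.split()
        yield f"{chrom}\t{start}\t{end}\t{float(signal) * factor:.4f}"

# Example: 2,000 spike-in reads with a constant of 10,000 gives a 5x factor.
scaled = list(scale_bedgraph(["chr1\t0\t100\t3"], spike_in_reads=2000))
```

Bedgraphs scaled this way for both target and control can then be passed to SEACR with the "non" option so the pre-calibrated values are retained.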

As for replicates, it depends on your goal. If you're simply trying to see how similar your replicates are, you could take the union of all peaks called in any single replicate and re-map paired-end BAM or BED signal onto those intervals, and calculate correlations. Since SEACR is designed to be rather selective in its peak calling, I would be wary of using the intersection of peaks (e.g. Venn diagram overlap) to determine similarity, unless you're only interested in the very strongest peaks from your data. For using replicates from two different conditions to compare them, again using the union set of peaks from all replicates and then remapping reads in order to use some differential usage tool like DiffBind (https://bioconductor.org/packages/release/bioc/html/DiffBind.html) makes sense.

Let me know if that's helpful.

Mike

olechnwin commented 4 years ago

@mpmeers,

Thank you very much for your helpful suggestions. I will try these and see what happens.

yaoyaoyaochen commented 4 years ago

Hi Mike, what does the number of reads mapping to the spike-in genome stand for? Does it represent the amount of Tn5 or the cutting power? I am trying to understand what this spike-in genome normalizes for. Thanks.

mpmeers commented 4 years ago

Hi,

Since the amount of Tn5 and the cutting efficiency should, in theory, be constant across samples, differences in amount of DNA should arise solely from the number of individual target molecules that are cut/tagmented and amplified in sequencing in a given experiment. For instance, in a hypothetical situation in which a feature (histone modification, TF, etc.) is targeted in two CUT&RUN/Tag experiments with identical numbers of cells, sequenced to identical depths, and found to have identical distributions of peaks, but the number of spike-in reads is different, we'd infer that in the sample with fewer spike-in reads, a higher number of cells contains the targeted feature at a given locus. Hopefully that clarifies it some.

Mike

yaoyaoyaochen commented 4 years ago

Hi Mike,

Do you think that this normalization will amplify the difference in signal between samples? For example, I have cell A and cell B, each with 2M reads. For A, 60% (1.2M) of reads align to the target genome, with 40% (0.8M) spike-in reads. For B, 80% (1.6M) of reads align to the target genome, with 20% (0.4M) spike-in reads. When I divide target reads by spike-in reads, the calibrated value for A is 1.5 and for B is 4. Thus, the difference in overall signal intensity between A and B is enlarged. To my understanding, the spike-in DNA is like GAPDH or actin in a western blot, which is used to normalize differences in starting material. Does this make sense to you? Let me know if I misunderstand the data processing settings. Thanks.

Yao

mpmeers commented 4 years ago

Hi,

This is actually reflecting the correct ratio of input DNA. Suppose that the total arbitrary "DNA units" contributed by the spike-in is equal to 10. In order to get 60% of total reads mapping in the combination of A and the spike-in DNA, A must contain 15 units (15/(15+10) = 60%), whereas B must contain 40 units (40/(40+10) = 80%). As you outlined above, if a given feature makes up an equal proportion of the signal from A and B, using the spike calibration in the manner I described will give a 1.5:4 ratio of signal for that feature between A and B, which is equal to the "true" ratio between them. Hope that helps.
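Working through the arithmetic from the two comments above makes the point concrete: the spike-calibrated ratio (1.5 : 4) equals the true ratio of input "DNA units" (15 : 40). A quick numerical check:

```python
# Mike's example: the spike-in contributes 10 arbitrary "DNA units";
# samples A and B map 60% and 80% of reads to the target genome.
spike_units = 10
units_A = 15   # 15 / (15 + 10) = 0.60 of reads map to target A
units_B = 40   # 40 / (40 + 10) = 0.80 of reads map to target B

assert units_A / (units_A + spike_units) == 0.60
assert units_B / (units_B + spike_units) == 0.80

# Spike-calibrated signal from the earlier 2M-read example:
# target reads divided by spike-in reads.
calibrated_A = 1.2e6 / 0.8e6   # 1.5
calibrated_B = 1.6e6 / 0.4e6   # 4.0

# The calibrated ratio matches the true ratio of input DNA units.
assert calibrated_A / calibrated_B == units_A / units_B   # both 0.375
```

So the calibration is not amplifying the difference; it is recovering the real difference in target DNA that the fixed sequencing depth had compressed.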

Mike

laijen000 commented 4 years ago

Hi Mike,

Thank you for this tool! I would like to profile a TF's binding sites and have 2 replicates where I added the TF antibody + one IgG control (all the same cell type, prepared on the same day, same sequencing run). I would like to use the replicates to obtain a high confidence set of peaks for downstream analysis (such as seeing what genes the peaks are nearby to infer target genes of this TF). Do you have recommendations on how to do this? I wasn't quite sure from your previous message how to re-map reads from each sample to a set of union peaks and calculate the correlation.

Secondly, when I call peaks using the TF, the IgG control, normalization (no pre-spike in calibration), stringent or relaxed, I only get ~30 peaks. However when I use just the TF sample and specify 0.01 as the threshold, I get ~8k peaks (and de novo motif finding using MEME on these 8k peaks identifies enrichment of the TF's motif with reasonable footprinting). I'm wondering whether it is acceptable to call peaks without an IgG control using this option, and how this option calls peaks differently from using an IgG control?

Lastly, do you recommend spike-in normalization for both the TF sample and the IgG control before peak calling (when the TF and IgG were performed in the same cell type, same cell number)? I am a little confused because I also read that spike-in normalization is only needed when comparing 2 different conditions, such as WT cells and TF KO cells?

Thank you so much for your help and clarification of these points!

Potential workflow:

  1. after trimming and alignment, use all fragments (do not filter by size or duplicates)
  2. normalize sample and IgG control by spike-in [? necessary?]
  3. call peaks (specify threshold 0.01 instead of using IgG control if IgG control is too stringent?)
mpmeers commented 4 years ago

Hello,

Regarding replicates, though I generally advocated against it above since I think it will be overly restrictive in many cases, it sounds like simply calling the peaks from individual replicates and then using the intersection set of peaks overlapping between them might be your preferred approach to get a "high confidence" set. Another alternative would be to use correlation analysis via remapping to the union set of peaks (e.g. by using bedtools coverage to count the number of paired end fragments in each replicate that map to each peak in your union set) and, in the event that the correlation is high, merge the two replicates into a single dataset and call peaks from that. It might be worth trying both approaches to see what best suits your needs.
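The remap-and-correlate step above can be sketched in pure Python. In practice bedtools coverage would do the overlap counting far more efficiently; this toy version (with made-up intervals) just illustrates the shape of the computation: count fragments per union peak in each replicate, then correlate the two count vectors.

```python
# Toy sketch of correlation analysis over a union peak set.
# Peaks and fragments are (chrom, start, end) tuples; real data
# would come from SEACR output BEDs and paired-end fragment BEDs.
def count_overlaps(peaks, fragments):
    """Count fragments overlapping each peak (naive O(n*m) scan)."""
    counts = []
    for p_chrom, p_start, p_end in peaks:
        n = sum(1 for f_chrom, f_start, f_end in fragments
                if f_chrom == p_chrom and f_start < p_end and f_end > p_start)
        counts.append(n)
    return counts

def pearson(x, y):
    """Pearson correlation of two equal-length count vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x) ** 0.5
    vy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (vx * vy)

union_peaks = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 0, 50)]
rep1 = [("chr1", 120, 180), ("chr1", 130, 160), ("chr1", 510, 590), ("chr2", 10, 40)]
rep2 = [("chr1", 130, 170), ("chr1", 520, 580), ("chr1", 550, 560)]
r = pearson(count_overlaps(union_peaks, rep1), count_overlaps(union_peaks, rep2))
```

If the resulting correlation across replicates is high, merging the replicates into one dataset and calling peaks on the merged file, as suggested above, is a reasonable next step.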

Regarding non-IgG peak calling, it's certainly "acceptable" with the caveat that peaks that are also present in the IgG dataset won't be filtered out, which often correspond to repeat regions that could confound your analysis. To remedy this, it might be worth using a blacklist (e.g. from ENCODE) to further filter your peaks after they're called with a 0.01 threshold.

Regarding spike-in, it is useful when trying to quantitatively evaluate differences across samples in the same peaks, and can be useful for SEACR peak calling (i.e. "non" mode) if you're confident in the quality of the spike-in. In general I use "norm" mode for peak calling, in which case normalization is done automatically and therefore spike-in normalization is irrelevant to the actual peak calling, but in cases where the data are somewhat lower quality, internal normalization can reduce the number of peaks yielded, so spike-in normalization prior to peak calling with "non" mode can be an alternative. For our spike in scaling we simply multiply bedgraph signal by a constant divided by the number of reads mapping to the spike-in genome.

Regarding your workflow, it looks reasonable, though I would take a look at my comments regarding trimming in issue #38.

Hope this helps, and let me know if you have any other questions.

Mike

onebeingmay commented 3 years ago

Hi Mike, Thank you for the elaboration. One beginner's question: what to do when I have multiple IgG control samples (biological replicates) when calling peaks? Right now I am planning to merge IgG control bedgraphs and call peaks of my experiment samples using merged IgG control as background. Is this correct? @mpmeers

mpmeers commented 3 years ago

Hi,

I think merging the IgG control samples is reasonable assuming the resulting number of fragments between the merged IgG file and the target file are not substantially different. I've found that sometimes there can be issues when the number of fragments is 4-5 fold different between the two files, likely because the background is not equally represented between the two files. Otherwise I think merging is fine assuming the data were yielded from the same experiment.

Mike

cmonger commented 2 years ago

Dear Mike,

I have been reading through the discussion in this thread and am interested in applying SEACR in my workflow as an alternative to MACS for CUT&RUN.
Could you elaborate further on your spike-in normalisation strategy?

We have data with biological replicates where we have normalised with spike-in for all but our IgG control samples. Would you recommend normalising the 'treatment' samples to generate spike-in normalised bedgraphs and the IgG samples in a non-normalised way or using CPM normalisation?

Or would it be better to only provide the reads for both the treatment and IgG control samples in a non-normalised way and let SEACR handle the normalisation?

Best Regards Craig

mpmeers commented 2 years ago

Hi Craig,

If you're using "norm" mode in SEACR, it shouldn't matter how the bedgraphs are normalized because SEACR will internally normalize the treatment and control files to each other. An alternative approach would be to normalize all of your bedgraph files (treatment and control) by spike-in and then use "non" mode in SEACR. In my experience this can be helpful for targets such as a core histone that are broadly distributed across the genome, since the pattern will be similar to IgG and therefore normalization that scales the median signal between files is likely to undercount enriched regions. However, for most other applications, I prefer "norm" mode, and in that case spike normalization may only play a role downstream of SEACR where you're calculating differential peak enrichment, for instance (described to some extent in Issues #14 and #17).

In short, normalization is helpful for downstream differential analysis, but irrelevant for the actual mechanics of SEACR unless using "non" mode. Hope that helps.

Mike