FredHutch / SEACR

SEACR: Sparse Enrichment Analysis for CUT&RUN
GNU General Public License v2.0

Ecoli normalisation issue #27

Open · leiendeckerlu opened this issue 4 years ago

leiendeckerlu commented 4 years ago

Hi there,

I'm trying to apply the normalization approach that uses reads mapping to the E. coli genome, which I'm doing like this:

bedtools genomecov -bg -i input.bed -g hg38_Ecoli.txt -scale scalingFactor > output.bedgraph

where I calculate the scalingFactor as follows:

scalingFactor = (1 / EcoliMappingReads) * 10^6
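Concretely, here's a minimal sketch of how I wire those two steps together in a shell script (assuming an indexed BAM aligned to the combined hg38 + E. coli reference with the spike-in contigs prefixed "Ecoli_"; sample.bam and that prefix are just placeholders for my own naming):

# Count reads mapped to the E. coli spike-in contigs (assumes an indexed BAM
# aligned to the combined hg38 + E. coli reference, "Ecoli_" contig prefix)
ecoli_reads=$(samtools idxstats sample.bam | awk '$1 ~ /^Ecoli_/ {sum += $3} END {print sum}')

# scalingFactor = (1 / EcoliMappingReads) * 10^6
scale=$(awk -v n="$ecoli_reads" 'BEGIN {print 1000000 / n}')

# Spike-in-normalized coverage track, as above
bedtools genomecov -bg -i input.bed -g hg38_Ecoli.txt -scale "$scale" > output.bedgraph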

What I'm now observing is that the number of E. coli-mapping reads differs by up to 20x between samples from the same experiment.

Any input on that is highly appreciated!

Thank you, Lukas

mpmeers commented 4 years ago

Hi Lukas,

How many E. coli reads are you observing when you map? If the issue is that replicates, for instance, are highly discordant where the expectation is that they should be roughly equal, it's possible that a low number of reads mapping to E. coli yields enough variance to throw the samples off; we typically aim for 0.1-0.5% of total reads to map to the spike-in genome.

It's also necessary to keep the amount of pA/pAG-MNase and ConA beads constant during the processing of samples in order to use the E. coli spike-in as an accurate proxy. As we outlined in our eLife paper, we suspect that the E. coli carryover is due to residual DNA from pA/pAG-MNase purification that binds to the ConA beads and is then released during the 2XStop step, so any variance in the amount of those reagents might affect the amount of E. coli DNA you get back. Finally, there's obviously variance in the number of cells between samples even with rigorous cell counting, so that's something to monitor.

Outside of those possible rationales, it is the case that different targets yield vastly different amounts of DNA, and a 20x difference would not be out of the ordinary for an abundant histone modification vs. a transcription factor, for instance. Let me know if that's helpful at all.

Mike

leiendeckerlu commented 4 years ago

Hi Mike,

I'm observing between 200 and 6,000 E. coli reads across samples. In general, in samples profiling histone marks I'm in the hundreds of E. coli reads, whereas for TFs and the guinea pig control I'm in the thousands. The correlation between replicates (duplicates) is good. The samples are sequenced to a depth of roughly 20M PE reads, so I'm well below your 0.1% of total reads mapping to the spike-in genome. The 20x difference in E. coli reads is between two highly abundant histone marks in a treated/untreated comparison. Yes, I'm aware of the underlying idea behind the E. coli carryover DNA, which is why I kept beads, cells and MNase amounts as constant as possible throughout the whole experiment.
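For reference, this is roughly how I'm estimating the spike-in fraction against total mapped reads (a sketch; sample.bam and the "Ecoli_" contig prefix are placeholders for my own naming):

# Total mapped reads vs. reads mapped to the E. coli spike-in contigs
total=$(samtools idxstats sample.bam | awk '{sum += $3} END {print sum}')
spike=$(samtools idxstats sample.bam | awk '$1 ~ /^Ecoli_/ {sum += $3} END {print sum}')
# Spike-in fraction as a percentage of all mapped reads
awk -v s="$spike" -v t="$total" 'BEGIN {printf "spike-in fraction: %.3f%%\n", 100 * s / t}'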

Is your conclusion from what I'm telling you here that the spike-in is not reliable enough in my case because there aren't enough reads on the spike-in genome? And do you think that, given the size of the E. coli DNA fragments, post-library-prep size selections could introduce biases here, e.g. by performing an additional size selection on a sample that still has adaptor dimers vs. another sample that doesn't, and so 'by accident' getting rid of the E. coli fragments?

Thanks, Lukas

mpmeers commented 4 years ago

Hi Lukas,

The fact that your replicates are generally concordant is a good sign. Since you do have relatively few E. coli reads, I'd recommend being careful about quantitative comparisons in the low-fold range unless you have the bandwidth to do a couple more replicates so that you can derive measures of inter-replicate variance/dispersion. However, the spike-in should still give you a general sense of how abundant your different epitopes are in different conditions (and your observation of different spike-in levels for histone marks vs. TFs supports this).
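For instance, once you have a few replicates, something as simple as the coefficient of variation of the spike-in read counts gives a rough dispersion estimate (a sketch; spike_counts.txt, with one E. coli read count per replicate per line, is just a placeholder):

# Mean and coefficient of variation of spike-in read counts across replicates
awk '{n++; sum += $1; sq += $1 * $1} END {m = sum / n; sd = sqrt(sq / n - m * m); printf "mean = %.1f, CV = %.2f\n", m, sd / m}' spike_counts.txt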

The point you bring up about size selections (which I presume you're doing with AMPure beads or some equivalent) is interesting, and something I haven't thought of before. I can only say that we routinely do extra rounds of cleanup on some samples and not others and haven't noticed a striking correlation with different spike-in amounts, but perhaps that's worth monitoring.

As for your treatment vs. control, do you already know (e.g. from a western blot or some other method) that the spike-in numbers are "wrong"? It seems odd that one particular sample would be so far off in the spike-in numbers, so I'd be interested to see whether your hypothesis regarding cleanup bears out if you do another replicate or two of this particular experiment. Apologies that this isn't more helpful, but in general the other spike-in metrics you report seem in line with what we'd expect.

Mike