deeptools / deepTools

Tools to process and analyze deep sequencing data.

GC correction for metagenomes #394

Closed apaytuvi closed 8 years ago

apaytuvi commented 8 years ago

I think we should normalize for GC bias when doing metagenomics. We perform a metagenomic assembly and then map the input reads back against the assembly for quantification. But, of course, since prokaryotes can have a very wide range of GC content, the bias here is evident. Bacteria with an average GC content of 40-60% will be sequenced more than a bacterium with a GC content of 20%.

What do you think about it? Is your approach valid in this context?

dpryan79 commented 8 years ago

I'd be hesitant to apply this to metagenomics, I guess partly because I'm not familiar enough with it. One of the assumptions of the GC bias correction is that coverage shouldn't be enriched over stretches of a particular GC content. I could easily envision that such enrichment would instead be expected in many metagenomics contexts.

dpryan79 commented 8 years ago

I'll add that, at least for what we do, GC bias is now largely absent, since the polymerases have significantly improved. I can't say whether that's also the case for metagenomics library prep.

friedue commented 8 years ago

Tricky issue. First, there's the technical part: computeGCBias needs the reference genome sequence (as a compressed 2bit file) to get an idea of the genome's GC content. Since you're assembling with Illumina short reads (I assume), you're already introducing some unknown bias there (though there's no way to quantify that until you re-assemble the genome with, for example, PacBio reads). The most straightforward approach, IMHO, is to focus the quantification on regions with similar GC content across all bacterial genomes.
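To make the "regions with similar GC content" suggestion concrete, here is a minimal sketch that scans a sequence in non-overlapping windows and keeps only those whose GC fraction falls inside a chosen range. The function names, the window size, and the 40-60% range are assumptions for illustration; this is not part of deepTools.

```python
def gc_fraction(seq):
    """Return the G/C fraction among unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    return gc / total if total else 0.0


def windows_in_gc_range(seq, window=100, low=0.40, high=0.60):
    """Yield (start, end, gc) for non-overlapping windows with low <= gc <= high."""
    for start in range(0, len(seq) - window + 1, window):
        gc = gc_fraction(seq[start:start + window])
        if low <= gc <= high:
            yield start, start + window, gc
```

Quantification could then be restricted to reads overlapping the returned windows in each assembled genome, so that the compared regions share similar sequence composition.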

apaytuvi commented 8 years ago

Thank you for your replies. But, theoretically, by knowing the distribution of the GC content across the Illumina reads (e.g. this plot), we could correct the reads. For example, let's say that a region with 20% GC is sequenced half as much as a region with 45% GC.
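A minimal sketch of the correction proposed here, assuming each region is summarized as a (GC percent, mean coverage) pair: coverage is binned by GC content, each bin's mean is divided by the global mean to get a relative coverage, and every region is rescaled by the inverse of that ratio. The function names and the 5% bin width are hypothetical, and this is a simplified illustration, not how correctGCBias is implemented.

```python
from collections import defaultdict


def relative_coverage_by_gc(regions, bin_width=5):
    """Map each GC bin to (mean coverage in bin) / (global mean coverage).

    regions: iterable of (gc_percent, mean_coverage) pairs.
    """
    regions = list(regions)
    if not regions:
        raise ValueError("no regions given")
    global_mean = sum(cov for _, cov in regions) / len(regions)
    if global_mean == 0:
        raise ValueError("global mean coverage is zero")
    by_bin = defaultdict(list)
    for gc, cov in regions:
        by_bin[int(gc // bin_width)].append(cov)
    return {b: (sum(c) / len(c)) / global_mean for b, c in by_bin.items()}


def correct_coverage(regions, bin_width=5):
    """Rescale each region's coverage by the inverse of its bin's relative coverage."""
    regions = list(regions)
    rel = relative_coverage_by_gc(regions, bin_width)
    corrected = []
    for gc, cov in regions:
        r = rel[int(gc // bin_width)]
        # Leave the value untouched if the whole bin has zero coverage.
        corrected.append(cov / r if r > 0 else cov)
    return corrected
```

With the example above (a 20% GC region at half the coverage of a 45% GC region), both regions end up with the same corrected coverage. Note that this happens whether the original difference came from sequencing bias or from a real difference in abundance.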

friedue commented 8 years ago

In principle, this would be possible (and similar to what correctGCBias is doing). However, I don't see a way to make sure that you're not introducing new (more?) bias. After all, Illumina sequencing is notoriously non-uniform, and just because one region with 65% GC was dramatically over-amplified, this may not necessarily be true for all the regions with 65% GC. If your goal is to obtain reliable quantifications across different organisms (presumably, the bias starts already at the point of shearing the DNA, which may differ for different GC contents...), I would try to find a solution that allows you to focus on regions that have similar sequence characteristics.

dpryan79 commented 8 years ago

You can correct regions within a given organism, but that's not the issue in metagenomics. The problem is that you can't discriminate between an enrichment of a species with a funky GC content and a sequencing bias. In your context, a "relative coverage" of ~1 might occur when there's bias, rather than indicating the opposite. In other words, GC bias computation can't be performed on a sample that needs to be used to estimate signal enrichment. This is the same reason we don't compute GC bias on ChIP samples and instead use their input controls, since only then do the assumptions underlying the bias computation hold.
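The identifiability problem can be illustrated with a toy model, assuming observed coverage is simply true abundance multiplied by a GC-dependent sequencing efficiency. The model, the names, and the numbers below are all hypothetical; the point is only that two different scenarios produce the same observation.

```python
def observed_coverage(abundance, efficiency):
    """Toy model: observed coverage = true abundance * GC-dependent efficiency."""
    return {name: abundance[name] * efficiency[name] for name in abundance}


# Scenario A: both species equally abundant, but the low-GC genome is
# sequenced only half as efficiently (pure sequencing bias).
scenario_a = observed_coverage(
    abundance={"low_gc": 20.0, "mid_gc": 20.0},
    efficiency={"low_gc": 0.5, "mid_gc": 1.0},
)

# Scenario B: no sequencing bias, but the low-GC species really is
# half as abundant (pure biological signal).
scenario_b = observed_coverage(
    abundance={"low_gc": 10.0, "mid_gc": 20.0},
    efficiency={"low_gc": 1.0, "mid_gc": 1.0},
)

# The observations are indistinguishable without an independent control.
print(scenario_a == scenario_b)
```

Because the two scenarios give identical coverage, a bias estimate derived from the sample itself cannot tell them apart, which is why an independent control (such as the input in ChIP experiments) is needed.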