Closed apaytuvi closed 8 years ago
I'd be hesitant to apply this to metagenomics, I guess partly because I'm not familiar enough with it. One of the assumptions of the GCbias correction is that there shouldn't be an enrichment of coverage over a particular GC content % stretch. I could easily envision that that would instead be expected in many metagenomics contexts.
I'll add that, at least for what we do, GC bias is now largely absent, since the polymerases have improved significantly. I can't say whether that's also the case for metagenomics library prep.
Tricky issue. First, there's the technical part: computeGCbias needs a compressed FASTA file (2bit format) of the reference genome to get an idea of the genome's GC content. Since you're assembling with Illumina short reads (I assume), you're already introducing some unknown bias there (though there's no way to quantify that until you re-assemble the genome with, for example, PacBio reads).
The most straightforward approach, IMHO, is to focus the quantification on regions with similar GC content across all bacterial genomes.
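To make the "regions with similar GC content" idea a bit more concrete, here is a minimal sketch in plain Python. It is not part of deepTools; the function names, the 1 kb window size and the 45-55% GC range are all assumptions chosen for illustration. It scans a contig in non-overlapping windows and keeps only those whose GC fraction falls inside the chosen range, so that quantification could be restricted to those windows.

```python
def gc_fraction(seq):
    """Fraction of G/C among unambiguous bases (A, C, G, T); None if there are none."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    return gc / total if total else None


def windows_in_gc_range(seq, window=1000, gc_min=0.45, gc_max=0.55):
    """Yield (start, end, gc) for non-overlapping windows with gc_min <= GC <= gc_max.

    The window size and GC range are hypothetical defaults, not recommendations.
    A trailing partial window is ignored.
    """
    for start in range(0, len(seq) - window + 1, window):
        gc = gc_fraction(seq[start:start + window])
        if gc is not None and gc_min <= gc <= gc_max:
            yield start, start + window, gc


if __name__ == "__main__":
    # Toy contig: two balanced windows around one AT-only window.
    contig = "ACGTACGT" + "AAAAAAAA" + "GCATGCAT"
    for start, end, gc in windows_in_gc_range(contig, window=8):
        print(start, end, gc)
```

The resulting intervals could then be written out (for example as BED) and used to limit read counting to comparable regions in every assembled genome. Whether enough such windows exist in a genome with, say, 20% average GC is something you would have to check on your own data.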
Thank you for your replies. But, theoretically, by knowing the distribution of the GC content across the Illumina reads (e.g. this plot), we could correct the reads. For example, let's say that a region with 20% GC is sequenced half as much as a region with 45% GC.
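The arithmetic behind that example can be sketched as follows. This is a simplified illustration, not how correctGCbias is implemented: the function name and the input dictionaries are hypothetical, and it assumes that the expected read count per GC bin is known and that any deviation from it is purely technical. Each bin gets a weight equal to its expected share of reads divided by its observed share.

```python
def gc_weights(observed, expected):
    """Per-GC-bin weight = expected read share / observed read share.

    observed, expected: dicts mapping a GC bin (e.g. 20 for 20% GC) to a read count.
    Bins with no observed or no expected reads get None, since no weight can be derived.
    """
    obs_total = sum(observed.values())
    exp_total = sum(expected.values())
    weights = {}
    for gc_bin, exp in expected.items():
        obs = observed.get(gc_bin, 0)
        if obs == 0 or exp == 0:
            weights[gc_bin] = None
            continue
        weights[gc_bin] = (exp / exp_total) / (obs / obs_total)
    return weights


if __name__ == "__main__":
    # Hypothetical counts: both bins should get equally many reads,
    # but the 20% GC bin received half as many as the 45% GC bin.
    expected = {20: 100, 45: 100}
    observed = {20: 50, 45: 100}
    w = gc_weights(observed, expected)
    # The 20% GC bin ends up weighted twice as heavily as the 45% GC bin.
    print(w[20] / w[45])
```

The catch is the `expected` input: in a metagenome, a low read count in the 20% GC bin could equally well mean that the low-GC organisms are simply less abundant, and the sample alone cannot tell these two cases apart.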
In principle, this would be possible (and similar to what correctGCbias is doing).
However, I don't see a way to make sure that you're not introducing new (more?) bias. After all, Illumina sequencing is notoriously non-uniform, and just because one region with 65% GC was dramatically over-amplified does not mean the same holds for all regions with 65% GC. If your goal is to obtain reliable quantifications across different organisms (presumably, the bias already starts at the point of shearing the DNA, which may differ for different GC contents...), I would try to find a solution that allows you to focus on regions that have similar sequence characteristics.
You can correct regions within a given organism, but that's not the issue in metagenomics. The problem is that you can't discriminate between having an enrichment of a species with a funky GC content from having a sequencing bias. In your context, a "relative coverage" of ~1 might occur when there's bias, rather than indicating the opposite. In other words, GC bias computation can't be performed on a sample that needs to be used to estimate signal enrichment. This is the same reason we don't compute GC bias on ChIP samples and instead use their input controls, since only then do the assumptions underlying the bias computation hold.
I think we should normalize the GC bias when doing metagenomics. We perform a metagenomic assembly and then map the input reads against the assembly to quantify. But, of course, since prokaryotes can have a very wide range of GC content, the bias here is evident: bacteria with an average GC content of 40-60% will be sequenced more than bacteria with a GC content of 20%.
What do you think about it? Is your approach valid in this context?