fangwuwang / team_Bloodies


What do you mean by coverage? #10

Open psomdeb25 opened 7 years ago

psomdeb25 commented 7 years ago

@singha53

I am doing DNA methylation analysis on my dataset. A lot of places have mentioned coverage. But I am not quite able to grasp the concept of what coverage is in a DNA Methylation data. I tried looking up literature but I am not able to get a clear meaning for the same.

Thanks!

ppavlidis commented 7 years ago

It might help to refer specifically to a citation/quote. But I think I know what is implied.

For methylation microarrays, coverage would (probably) refer to how many CpGs are assayed. There are something like 20 million CpGs in the human genome. 450,000 of those are "covered" by the common Illumina platform (more for the new version). As a fraction it would be between 0 and 1.
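As a rough illustration of the fraction described above, here is a minimal sketch. The function name `breadth_fraction` is made up for this example, and the inputs are the round figures quoted in this comment, not exact counts.

```python
def breadth_fraction(assayed_cpgs: int, total_cpgs: int) -> float:
    """Fraction of all CpGs in the genome that a platform assays."""
    if total_cpgs <= 0:
        raise ValueError("total_cpgs must be positive")
    if not 0 <= assayed_cpgs <= total_cpgs:
        raise ValueError("assayed_cpgs must be between 0 and total_cpgs")
    return assayed_cpgs / total_cpgs


# Round figures from the comment above (approximate, for illustration only).
fraction = breadth_fraction(450_000, 20_000_000)
print(f"{fraction:.4f}")  # 0.0225
```

So with these approximate numbers the array "covers" a little over 2% of the CpGs in the genome.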

For sequencing it means something analogous, but there's "depth" (number of reads per base) as well as "breadth" (in this context, how many CpGs are "detected" with sufficient data to make a call), whereas for the microarrays we're only talking about breadth.
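To make the depth/breadth distinction concrete, here is a small sketch on toy data. The function name, the per-CpG read counts, and the `min_reads` threshold of 10 are all assumptions for illustration; real pipelines choose their own cutoff for when a site has enough data to make a call.

```python
def depth_and_breadth(reads_per_cpg, total_cpgs, min_reads=10):
    """Return (mean depth, breadth) for a set of CpG sites.

    reads_per_cpg: dict mapping a CpG identifier to its read count
                   (sites with no reads may simply be absent).
    total_cpgs:    number of CpGs that could have been observed.
    min_reads:     hypothetical cutoff for "enough data to make a call".
    """
    if total_cpgs <= 0:
        raise ValueError("total_cpgs must be positive")
    # Depth: average number of reads per CpG, counting unseen sites as 0.
    mean_depth = sum(reads_per_cpg.values()) / total_cpgs
    # Breadth: fraction of CpGs that reach the calling threshold.
    callable_sites = sum(1 for n in reads_per_cpg.values() if n >= min_reads)
    breadth = callable_sites / total_cpgs
    return mean_depth, breadth


# Toy example: 5 CpGs in total, 3 of them seen in the data.
reads = {"chr1:100": 25, "chr1:150": 3, "chr1:220": 12}
mean_depth, breadth = depth_and_breadth(reads, total_cpgs=5)
print(mean_depth, breadth)  # 8.0 0.4
```

Note that the two numbers can move independently: sequencing a few sites very deeply raises depth without improving breadth.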

I hope that helps ...

psomdeb25 commented 7 years ago

Thanks for the information! It clears my confusion to some extent.

For your reference, this is one of the papers which I was looking into. High_Density_DNA-Meth_Array.pdf

In Section 2.3 they mention gene coverage, and in Section 2.4 CpG island coverage. So I assume it means the number of reads (methylation signal) that the analysis outputs across an entire gene in the first case, and across the CpG-rich regions in the second.

ppavlidis commented 7 years ago

They're talking about a microarray, so there are no reads. There are probes, and the signal comes from fluorescence of the labeled DNA that is hybridized to the array. (If you are confused about the difference between microarrays and sequencing, you should ask someone to go over this with you.)

They do a sequencing-based assay for "validation" but I don't think they use the term coverage there - they're just comparing their beta values to show how great the microarray is (treating sequencing as the gold standard). It can also be confusing because Illumina makes both microarrays and sequencers.

Anyway, the definition of coverage that applies here is the one I gave for microarrays, not the one for sequencing.

For a genome feature (here meaning a contiguous span of nucleotides), they mean "the platform has at least one probe for a CpG that lies within that feature".

Thus for gene coverage, the feature is a gene. That needs to be defined too, because it's not like genes have little green lights to show you where they "start" and "end". They seem to mean some coordinates taken from RefSeq, but I don't see where they say so clearly. That's a detail, but obviously if they change the definition of "gene" the "coverage" would change.

For CpG islands (regions that are relatively rich in CpGs, without clearly defined borders, often but not necessarily near the 5' end of genes), same idea.
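The "at least one probe within the feature" definition can be sketched as follows. Everything here is hypothetical toy data: the `Feature` type, the coordinates, and the probe positions are invented for illustration and are not taken from the paper or from any real annotation.

```python
from typing import NamedTuple


class Feature(NamedTuple):
    name: str
    chrom: str
    start: int  # inclusive
    end: int    # inclusive


def covered_features(features, probes):
    """Names of features that contain at least one probed CpG.

    probes: iterable of (chrom, position) pairs, one per probed CpG.
    """
    positions_by_chrom = {}
    for chrom, pos in probes:
        positions_by_chrom.setdefault(chrom, []).append(pos)
    return {
        f.name
        for f in features
        if any(f.start <= p <= f.end for p in positions_by_chrom.get(f.chrom, []))
    }


# Toy features (could equally be genes or CpG islands) and toy probes.
features = [
    Feature("geneA", "chr1", 1000, 5000),
    Feature("geneB", "chr1", 8000, 9000),
    Feature("geneC", "chr2", 100, 900),
]
probes = [("chr1", 1200), ("chr1", 4800), ("chr2", 950)]

covered = covered_features(features, probes)
coverage = len(covered) / len(features)
print(sorted(covered), round(coverage, 2))  # ['geneA'] 0.33
```

In this toy case, extending geneC's end coordinate from 900 to 1000 would make it "covered" by the probe at position 950, which is why the definition of the feature boundaries matters for the reported coverage.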