Closed RJDan closed 1 week ago
'what proportion of the genome-wide base combination xxx do we have represented in our data'. i.e. it has nothing to do with the number of reads at any position. Do I understand this correctly?
This is correct. If I remember correctly, this measure was meant to identify whether certain treatments or library preps introduce biases, e.g. by favouring or avoiding GC-rich sequences altogether.
Just for the record, the 'percent genomic coverage' from the nucleotide_stats output is actually 2 separate fields, percent genomic and coverage (which is tricky to see in a non-tab-separated form).
Regarding the MultiQC report, I am not actually sure what this number is. Maybe you could raise this with Phil or Vlad over at the MultiQC repo?
I am a little confused because the coverage 0.632 is the same as in your example above, but this could be just coincidental... If memory serves me right it might also be in the correct area if you divide 743442316 (Cs covered in the sample) by 1.2 billion or so (the number of Cs in the human genome). Are you supplying the nucleotide coverage report to MultiQC when running it?
As a final word, I have to confess that I haven't really looked at this coverage statistic in many years....
Thanks for the quick reply.
I had a look at values from some other data, and data with higher sequencing coverage had MultiQC coverage values > 1, which precludes it being a 'proportion of genome-wide cytosines'.
Would it be possible to only use 'coverage' to mean one thing in the reports to make it less confusing?
Yes, MultiQC was provided with the nucleotide reports.
I don't think it was ever meant to be an indication of the 'proportion of genome-wide cytosines', but rather a (fairly crude) measure of fold-coverage. E.g. if we assume 1 billion Cs in the genome, and the data has 15 billion methylation calls, the rough coverage would be 15X.
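To spell out that arithmetic, here is a minimal Python sketch. The function name is made up for illustration and is not part of Bismark or MultiQC, and the 1 billion Cs figure is just the round assumption from the example above, not a measured genome value.

```python
def rough_fold_coverage(methylation_calls: int, cytosines_in_genome: int) -> float:
    """Crude fold coverage: total methylation calls divided by the
    number of cytosines assumed to be in the genome."""
    if cytosines_in_genome <= 0:
        raise ValueError("cytosines_in_genome must be positive")
    return methylation_calls / cytosines_in_genome


# Figures from the example above: 15 billion calls, 1 billion Cs assumed.
print(rough_fold_coverage(15_000_000_000, 1_000_000_000))  # 15.0
```

Because this is a ratio of calls to positions, it can exceed 1 for deeply sequenced samples, which fits the values > 1 seen in the other datasets.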
The fraction of Cs covered genome-wide can be calculated using the total Cs in the genome, and the number of positions (lines) in the bedGraph/coverage file.
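A rough sketch of that calculation in Python, under a few assumptions: the helper names are hypothetical, each data line in the file is taken to be one covered position, and the genome-wide total of Cs has to be supplied by you (and should match whichever cytosine contexts the file contains). The sample rows are invented purely to show the idea.

```python
from typing import Iterable


def count_covered_positions(lines: Iterable[str]) -> int:
    """Count data lines, skipping blank lines and any 'track' header line."""
    count = 0
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith("track"):
            continue
        count += 1
    return count


def fraction_of_cs_covered(lines: Iterable[str], total_cs_in_genome: int) -> float:
    """Fraction of genome-wide Cs that have at least one line in the file."""
    if total_cs_in_genome <= 0:
        raise ValueError("total_cs_in_genome must be positive")
    return count_covered_positions(lines) / total_cs_in_genome


# Invented example: 2 covered positions in a toy genome with 4 Cs.
example_lines = [
    "chr1\t100\t100\t50\t1\t1\n",
    "chr1\t205\t205\t100\t3\t0\n",
]
print(fraction_of_cs_covered(example_lines, 4))  # 0.5
```

For a real file you would pass an open file handle instead of the list (using `gzip.open(path, "rt")` if the file is gzipped). Unlike the fold-coverage figure, this value cannot exceed 1 as long as the total matches the contexts in the file.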
Thank you for the help!
Hi, I apologise if this is a silly question, but I am feeling thoroughly confused about the meaning of coverage.
I have seen this : https://github.com/FelixKrueger/Bismark/issues/338 and this : https://github.com/FelixKrueger/Bismark/issues/338 and I understand how average coverage is estimated (manually) as a value relative to the total genome size from the number of reads.
This is where I get confused. The 'fold coverage' here: https://github.com/FelixKrueger/Bismark/issues/47 and the 'percent genomic coverage', from the nucleotide_stats output here :
These are not number of reads at a particular position, but rather 'what proportion of the genome-wide base combination xxx do we have represented in our data'. i.e. it has nothing to do with the number of reads at any position. Do I understand this correctly?
And then just to be sure I understand the final 'coverage' value for C, the coverage reported by multiqc in the 'general_stats' file, e.g. :
These are the average number of reads (after all the trimming, deduplication, etc) at any C for the particular sample. These values have nothing to do with the number of C in our dataset relative to what is in the reference genome?