FelixKrueger / Bismark

A tool to map bisulfite converted sequence reads and determine cytosine methylation states
http://felixkrueger.github.io/Bismark/
GNU General Public License v3.0
366 stars 101 forks source link

Confusion on how coverage is estimated #675

Closed RJDan closed 1 week ago

RJDan commented 1 week ago

Hi I apologise if this is a silly question but I am feeling thoroughly confused about the meaning of coverage.

I have seen this : https://github.com/FelixKrueger/Bismark/issues/338 and this : https://github.com/FelixKrueger/Bismark/issues/338 and I understand how average coverage is estimated (manually) as a value relative to the total genome size from the number of reads.

This is where I get confused. The 'fold coverage' here: https://github.com/FelixKrueger/Bismark/issues/47 and the 'percent genomic coverage', from the nucleotide_stats output here :

(di-)nucleotide count sample    percent sample  count genomic   percent genomic coverage
A       274599799       30.77   490449567       32.29   0.560
C       170032783       19.05   268982894       17.71   0.632
G       171292353       19.20   268948313       17.71   0.637
T       276428520       30.98   490344872       32.29   0.564
AA      92247813        10.42   174361572       11.48   0.529

These are not number of reads at a particular position, but rather 'what proportion of the genome-wide base combination xxx do we have represented in our data'. i.e. it has nothing to do with the number of reads at any position. Do I understand this correctly?

And then just to be sure I understand the final 'coverage' value for C, the coverage reported by multiqc in the 'general_stats' file, e.g. :

Sample  Bismark_mqc-generalstats-bismark-percent_cpg_meth       Bismark_mqc-generalstats-bismark-percent_chg_meth       Bismark_mqc-generalstats-bismark-percent_chh_meth       Bismark_mqc-generalstats-bismark-total_c Bismark_mqc-generalstats-bismark-C_coverage
split1_20220720.A-o287562_01-C1P_R1_val_1       16.4    1.6     1.2     743442316.0     0.632  

These are the average number of reads (after all the trimming, deduplication, etc) at any C for the particular sample. These values have nothing to do with the number of C in our dataset relative to what is in the reference genome?

FelixKrueger commented 1 week ago

'what proportion of the genome-wide base combination xxx do we have represented in our data'. i.e. it has nothing to do with the number of reads at any position. Do I understand this correctly?

This is correct. If I remember this correctly this measure was meant to identify whether certain treatments or library preps introduce biases, e.g. favour or avoid GC rich sequences altogether.

Just for the record, the 'percent genomic coverage' from the nucleotide_stats output are 2 separate fields: percent genomic and coverage (which is tricky to see in a non tab-separated form).

Regarding the MultiQC report, I am not actually quite sure what this number is, maybe you can raise this with Phil or Vlad over the MultiQC repo?

I am a little confused because the coverage 0.632 is the same as in your example above, but this could be just coincidental... If memory serves me right it might also be in the correct area if you divide 743442316 (Cs covered in the sample) by 1.2 billion or so (the number of Cs in the human genome). Are you supplying the nucleotide coverage report to MultiQC when running it?

As a final word, I have to confess that I haven't really looked at this coverage statistic in many years....

RJDan commented 1 week ago

Thanks for the quick reply.

I had a look over some other values from other data and data with higher coverage sequencing had multiQC coverage values > 1, so that precludes it being a 'proportion of genome-wide cytosines'.

Would it be possible to only use 'coverage' to mean one thing in the reports to make it less confusing?

Yes, multiQC was provided with the nucleotide reports.

FelixKrueger commented 1 week ago

I don't think it was ever meant to be an indication for 'proportion of genome-wide cytosines', but rather a (rather crude) measure or fold-coverage. e.g. we assume 1 billion Cs in the genome, and the data has 15 billion methylation calls, the rough coverage would be 15X.

The fraction of Cs covered genome-wide can be calculated using the total Cs in the genome, and the number of positions (lines) in the bedGraph/coverage file.

RJDan commented 1 week ago

Thank you for the help!