PacificBiosciences / pb-CpG-tools

Collection of tools for the analysis of CpG data
BSD 3-Clause Clear License
74 stars 6 forks source link

modification probability #17

Closed yuntwang closed 2 years ago

yuntwang commented 2 years ago

Hi, a little doubt. Here is my analysis:

Chr01 13218 13219 6.4 hap2 4 0 4 0.0 Chr01 13239 13240 4.8 hap2 4 0 4 0.0 Chr01 13247 13248 5.0 hap2 5 0 5 0.0 Chr01 13254 13255 4.8 hap2 5 0 5 0.0 Chr01 13268 13269 5.0 hap2 4 0 4 0.0 Chr01 13296 13297 4.9 hap2 5 0 5 0.0 Chr01 13310 13311 4.9 hap2 6 0 6 0.0 Chr01 13316 13317 4.9 hap2 6 0 6 0.0 Chr01 13324 13325 4.9 hap2 6 0 6 0.0 Chr01 13331 13332 4.9 hap2 6 0 6 0.0

Information per column: reference name start coordinate end coordinate modification probability haplotype coverage estimated modified site count (extrapolated from model modification probability) estimated unmodified site count (extrapolated from model modification probability) discretized modification probability (calculated from estimated mod/unmod site counts)

Is the real methylation site the third column? Column 7 is the potential estimated number of reads supporting methylation? If this is the case, why does the number of reads support 0, but there is still a methylation level?

dportik commented 2 years ago

@yuntwang this is the expected behavior when running the model-based mode.

The model adds a correction to the methylation probability, and the number of methylated/unmethylated reads is then extrapolated from this corrected value.

As you can see, the 4 unmethylated and 0 methylated reads is a much closer fit to the estimate of ~5% modification probability for that column. The alternative is 3 unmethylated and 1 methylated reads, which is closer to ~25% modification probability (and not ~5%) for that column. This would be a poor extrapolation and the results would be misleading.

The extrapolation is expected to be better with higher coverage, and much more coarse with lower coverage. It might be useful to use the minimum coverage filter to remove sites with <10x or even <20x coverage.