dohlee / metheor

:comet: Ultrafast DNA methylation heterogeneity calculation from bisulfite alignments (Lee et al., PLOS Computational Biology. 2023)
GNU General Public License v3.0
43 stars 8 forks source link

possible errors in MHL calculation #19

Open oleksii-nikolaienko opened 1 year ago

oleksii-nikolaienko commented 1 year ago

Hi, I am checking MHL values of my BAM files and see something unexpected. The sample BAM I can share holds alignments of amplicon bisulfite sequencing of five regions to a very high depth (5000-50000x) - can be downloaded from here due to size (40MB).

First, the -d option seem to be more restrictive than it should. Default (-d 10) leads to most of CpGs excluded: only 44 out of 138 with the coverage >1000 are included in the output. Specifying -d 100 leads to output with only 14 genomic positions.

Second, the MHL output values are quite strange. ~99% of reads in this BAM have almost no methylated cytosines and ~1% of reads are nearly fully methylated in CpG context. I expect to see MHL (much) lower than 0.01, but reported values include 0.65 for 17 positions, which does not seem to be correct.

Third, which is more a suggestion rather then an error - maybe metheor could skip records without XM tag, simply reporting their number. Unmapped records are allowed and often present in BAM and they shouldn't have XM in them. samtools can of course be used to create valid input for metheor, but maybe you consider this more permissive behaviour worth implementing.

And fourth (which is also a suggestion) - the positions reported should probably be 1-based rather than 0-based. Now it seems that CpG spans 3bp:

chr17 43125376 43125378 0.6556599

Tested this on arm64 Mac, manually compiled the latest metheor (2895cb3) with cargo build. I did not try the reference implementation of MHL, but maybe you have it up and running.

oleksii-nikolaienko commented 1 year ago

Hey, @dohlee, I wonder if you had a chance to look into this?