More sequence contexts for modified base calls (feature request)

TeiturAK commented 2 years ago

Hi,

thank you for a very convenient and easy to use tool!

I work with plants where, apart from CpG, CHH and CHG motifs are also common for 5mC modifications.

Would it be possible to have these added sequence filters added as a feature similar to the --cpg flag?

Best Teitur

colindaven commented 2 years ago

I second this request, if this is the right place for it - and yes, also the thanks for the nice easy tool

cjw85 commented 2 years ago

I'll look into this, the CpG filter was thrown in as an afterthough as something that would cut down the size of the output.

I'm thinking about how multiple options given on the command-line may combine to produce a result. Do you want to be able to produce files with both CHH and CHG together? And still keep a record of what's what? There's scope to use column 4 of the output to record e.g. "5mC,CHH" and "5mC,CHG". My recollection is that Bismark outputs distinct files. Again this is all a bit related to their being now good specification for recording this data (we seriously considered using VCF instead, treating modification as a substitution and using allele frequency).

colindaven commented 2 years ago

For my part, CpG, CHG and CHH should be separate options producing separate files. I think they will be primarily viewed in JBrowse etc, where different tracks are most effective.

I also think Bismark produced three separate files. I would strongly prefer to use the bedMethyl format with conversion to Bigwig for visualization, rather than squashing it into VCF. This was well accepted in the past, whereas VCF would be confused with SNV data.

Thanks

cjw85 commented 2 years ago

whereas VCF would be confused with SNV data.

This is interesting, as I think we will begin to see a cultural shift away from treating modified bases as distinct from substitutions to a different base in the set {A,C,G,T}. We shied away from using VCF as we thought it would be too much for some to bear. As detailed in the README of this project bedMethyl is so poorly defined. That was OK when bisulfite and Bismark were the only kids in town, but now causes real headaches for anyone trying to implement basecallers and aggregators. We also thought of breaking from bedMethyl entirely and more solidly defining a BED3+N format; but decided that users would complain (somewhat ironically) of us not using a standard format.

See also comments here, which start to peel back the layers of wanting to nicely partition modification as a discrete phenomenon.

colindaven commented 2 years ago

I've thought about this and understand why you want to move to VCF long term to represent all modified bases accurately. If you are just working with 5mC, then a single bedGraph/bigwig is fine. But if you have 10+ modifications, as is potentially possible with ONT sequencing, then these simple quantitative formats are not sufficient.

Yet VCF is not really used as a quantitative format (at least, yet, to my knowledge). Biologists really want to inspect data like methylated bases against a genomic feature backdrop, as they do with RNA-seq in bigwig format. However, some genome browsers like IGV and JBrowse show VCF SNVs as bars with total bases vs alternative bases in stacks, which works very well. But how would you do this from one VCF with all SNVs and all modified bases, the information density would be huge.

It would be great if bedMethyl or the VCF changes discussed here were formally picked up by eg GA4GH, though I see not all formats have been covered there yet. https://www.ga4gh.org/genomic-data-toolkit/

Potentially, all that is needed is a VCF to bedgraph/bigwig converter for tracks which are intended to be quantitative - provided the VCF is carefully designed and specified ....

cjw85 commented 2 years ago

Use of VCF would have to make heavy use of tags, much like a lot of variant calling debugging information is placed into VCFs. Many of the methods for calling methylation with Nanopore data are approaching the problem in a very analogous way to how variant calling of canonical bases is performed (with some sort of hypothesis testing between [incomplete] sets of possibilities). I think I'm correct in saying that the current Guppy modified-base basecalling is the only contemporary method that doesn't do this (making a free choice of call); though the techniques in bonito-remora return to something more like variant calling.

More pragmatically, I'm working today to output independent CpG, CHG, and CHH bed files given command-line boolean flags of --cpg --chh --chg. 😄

cjw85 commented 2 years ago

I have this afternoon started on an implementation of combined CpG/CHG/CHH filtering to separate files. This is currently on a branch if anyone wants to test it. I haven't thoroughly tested this code, there's a lot of fiddly logic to filter the motifs.

https://github.com/epi2me-labs/modbam2bed/tree/multifilt

TeiturAK commented 2 years ago

Thank you very much for this! I will try it out!

colindaven commented 2 years ago

Thanks @cjw85 I will give it a go and report back in a few weeks.

cjw85 commented 2 years ago

I've just written some tests for the CHH and CHG motif detection and filtering. Version 0.5.1 should appear shortly on GitHub and the conda repository. Please give it a try and let me know if it doesn't behave as intended; if someone can provide a smallish test dataset I'm happy to try it.

TeiturAK commented 2 years ago

I have now given the new branch a try with the --chh and --chg flags. I've analyzed the sequence context +/- 3 bp around the methylated Cs using other software to confirm that I'm getting the expected output! I've not found any discrepancies.

Thank you very much for this!

colindaven commented 2 years ago

For those coming later, a potential command is something like

modbam2bed -m 5mC -t 8 -e x.fasta mod_mappings.sorted.bam --cpg --chg --chh

epi2me-labs / modbam2bed

More sequence contexts for modified base calls (feature request) #9