GMOD / jbrowse-components

Source code for JBrowse 2, a modern React-based genome browser
https://jbrowse.org/jb2
Apache License 2.0

Improved rendering of modifications #4647

Closed cmdcolin closed 2 weeks ago

cmdcolin commented 2 weeks ago

Rendering before this PR

image

the current main branch does not really take the modification probability into account, but doing so is important for visualizing the data properly. this is because the data can report multiple different modifications at the same position, and it is best to only render the most probable one.

SAMtags.pdf shows an example where both 5mC and 5hmC are listed for each position, and the user should choose the modification that has the highest probability (or even choose "no modification" if neither has a high probability). specifically, the probabilities of multiple modifications at the same position can't sum to more than 1.0, and i believe that, chemically, only one modification is even possible at a given base; otherwise it would get its own chemical code. therefore, double counting modifications at a single position is misleading, but that's what we have in our UI on main
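
A minimal sketch of the "render only the most probable call" idea, to make the selection rule concrete. The `ModCall` shape and the `mostProbableCall` name are hypothetical and are not taken from the JBrowse code; the rule that "no modification" wins when it is at least as likely as the best listed call is also an assumption.

```typescript
// Hypothetical shape for one candidate modification call at a read position.
interface ModCall {
  type: string // e.g. 'm' for 5mC, 'h' for 5hmC
  prob: number // absolute probability in [0, 1]
}

// Return the single most probable call, or undefined when "no modification"
// (the remainder 1 - sum of listed probabilities) is at least as likely as
// the best listed call. This tie-breaking rule is an assumption.
function mostProbableCall(calls: ModCall[]): ModCall | undefined {
  let best: ModCall | undefined
  let total = 0
  for (const call of calls) {
    total += call.prob
    if (!best || call.prob > best.prob) {
      best = call
    }
  }
  const unmodifiedProb = Math.max(0, 1 - total)
  return best && best.prob > unmodifiedProb ? best : undefined
}
```

With 5mC at 0.8 and 5hmC at 0.1, only the 5mC call would be drawn; with 5mC at 0.2 and 5hmC at 0.1, the remaining 0.7 favors "no modification" and nothing would be drawn.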

Rendering after this PR

image

IGV rendering

i modified the rendering to more closely follow IGV. it is a bit of a copycat, but i believe IGV does a lot of things right.

image

text from SAMtags about the probabilities in the ML tag

https://samtools.github.io/hts-specs/SAMtags.pdf

"Note where several possible modifications are presented at the same site, the ML values represent the absolute probabilities of the modification call being correct and not the relative likelihood between the alternatives. These probabilities should not sum to above 1.0 (≈ 256 in integer encoding, allowing for some minor rounding errors), but may sum to a lower total with the remainder representing the probability that none of the listed modification types are present. In the example used above, the 6th C has 80% chance of being 5mC, 10% chance of being 5hmC and 10% chance of being an unmodified C"

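The quoted passage can be illustrated with a short sketch that turns 8-bit ML values into probabilities and computes the "unmodified" remainder. The function names are hypothetical, and using the midpoint of each bucket `[N/256, (N+1)/256)` is an assumption; the spec only defines the range each integer covers.

```typescript
// Convert one 8-bit ML value (0..255) to a probability. Integer N covers
// the range N/256 to (N+1)/256; taking the midpoint is an assumption.
function mlToProbability(ml: number): number {
  return (ml + 0.5) / 256
}

// Probability that none of the listed modifications is present at a site,
// i.e. the remainder after summing the listed calls (clamped at 0 to absorb
// rounding in the integer encoding).
function unmodifiedProbability(mlValues: number[]): number {
  const total = mlValues.reduce((sum, ml) => sum + mlToProbability(ml), 0)
  return Math.max(0, 1 - total)
}
```

For the 80% / 10% example, ML values of roughly 204 and 25 decode to about 0.80 and 0.10, leaving about 0.10 for the unmodified C.
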
coverage calculation

in modification mode, the snpcoverage track no longer draws the raw number of modifications at a position, but the proportion of "modifiable" bases at that position that are modified. this aligns better with user expectations and is what IGV does. for example, at a CpG, only half the reads (the forward strand ones) will be a C at the CpG position, and the other half will be a G on the reverse strand, but only the C can be methylated. if all the C's there are methylated, we draw the position as "fully modified", maxing out the y-axis of the coverage track
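
A rough sketch of the difference between the raw count and the proportion of modifiable bases. The `PositionTally` shape and `modifiedFraction` name are hypothetical and are only meant to show the arithmetic, not the actual snpcoverage implementation.

```typescript
// Hypothetical per-position tally for the coverage track.
interface PositionTally {
  modifiable: number // reads whose base here can carry the modification
  modified: number // of those reads, how many have the modification called
}

// Fraction of modifiable bases that are modified; 0 when no read at this
// position has a modifiable base.
function modifiedFraction({ modifiable, modified }: PositionTally): number {
  return modifiable === 0 ? 0 : modified / modifiable
}

// Example: 20 reads cover a CpG, 10 of them show the C, all 10 methylated.
const example: PositionTally = { modifiable: 10, modified: 10 }
const fraction = modifiedFraction(example)
```

Dividing by the 10 modifiable reads rather than the 20 total reads is what lets a fully methylated site reach the top of the y-axis instead of stopping at half height.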

modification colors

I adjusted the color scheme of modifications to match IGV

retained the "methylation" mode

i was hoping to maybe get rid of this, but the data files for BAM simply do not indicate unmethylated positions in some cases

to my knowledge, IGV is not able to draw the unmethylated CpGs in this case

image

image

cmdcolin commented 2 weeks ago

this is a follow up to closed PR here https://github.com/GMOD/jbrowse-components/pull/4642