Shians / NanoMethViz

https://shians.github.io/NanoMethViz/
Apache License 2.0
23 stars 1 forks source link

Accepting bam file input with modified read data. #24

Closed faulk-lab closed 1 year ago

faulk-lab commented 1 year ago

Now that bam files have standardized on including modified base information, with per-read data, is it possible for you to create an importing subroutine that can handle .bam input. Or can you suggest a way for me to convert bam files to the tabix format that Nanomethviz requires?

Oxford nanopore's guppy tool and dorado tool and PacBio's tools output modified bam files now so having import capability would be a huge benefit. I currently have many bams I would like to visualize in Nanomethviz. Here's a typical bam entry. Note the MM and ML flags.

34084713-cc44-4731-89fa-e6d800a16fc8 2048 chr1 1 27 312H110M1D6M1D42M1D28M1D80M1D6M1D28M1D18M1D32M1D12M1D40M1D3M1D17M1D2M1D22M1I11M2I13M1D2M2I13M9I5M2I6M1I49M4I17M2I19M1D5M5I13M5I30M5I3M1D17M1D18M6I196M1D194M1D22M1I30M1I43M4877H * 0 0 AACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCAACCCTAACCCTAACCCTAACCCCAACCCCAACCCTAACCCTAACCCCAACCCCAACCCCAACCCTAACCCCAACCCCAACCCCAACCCTAACCCCAACCCCAACCCTAACCCCAACCCCAACCCCAACCCTAACCCCAACCCCAACCCCAACCCTAACCCCAACCCCAACCCTAACCCTAACCCCAACCCTAACCCTAACCCTAACCCTAGCCCTAACCCTAACCCTAACCCTAGCCCTTAACCCTAACCCTAACCCTAGCCCTAACCCTAACCCTAACCCTAACCCTAACCCTAGCCCTAGCCCTAACCCTAGCCCTAGCCCTAACCCTAACCCTAACCCTAACCCTAACCCTAGCCCTAACCCTAACCCTAACCCTAGCCCTAACCCTAACCCTAGCCCTAGCCCTAACCCTAGCCCTAACCCTAACCCTAACCCTAACCCTAACCCTAGCCCTAACCCTAGCCCTAGCCCTAGCCCTAGCCCTAGCCCTAGCCCTGACCCTGACCCTGACCCTGACCCTGACCCTGACCCTGACCCTAGCCCTGACCCTGACCCTAGCCCTGACCCTGACCCTCACCCTCACCCTGACCCTCACCCTCACCCTCACCCTCACCCTAACCGTAACCCTGAAGCTGACCCTGACCCTGACCCTGACCCTAAGCCTGACCCTGACCCTAAGCCTGACCCTGACCCTGACCCTGACCCTGACCCTGACGCTAAACCCTGACCCTCACCCTGACCCTGACGCTAAACCCTGACCCTGACCCTGACCCTGACCCTGACCCTAACCCTA GH@?@@BEEEEIFIEJJMJHCBBCCGBCDFKGECFHJJHCG>CCCIOPMLGHHGHFCCDEIGFEDHFGAABFDDBCEHHFDCEGHGFCCDFGGHHIGGABCCAEBBCG@ABBBBCIFDFFGFFCFIEEDCDEFFFDGGGFFDHJDD?FEFGJFGGEHHGGEEDFBFEECDCEDAFLFFEGGGDEDBFEBCBBBGFHHGIJECDDDEB@A<GGCCDCDOEHFFEFDDEHIJ@ACBBHIFGOEJAFC;ABCEFDFJILIDAA@EGHIGFGHCLIIHJIJFDEFJJIFFIEFDCEGDIGOILDHKFDDDDIFEFHJGEFDEEFGGFHGEBACEFGIFJHIKHGIHGEFIJH;BFIJ<JKCBB@CFFHIIHJLTJLGIJNIIIHGIKKHCEHGEJJDCFGHGDED9?<<<<@@DDGHGFILEM@HGB86EE@?@@EECABCEJ{GIPDEE@BAA?@>?AEEB@?BEDD>>66668ITHE@<>?BEB>>AFGC@@ACIEFEPGFFEDECDBABCEDDBBNIHBAA49/...1<;==>@?DFCCCBBBBBB<A;99:=A?==>>AA<:;@=8778:<??ABCEEEFGBIHA97AGBBBA@DDCGHCHFGIKHIHHFMJFGGJJJIHA:4-35//--.2/..//../,-**,/;;;?CCHIGFHHDJHHGFJKDJLFFGDIFFHHGKIKHFECCKIKKHGBCKBTPLNPAIJ;5:2CECIHEDDDDJHB>@AB+DDBEEFEGGEGHGGDEEGHKHEIGEEDDIF7777GFHHGEFFHHEDCEDFCHKGGJLFIHEGCGHFFFDDFEEGEFGEEEEFIGGG?BDG<<==FFKJDDDCFHEDDDBIECC>AAA@H>FECDEEFEFFGGJEGDFFEGDEFBBBBBFFBCDCFGECFEHFFEFFHGCECDHFEFCBFCDDEDFDDDCBE@AAAAIEFEDEEDDCEEHEFCDEFBBABCGEDBDEIDECCDGFDCECIHCHPJGGGHHJIFDBA@ABDDEHEEFCGEIGHCCCDEGDEEFEGDCEDCBDBDEDEBCGEFDABECCBBBEDFHHGEFFGHGIHFF@DCFCDCDEIFFGGEIGJGIJIGGFFFGFFGGFGKHFLHIFGBEHIEDCCDFHHHEDE@*****1224;>>??FIIH@FAAAA?@>=>>=DADFICCCBADDDFJGHLHHJOOMEEEFEMAAAAAIIGHGM{HFQM{NRLNIIMIH NM:i:177 ms:i:1409 AS:i:1370 nn:i:0 tp:A:P cm:i:18 s1:i:149 s2:i:176 de:f:0.1223 SA:Z:chr11,90103510,-,44S4144M680I1519S,23,1424; rl:i:1579 MM:Z:C+m?,694,41,14,22,1,1,5,2,1,2,1,0,1,3,1,6,1,1,2,0,1,1,1,2,2,1,1,1,2,1,8,2,2,2,1,0,1,6,2,1,1,1,3,1,2,2,3,1,1,3,2,5,3,3,2,2,1,4,1,3,1,3,2,1,1,2,1,3,1,2,1,1,1,1,3,1,1,1,1,3,0,1,1,1,2,2,1,1,1,2,1,8,2,2,2,1,0,1,6,2,1,1,1,3,1,2,2,3,1,1,3,2,5,2,3,2,2,1,4,1,3,1,6,2,1,1,2,1,3,1,2,1,1,1,1,3,1,1,1,1,2,0,1,1,1,2,1,1,2,2,1,8,2,2,2,1,0,1,3,1,0,6,3,3,1,2,1,2,1,1,1,2,2,0,1,3,2,1,3,5,1,1,0,6,4,1,3,2,0,1,1,1,1,1,1,1,3,3,1,2,2,4,1,2,1,5,0,3,2,5,2,2,1,5,2,1,1,3,1,2,3,3,1,2,1,3,1,3,1,2,1,4,2,4,1,5,2,2,1,2,2,2,3,5,2,1,2,2,2,1,3,2,2,2,2,6,1,1,1,2,1,4,2,1,3,2,1,2,2,1,1,6,2,1,2,6,2,1,1,3,1,3,1,4,1,1,0,1,8,3,1,1,2,3,3,1,2,1,3,3,1,2,2,2,1,0,4,2,1,2,3,1,1,3,2,1,2,3,7,1,0,1,3,1,3,1,2,7,1,2,2,2,1,3,1,3,6,3,1,2,1,3,3,1,1,3,0,5,1,3,4,3,1,2,3,3,2,2,1,7,4,2,2,2,1,3,1,2,1,1,1,2,2,2,3,1,2,6,3,1,3,1,2,1,3,3,1,2,1,3,4,1,2,3,2,0,1,1,2,2,2; ML:B:C,255,3,255,255,255,255,255,255,255,2,255,255,255,255,255,255,255,255,4,48,255,255,255,255,255,255,255,255,255,7,255,255,1,255,3,7,255,255,2,255,3,255,255,255,255,255,255,255,255,255,253,255,255,255,255,255,255,253,223,255,255,232,222,255,255,255,252,252,254,245,255,30,255,255,255,255,255,255,255,43,165,255,255,255,5,252,253,255,255,255,254,255,255,255,255,255,255,56,252,255,250,255,255,255,254,9,255,255,14,40,255,253,255,254,255,255,255,255,238,255,28,255,255,253,255,255,255,219,255,255,255,255,255,5,255,255,255,255,255,248,255,255,255,255,41,255,23,9,255,255,255,255,255,8,255,4,3,254,254,255,7,9,255,255,255,43,255,255,255,3,255,255,9,255,255,232,255,255,255,255,255,254,254,255,255,255,135,255,255,255,255,255,255,255,253,255,255,255,251,255,255,253,255,255,254,255,255,255,236,255,255,249,4,255,249,255,8,15,255,255,255,255,255,255,255,7,2,255,255,6,3,11,49,255,252,254,255,255,253,255,238,255,255,255,255,255,255,254,58,67,255,255,255,255,255,254,255,255,255,255,249,255,255,2,252,223,255,255,255,254,245,255,255,255,255,255,255,255,203,255,255,255,45,255,255,255,7,255,255,255,208,252,8,255,255,255,29,255,10,243,255,255,185,21,255,255,255,255,255,255,255,255,181,252,255,255,255,251,255,255,254,253,255,254,211,255,255,255,255,255,177,241,255,247,255,255,255,254,255,255,255,255,253,255,14,241,8,109,255,255,220,255,254,218,18,89,255,255,254,244,255,255,255,255,240,249,255,255,254,255,117,255,255,242,233,251,250,29,254,254,254,255,255,4,254,255,255,255,255,255,255,255,250,255,255,255,77,255,4,36,13

What can I do to help?

Shians commented 1 year ago

It's been on my mind for a while but I haven't had the time to implement it. The primary challenge is the fact that mod-bam is formatted relative to the read while NanoMethViz's framework assumes input coordinates relative to the reference genome. Importing will require that I parse the CIGAR and mod-bam strings carefully.

I think the current best solution is to go via https://github.com/epi2me-labs/modbam2bed and then add a subroutine to convert the resultant BED into NanoMethViz format. Eventually I will need to reimplement my own parser to do the conversion to BED as I'd like to keep all dependencies in R.

faulk-lab commented 1 year ago

I map the mod bams to the genome so they should have reference gene coordinates embedded. The example above was a poor example because it's repetitive, see below. You could restrict the input to require mapped bams only. It may be a problem that the mods are relative to the read though, I'm not sure.

I'm currently using modbam2bed to get summary files out for differential methylation. I'm not savvy enough to add a subroutine to convert the output to nanomethviz format myself. The bedmethyl file doesn't appear to have per-read output by default but modbam2bed does have a -c flag that allows "pileup". I have not used it so I don't know what the output looks like but if it spits out per read info, I could awk it into nanomethviz format. I will look tomorrow.

abbc4508-ed7a-4203-aa8f-a7e81e0c1050 2048 chr22 1046 1 1238H9M1I27M1I14M1D5M1D5M2I25M5D43M5I20M2I14M1D42M2I12M2I9M3D8M2D7M3D21M3I9M3I15M3D14M2D11M8478H * 0 0 ACCCTAACCACTAAACCTAACCCTAACCCCTACTGCTAAACCCTAACCCTAACCCCTACCGTTAAACCCTAACCCTAACCCTAATCCTAAACGCTAACCCTAAACCCTAACCCTAACCCTAGCCCTAAACCTAACAGAACCCCAACCCAAACCCTAAATCCAACCCCAAACCCAACCACAACCCTAATGCTAACACTAACCCCAATCCCATCTCTAAAACCATAACCCTAAAATCCTAACAACCCTAAATCTAACAACCCTAACCCTAACACTAACAAACCTAATGTTAGCAACCCTAAACCTAACAACTGTAACCCTGACTGACCCTAAC BJHIJFGGIKIHFHGHHKJLHEGGFGGIJLKCCBBDECCCADFBEHFFJIHJDFFIICC><<;:::887779;BIIHMGNIIKKJEEGEE6,-?=>ABIKGIGEEEGAHGEGJGGGDDFFEDCBB@EFEEDCBBEFGGGIKLIGGDEGFGGDDFEEGGFMHHFGFJGGGHHGGGEDFDFFEFFFBBAA@CDCDEEEFHDEHIFGEDCBDDCEEGFFFIJHHGEFGECCAHHHHOOIFIHHFFEHLHDEDFIMHKHFDFEFGEFFDIIGGFGIGJGHFMHJGGFHCCBBBBABCBDFIGIIJIEFDEEEEFEDELLHDCBCEFDEFCCBAB@ ms:i:273 AS:i:248 nn:i:0 tp:A:P cm:i:3 s1:i:44 s2:i:43 de:f:0.1646 SA:Z:chr8,20140,+,2108S7278M620I41S,41,2209;chr13,113555973,+,1224M137D8823S,14,284; rl:i:3000 MM:Z:C+m?,32,19,5,13,1,11,121,0,61,11,134,12,11,20,26,41,6,18,8,10,7,28,3,9,19,13,4,105,27,13,27,3,8,20,4,9,37,5,90,8,5,15,20,4,3,4,1,0,3,45,1,2,12,27,20,20,0,11,0,2,1,9,6,0,12,0,48,11,0,0,12,4,15,22,10,7,41,9,0,7,3,31,26,32,2,7,9,5,18,15,55,0,0,10,0,1,1,6,10,8,4,36,14,5,38,19,5,5,1,7,2,1,3,1,7,2,1,3,1,1,1,1,2,4,2,0,0,5,0,2,5,3,3,7,11,1,3,1,2,3,24,0,0,1,3,0,4,0,3,2,3,1,2,6,1,0,0,4,1,3,6,1,4,2,6,7,0,0,1,33,0,11,13,21,8,1,4,4,14,14; ML:B:C,2,255,255,255,255,255,255,255,255,255,252,255,255,2,3,255,255,255,255,255,255,2,255,255,255,255,255,255,255,4,255,255,255,1,255,255,255,255,3,211,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,251,255,2,2,255,255,255,255,255,255,2,255,255,255,109,2,255,255,255,250,243,255,255,255,255,255,1,2,255,2,255,255,255,5,255,255,255,255,255,101,255,255,255,255,255,4,254,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,178,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255 MD:Z:13C17A0C0C16^C5^A3C21C4^ACCCT3C28A6C8T5T10T4T2C3^T4C0T8C0C5C7T2C2T1A1C6C9C6^CCT8^CC1C5^CCT15C10C0C0C9C6^CCT3C0C7A1^CC2A8 NM:i:78

Shians commented 1 year ago

I'm not entirely sure what bedmethyl looks like, it was just from reading their documentation that it sounds like it output what we need, that is a read name, genomic position and some kind of methylation probability statistic. However it also sounds like they might be aggregating it at each position which is no good since we need the data for each read.

faulk-lab commented 1 year ago

Yes it is aggregated data. There is one row per CpG position, and columns list the number of unmodified, and modified calls along with percentage and other info. It's not per-read. Here's an example.

chr1 3799 3800 5mC 862 + 3799 3800 0,0,0 29 48.00 13 12 0 2 0 chr1 3800 3801 5mC 764 - 3800 3801 0,0,0 17 46.15 7 6 0 2 0 chr1 5311 5312 5mC 901 + 5311 5312 0,0,0 51 43.48 26 20 0 2 0 chr1 5312 5313 5mC 970 - 5312 5313 0,0,0 34 48.48 17 16 0 1 0 chr1 5608 5609 5mC 879 + 5608 5609 0,0,0 58 88.24 6 45 1 2 0 chr1 5609 5610 5mC 692 - 5609 5610 0,0,0 39 70.37 8 19 0 1 0 chr1 5610 5611 5mC 844 + 5610 5611 0,0,0 58 87.76 6 43 1 4 0 chr1 5611 5612 5mC 666 - 5611 5612 0,0,0 39 73.08 7 19 0 6 0

faulk-lab commented 1 year ago

I'm seeing that the modbamtools package is pulling out individual reads and plotting their methylation, so it can be done using the mapped bam after all.

Shians commented 1 year ago

Preliminary support has been implemented as of 15f03dd. I have only tested this with a small dataset that I've been given and have yet to fully document the feature. But it can be used by replacing the use of NanoMethResult with a ModBamResult object constructed as follows:

mbr <- ModBamResult(
    methy = ModBamFiles(
        samples = c("sample1", "sample2"),
        paths = c("/path/to/sample1", "/path/to/sample2")
    ),
    samples = samples, # sample annotation data frame
    exons = exons # exon annotations
)

all plots except for MDS should work. I have only considered MD and ML tags on the modbam with cytosine methylation so I can't guarantee it works with other modifications.