dorado modified-bases 5mCG_5hmCG visualization

wietingj commented 6 months ago

Hi Shian,

The latest dorado basecalling models, e.g. dna_r10.4.1_e8.2_400bps_hac@v4.3.0, only support combined 5mCG_5hmCG or 5mC_5hmC for modified basecalling, not just 5mCG as before. However, the newest Oxford chemistry requires the latest basecalling models. So the modbams will contain both 5mCG AND 5hmCG modifications.

Using the NanoMethViz ModBamFiles and the ModBamResult function, this seems to lead to the following phenomenon regarding the visualization in NanoMethViz using plot_regions (avg_method = "mean", but same for "median"): The aggregated trends move far below the visualized individual readings/spaghetti. Here is an example with this effect using modbams from dorado basecaller --modified-bases 5mCG_5hmCG and below an example derived from an older basecalling model supporting only 5mCG to demonstrate the difference.

example_5mCGand5hmCG.pdf Example_5mCGonly.pdf

I guess the individual 5hmC reads/spaghettis move around the zero line, but for aggregation it is averaged together with 5mC? Or might there be another explanation for this effect?

Is there a way to visualize 5mC and 5hmC separately like in adamewing methylartist? Or might 5hmC be calculated out?

Am i right, that this problem also affects the statistics when I use the interface? modbam_to_tabix(mbr), methy_to_bsseq(mbr_tabix) for use in DSS differential methylation analysis?

Thanks for your answer and help!

Best regards!

Shians commented 6 months ago

Hello, very sorry for the late reply. I am actually not sure what is causing this, NanoMethViz should only be visualising one of the methylation codes, either "m" (default) or "h" (5hmC) depending on how you constructed your ModBamResult. There is no way to do both at the same time, but they should always be done separately and having 5hmC in the data shouldn't change the plotting.

Could you download the latest version of NanoMethViz and try again? If it's still broken then could you use query_methy to extract out data from the problematic region so I can have a look?

wietingj commented 6 months ago

Hey, thanks for your reply. I'm still facing the same problem with 3.0.1. Here is the output:

sessionInfo() R version 4.4.0 (2024-04-24) Platform: aarch64-apple-darwin20 Running under: macOS Sonoma 14.4.1 other attached packages: [1] NanoMethViz_3.0.1 ggplot2_3.5.1 BiocManager_1.30.23

sample <- c("5mCG_5hmCG_testfile_a", "5mCG_5hmCG_testfile_b") group <- c("testA","testB") sample_anno <- data.frame(sample, group, stringsAsFactors = FALSE) exon_tibble <- get_exons_hg38() methy <- ModBamFiles(samples = sample, c("/Volumes/SSD1TB/test_files/5mCG_5hmCG_testfile_a.bam", "/Volumes/SSD1TB/test_files/5mCG_5hmCG_testfile_b.bam")) mbr <- ModBamResult(methy, sample_anno, exon_tibble) plot_region(mbr,"chr15",51277199,51286600, avg_method = "mean",spaghetti = T,heatmap = F) [W::hts_idx_load3] The index file is older than the data file: /Volumes/SSD1TB/test_files/5mCG_5hmCG_testfile_a.bam.bai [W::hts_idx_load3] The index file is older than the data file: /Volumes/SSD1TB/test_files/5mCG_5hmCG_testfile_b.bam.bai

testmethy5mCG5hmCG_newversion

example_query <- query_methy(mbr, "chr15",51277199,51286600) [W::hts_idx_load3] The index file is older than the data file: /Volumes/SSD1TB/test_files/5mCG_5hmCG_testfile_a.bam.bai [W::hts_idx_load3] The index file is older than the data file: /Volumes/SSD1TB/test_files/5mCG_5hmCG_testfile_b.bam.bai

head(example_query, 20) sample chr pos strand statistic read_name mod_prob
1 5mCG_5hmCG_testfile_a chr15 51277293 + 2.18 ea5ba00f-267a-4879-b767-2f692072… 0.898 2 5mCG_5hmCG_testfile_a chr15 51277446 + 4.84 ea5ba00f-267a-4879-b767-2f692072… 0.992 3 5mCG_5hmCG_testfile_a chr15 51277460 + 3.57 ea5ba00f-267a-4879-b767-2f692072… 0.973 4 5mCG_5hmCG_testfile_a chr15 51277626 + 5.54 ea5ba00f-267a-4879-b767-2f692072… 0.996 5 5mCG_5hmCG_testfile_a chr15 51277628 + 0.521 ea5ba00f-267a-4879-b767-2f692072… 0.627 6 5mCG_5hmCG_testfile_a chr15 51277819 + -5.54 ea5ba00f-267a-4879-b767-2f692072… 0.00392 7 5mCG_5hmCG_testfile_a chr15 51278070 + 4.43 ea5ba00f-267a-4879-b767-2f692072… 0.988 8 5mCG_5hmCG_testfile_a chr15 51278072 + 4.14 ea5ba00f-267a-4879-b767-2f692072… 0.984 9 5mCG_5hmCG_testfile_a chr15 51278114 + -5.54 ea5ba00f-267a-4879-b767-2f692072… 0.00392 10 5mCG_5hmCG_testfile_a chr15 51278240 + -5.54 ea5ba00f-267a-4879-b767-2f692072… 0.00392 11 5mCG_5hmCG_testfile_a chr15 51278272 + 1.84 ea5ba00f-267a-4879-b767-2f692072… 0.863 12 5mCG_5hmCG_testfile_a chr15 51278277 + 2.09 ea5ba00f-267a-4879-b767-2f692072… 0.890 13 5mCG_5hmCG_testfile_a chr15 51278345 + 3.43 ea5ba00f-267a-4879-b767-2f692072… 0.969 14 5mCG_5hmCG_testfile_a chr15 51278349 + 5.54 ea5ba00f-267a-4879-b767-2f692072… 0.996 15 5mCG_5hmCG_testfile_a chr15 51278454 + 3.91 ea5ba00f-267a-4879-b767-2f692072… 0.980 16 5mCG_5hmCG_testfile_a chr15 51278606 + -Inf ea5ba00f-267a-4879-b767-2f692072… 0 17 5mCG_5hmCG_testfile_a chr15 51278614 + -5.54 ea5ba00f-267a-4879-b767-2f692072… 0.00392 18 5mCG_5hmCG_testfile_a chr15 51278655 + -Inf ea5ba00f-267a-4879-b767-2f692072… 0 19 5mCG_5hmCG_testfile_a chr15 51278785 + 4.84 ea5ba00f-267a-4879-b767-2f692072… 0.992 20 5mCG_5hmCG_testfile_a chr15 51278880 + 4.43 ea5ba00f-267a-4879-b767-2f692072… 0.988

Thanks for your help!

Bests!

Shians commented 6 months ago

If it's not too big, could you use saveRDS() to save the example_query data and upload it here along with a data frame of the sample and group columns of your ModBamResult object?

wietingj commented 6 months ago

Here is the example data:

example_query.rds.gz example_sample_anno.rds.gz

Best regards.

Shians commented 6 months ago

I think you should be able to set options("NanoMethViz.site_filter" = 5L) before you run any plotting.

This is an issue with ModBAM calling methylation along the reads rather than the genome, allowing sequencing errors to produce a lot of spurious CG sites in the sequence that is not present in the genome. These sites are generally not called methylated and when there are many of them it drags the average down significantly as seen here. The option set allows sites that don't have at least 5 coverage to be excluded, removing many of these sites that arise from sequencing error.

The default is set to just 1L, I may increase it to 3 by default in the future but I am worried about people doing low-coverage work. Alternatively I could provide a reference filtering function that takes a reference genome and deletes any sites which don't match the motif in the reference.

wietingj commented 6 months ago

Thanks for the detailed feedback. For the small test dataset, "NanoMethViz.site_filter" = 5L leads to a sufficient result. However, when applied to larger datasets, I would have to use "NanoMethViz.site_filter" = 30L to avoid the above phenomenon with the aggregated trends, which could lead to a significant loss of data. In this respect, I think it would be great if you could provide a reference filtering function instead. Thanks in advance. Best regards.

Shians / NanoMethViz

dorado modified-bases 5mCG_5hmCG visualization #38