Closed pailloufat-stack closed 1 year ago
pb-CpG-tools provides a utility to summarize site methylation probabilities. This tool is separate from the process used to call 5mC modifications on individual reads.
The site methylation probabilities can be summarized using either a machine-learning model or a simpler pileup count model (see https://github.com/PacificBiosciences/pb-CpG-tools#output-modes-and-option-details). If you'd like a control for the machine-learning model, you might consider running this tool in count pileup mode to see if that better fits your intuition for these samples.
Thanks for your reply. I'll try.
I have one more question: should the model pileup_calling_model.v1.tflite be used only for human datasets? I mean, would this model infer only human 5mC patterns because it was trained on the M.SssI-treated and amplified human DNA datasets?
The generalization of the machine-learning model across species hasn't been systematically evaluated. Given the training scheme, I might expect it would be applicable to something like mouse, but it is not clear that it would transfer to prokaryotes. I would start with the count pileup mode, plus any truth data, and compare these against the model pileup output for this case to assess whether the model can reasonably be applied here.
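A comparison like the one suggested above could be sketched roughly as follows. This is a minimal Python sketch, not part of pb-CpG-tools. It assumes both runs wrote BED files whose first four columns are reference name, start, end, and modification score (check the README column description for your version). The helper names `read_scores` and `compare_modes`, the contig name, and the toy values are all made up for illustration.

```python
import io


def read_scores(handle):
    """Map (chrom, start, end) -> modification score from a BED-like stream."""
    scores = {}
    for line in handle:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and header/comment lines
        fields = line.split()
        site = (fields[0], int(fields[1]), int(fields[2]))
        scores[site] = float(fields[3])  # assumed: column 4 is the score
    return scores


def compare_modes(model_scores, count_scores):
    """Per-site score difference (count minus model) for sites in both files."""
    shared = sorted(set(model_scores) & set(count_scores))
    return {s: round(count_scores[s] - model_scores[s], 3) for s in shared}


# Toy input standing in for the two output files (hypothetical values).
model_bed = io.StringIO("ctg1\t100\t101\t10.0\nctg1\t200\t201\t50.0\n")
count_bed = io.StringIO("ctg1\t100\t101\t12.5\nctg1\t200\t201\t50.0\n")

diffs = compare_modes(read_scores(model_bed), read_scores(count_bed))
identical = sum(1 for d in diffs.values() if d == 0)
print(f"{len(diffs)} shared sites, {identical} with identical scores")
```

With real data, replace the `io.StringIO` objects with `open(...)` on the two BED files. Large or systematic differences between the two modes would be a reason to lean on the count output and any available truth data.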
The --pileup-mode count and --model pb-CpG-tools-v2.3.1-x86_64-unknown-linux-gnu/models/pileup_calling_model.v1.tflite arguments give exactly the same results on my 5 datasets.
Hi, sorry for the delay in getting to this. I did some quick tests to try to reproduce what you describe but haven't had any luck. Can you share a more specific example of the problem?
Hi, no worries.
I have 5 prokaryotic datasets, and I am looking for methylation motifs, without bias. I already ran the ipdSummary pipeline to detect the 4mC and 6mA motifs. It went pretty well. For the 5mC motifs, here is what I have done so far:
1 - Run primrose on my CCS reads
2 - Align with pbmm2
3 - Run pb-CpG-tools-v2.3.1-x86_64-unknown-linux-gnu/bin/aligned_bam_to_cpg_scores in both pileup modes, model and count
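For readers who want to reproduce these three steps, the commands could look roughly like the ones this sketch prints. It only builds and prints command lines and runs nothing. All file names are hypothetical. Only `--pileup-mode count` and `--model` appear in this thread; the other flags (`--bam`, `--output-prefix`, `--sort`, `--preset`) are written from memory of each tool's usage and should be confirmed with `--help` before use.

```python
import shlex

# Hypothetical file names; replace with your own.
READS = "sample.hifi_reads.bam"
READS_5MC = "sample.5mc.hifi_reads.bam"
REFERENCE = "assembly.fasta"
ALIGNED = "sample.5mc.aligned.bam"
TOOL_DIR = "pb-CpG-tools-v2.3.1-x86_64-unknown-linux-gnu"


def build_commands():
    """Return the command lines for the three steps as argument lists."""
    scorer = f"{TOOL_DIR}/bin/aligned_bam_to_cpg_scores"
    model = f"{TOOL_DIR}/models/pileup_calling_model.v1.tflite"
    return [
        # 1 - add 5mC tags to the CCS reads
        ["primrose", READS, READS_5MC],
        # 2 - align to the assembly (flags assumed, check `pbmm2 align --help`)
        ["pbmm2", "align", REFERENCE, READS_5MC, ALIGNED, "--sort", "--preset", "CCS"],
        # 3a - site scores from the machine-learning model
        [scorer, "--bam", ALIGNED, "--output-prefix", "sample.model", "--model", model],
        # 3b - site scores from the simple count pileup
        [scorer, "--bam", ALIGNED, "--output-prefix", "sample.count", "--pileup-mode", "count"],
    ]


for cmd in build_commands():
    print(shlex.join(cmd))
```

Writing the two runs to different output prefixes keeps the model and count results from overwriting each other.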
I got (showing only the first rows for one dataset):
model:
TRV62_NO042_chromosome_flye 31 32 7.3 Total 677 49 628 7.2
TRV62_NO042_chromosome_flye 59 60 5.0 Total 681 34 647 5.0
TRV62_NO042_chromosome_flye 225 226 4.1 Total 683 28 655 4.1
TRV62_NO042_chromosome_flye 415 416 4.1 Total 684 28 656 4.1
count:
TRV62_NO042_chromosome_flye 31 32 11.5 Total 677 78 599 0.733 0.083
TRV62_NO042_chromosome_flye 59 60 16.0 Total 681 109 572 0.752 0.109
TRV62_NO042_chromosome_flye 225 226 16.7 Total 683 114 569 0.734 0.143
TRV62_NO042_chromosome_flye 415 416 34.5 Total 684 236 448 0.747 0.169
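As a quick sanity check on rows like these, the percentage of modified reads can be recomputed from the count columns. The sketch below assumes, which the thread does not state, that column 6 is coverage and columns 7 and 8 are the modified and unmodified counts; that is how the rows above appear to be laid out. `percent_modified` is a hypothetical helper, not part of pb-CpG-tools.

```python
def percent_modified(row):
    """Recompute 100 * modified / coverage from one whitespace-separated row."""
    fields = row.split()
    coverage = int(fields[5])     # assumed: column 6 is coverage
    modified = int(fields[6])     # assumed: column 7 is the modified count
    unmodified = int(fields[7])   # assumed: column 8 is the unmodified count
    if modified + unmodified != coverage:
        raise ValueError("counts do not add up to coverage")
    return round(100.0 * modified / coverage, 1)


# First row of each output shown above.
model_row = "TRV62_NO042_chromosome_flye 31 32 7.3 Total 677 49 628 7.2"
count_row = "TRV62_NO042_chromosome_flye 31 32 11.5 Total 677 78 599 0.733 0.083"

print(percent_modified(model_row))  # 7.2
print(percent_modified(count_row))  # 11.5
```

For this site the two outputs give different fractions (7.2 versus 11.5) at the same coverage of 677.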
I have one more question (I am very new to machine learning): can the CNN model, trained on amplified DNA (unmethylated) and M.SssI-treated DNA (methylated), be compared to the in silico control of the ipdSummary pipeline? I mean, do both pipelines "compare" data to an existing model, trained and tested on a previous dataset?
Thanks @pailloufat-stack, I'm not sure we can comment on ipdSummary in depth here. The core modeling concepts are still relevant but have been updated with a more data-driven approach, given the availability of a large positive training set from M.SssI treatment. I'm not sure whether the original item on this ticket is still an issue, so please open a new ticket if it is.
Hi, I have 5 prokaryotic datasets. I know the HK model was trained on human and mouse DNA, but could I trust the results of pb-CpG-tools on my datasets? In the 5 *.bed results files, there are about 75,000 5mC sites detected.
Best