PacificBiosciences / pb-CpG-tools

Collection of tools for the analysis of CpG data
BSD 3-Clause Clear License

Run the pipeline on prokaryotic dataset #49

Closed pailloufat-stack closed 1 year ago

pailloufat-stack commented 1 year ago

Hi, I have 5 prokaryotic datasets. I know the HK model was trained on human and mouse DNA, but can I trust the results of pb-CpG-tools on my datasets? In the 5 *.bed result files, about 75,000 5mC sites are detected. Best

ctsa commented 1 year ago

pb-CpG-tools provides a utility to summarize site methylation probabilities. This tool is separate from the process used to call 5mC modifications on individual reads.

The site methylation probabilities can be summarized using either a machine-learning model or a simpler pileup count model (see https://github.com/PacificBiosciences/pb-CpG-tools#output-modes-and-option-details). If you'd like a control for the machine-learning model, you might consider running this tool in count pileup mode to see if that better fits your intuition for these samples.

pailloufat-stack commented 1 year ago

Thanks for your reply. I'll try. I have one extra question: should the model pileup_calling_model.v1.tflite only be used for human datasets? I mean, would this model infer only human 5mC patterns because it was trained on M.SssI-treated and amplified human DNA datasets?

ctsa commented 1 year ago

The generalization of the machine-learning model across species hasn't been systematically evaluated. Given the training scheme, I might expect it to be applicable to something like mouse, but it is not clear that it would transfer to prokaryotes. I would start with the count pileup mode, plus any truth data, and compare these against the model pileup output for this case to assess whether the model can reasonably be applied here.
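As a rough illustration of that comparison, here is a minimal Python sketch that joins the count-mode and model-mode bed outputs by site and summarizes how much the scores differ. It assumes both files are uncompressed, tab-separated, and start with the columns reference name, start, end, and modification score; the function names `load_scores` and `compare_scores` are hypothetical and not part of pb-CpG-tools.

```python
import csv


def load_scores(bed_path):
    """Map (chrom, start) -> modification score.

    Assumption: tab-separated rows whose first four columns are
    chrom, start, end, score. Check this against the README for
    the pb-CpG-tools version in use.
    """
    scores = {}
    with open(bed_path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if not row or row[0].startswith("#"):
                continue  # skip blank lines and header/comment lines
            scores[(row[0], int(row[1]))] = float(row[3])
    return scores


def compare_scores(count_bed, model_bed):
    """Summarize per-site agreement between the two pileup modes."""
    count = load_scores(count_bed)
    model = load_scores(model_bed)
    shared = sorted(count.keys() & model.keys())
    diffs = [abs(count[key] - model[key]) for key in shared]
    return {
        "count_only": len(count.keys() - model.keys()),
        "model_only": len(model.keys() - count.keys()),
        "shared": len(shared),
        "identical": sum(1 for d in diffs if d == 0.0),
        "mean_abs_diff": sum(diffs) / len(diffs) if diffs else 0.0,
    }
```

If `identical` equals `shared` for every dataset, the two runs most likely produced the same file, which is worth ruling out before reading anything into the model's behavior on prokaryotic DNA.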

pailloufat-stack commented 1 year ago

The --pileup-mode count and --model pb-CpG-tools-v2.3.1-x86_64-unknown-linux-gnu/models/pileup_calling_model.v1.tflite arguments give exactly the same results on my 5 datasets.

ctsa commented 1 year ago

Hi, sorry for the delay in getting to this. I did some quick tests to try to reproduce what you describe but haven't had any luck. Can you share a more specific example of the problem?

pailloufat-stack commented 1 year ago

Hi, no worries.

I have 5 prokaryotic datasets, and I am looking for methylation motifs, without bias. I already ran the ipdSummary pipeline to detect the 4mC and 6mA motifs. It went pretty well. For the 5mC motifs, here is what I have done so far:

1 - Run primrose on my CCS reads
2 - Alignment with pbmm2
3 - Run pb-CpG-tools-v2.3.1-x86_64-unknown-linux-gnu/bin/aligned_bam_to_cpg_scores with both arguments, model and count
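For reference, step 3 can be sketched as two separate invocations, one per pileup mode. The Python snippet below only builds and prints the command lines; it does not run them. The tool and model paths and the `--pileup-mode` and `--model` options are taken from this thread. The `--bam` and `--output-prefix` options, the `model` value for `--pileup-mode`, and the input file name are assumptions to check against `aligned_bam_to_cpg_scores --help`. `build_commands` is a hypothetical helper.

```python
import shlex

# Paths as quoted in this thread.
TOOL = "pb-CpG-tools-v2.3.1-x86_64-unknown-linux-gnu/bin/aligned_bam_to_cpg_scores"
MODEL = "pb-CpG-tools-v2.3.1-x86_64-unknown-linux-gnu/models/pileup_calling_model.v1.tflite"


def build_commands(bam, prefix):
    """Return one command per pileup mode, each with its own output prefix.

    Assumption: --bam and --output-prefix are the input/output options;
    verify with the tool's --help before use.
    """
    base = [TOOL, "--bam", bam]
    count_cmd = base + [
        "--output-prefix", f"{prefix}.count",
        "--pileup-mode", "count",
    ]
    model_cmd = base + [
        "--output-prefix", f"{prefix}.model",
        "--pileup-mode", "model",
        "--model", MODEL,
    ]
    return {"count": count_cmd, "model": model_cmd}


# Hypothetical input name, for illustration only.
for mode, cmd in build_commands("sample1.aligned.bam", "sample1").items():
    print(mode, shlex.join(cmd))
```

Using a distinct output prefix per mode keeps the two runs from writing to the same files, so the resulting bed files can be compared side by side.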

I got (I only show the first rows for 1 dataset):

I have one extra question (I am very new to machine learning): can the CNN model, trained on amplified DNA (unmethylated) and M.SssI-treated DNA (methylated), be compared to the in silico control of the ipdSummary pipeline? I mean, do both pipelines "compare" data to an existing model, trained and tested on a previous dataset?

ctsa commented 1 year ago

Thanks @pailloufat-stack, I'm not sure we can comment on ipdSummary in depth here. The core modeling concepts are still relevant but have been updated with a more data-driven approach, given the availability of a large positive training set from M.SssI treatment. I'm not sure whether the original item on this ticket is still an issue, so please open a new ticket if it is.