PengNi / ccsmeth

Detecting DNA methylation from PacBio CCS reads
BSD 3-Clause Clear License

model for CHG or CHH with HiFi reads #24

Closed baozg closed 1 year ago

baozg commented 2 years ago

Hi, @PengNi

Thanks for the great tool! It runs very smoothly. Can we also predict CHG or CHH methylation from HiFi reads, for example by training a model the same way as the 5mCpG model? This is very important for plant researchers. Is there some limitation in the subreads, or do we just need ground truth, such as WGBS or ONT data?

PengNi commented 2 years ago

@baozg, thanks for using ccsmeth. Honestly, right now I don't think CCS data from the Sequel II kit 2.0 will perform well for non-CpG methylation detection, though I haven't tested that.

baozg commented 2 years ago

Why do you think it will perform poorly with Sequel II Kit 2.0? Is there any library or sequencing bias for non-CpG methylation?

I don't know much about how kinetics-based detection works. If I want to detect non-CpG methylation, would ONT be a better choice? DeepSignal-plant seems to work pretty well on non-CpG methylation.

PengNi commented 2 years ago

The signals in CCS reads for 5mC detection are subtle. For CpG, we have to exploit its symmetric methylation and combine the signals from the forward and reverse strands for higher accuracy. As for non-CpG sites, I don't think they have symmetric patterns, though I am not sure. However, I don't have CCS plant data now, so I haven't done any tests.

baozg commented 2 years ago

CHG should have symmetric patterns, but CHH may not. We do have Arabidopsis thaliana HiFi data. The basic idea is to use WGBS methylation sites for training. Can ccsmeth's training be opened up for CHG sites?
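To make the symmetry point concrete, here is a minimal sketch (not part of ccsmeth; the function names are hypothetical) that classifies a cytosine as CG, CHG, or CHH and reports the context of the cytosine paired with the G on the opposite strand. Under the usual definition H = A, C, or T, CAG and CTG pair with another CHG cytosine, CCG pairs with a CG-context cytosine, and CHH has no paired cytosine inside the trinucleotide.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def cytosine_context(seq, pos):
    """Classify the cytosine at seq[pos] as 'CG', 'CHG', or 'CHH'.

    Returns None if seq[pos] is not C, or if the downstream bases are
    missing or ambiguous (for example N).
    """
    seq = seq.upper()
    if seq[pos] != "C" or pos + 1 >= len(seq):
        return None
    if seq[pos + 1] == "G":
        return "CG"
    if seq[pos + 1] not in "ACT" or pos + 2 >= len(seq):
        return None
    if seq[pos + 2] == "G":
        return "CHG"
    if seq[pos + 2] in "ACT":
        return "CHH"
    return None


def opposite_strand_context(seq, pos):
    """Context of the cytosine on the opposite strand that pairs with the
    G of a CG or CHG site starting at seq[pos]; None for CHH or non-C."""
    context = cytosine_context(seq, pos)
    if context == "CG":
        g_pos = pos + 1
    elif context == "CHG":
        g_pos = pos + 2
    else:
        return None
    rc = reverse_complement(seq)
    # Forward index i maps to index len(seq) - 1 - i on the reverse complement.
    return cytosine_context(rc, len(seq) - 1 - g_pos)
```

For example, `opposite_strand_context("ACAGT", 1)` looks at a CAG site and finds a CTG (CHG) partner, while the same call on `"ACCGT"` finds that the partner of the CCG cytosine sits in a CG context.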

PengNi commented 2 years ago

I am not sure whether the current version of ccsmeth is suited to CHG methylation. It may be worth a try.

gaushi commented 2 years ago

Hi @PengNi, with Arabidopsis I am getting a best accuracy of ~0.92 for CHG using default parameters. I used bisulfite data as my ground truth. Do you think I can improve this further by tuning the parameters? Your help is much appreciated. Thanks! Best,

PengNi commented 2 years ago

Hi @gaushi, I am very surprised that you can get 0.92 accuracy on 5mCHG detection. How did you calculate the accuracy? Did you evaluate at the read level or at the genome-site level?
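The difference between read-level and genome-site-level evaluation can be sketched as follows. This is a simplified illustration, not ccsmeth's evaluation code: the input layout (a list of `(site, predicted_label)` pairs, one per read) and the majority-vote threshold are assumptions, and real site-level evaluations often compare methylation frequencies rather than binary calls.

```python
from collections import defaultdict


def read_level_accuracy(calls, truth):
    """Fraction of per-read calls that match the ground-truth label.

    calls: list of (site, predicted_label) pairs, one per read (labels 0/1).
    truth: dict mapping site -> ground-truth label (0/1).
    """
    correct = sum(1 for site, pred in calls if pred == truth[site])
    return correct / len(calls)


def site_level_accuracy(calls, truth, threshold=0.5):
    """Aggregate the per-read calls at each site, then score each site once."""
    by_site = defaultdict(list)
    for site, pred in calls:
        by_site[site].append(pred)
    correct = 0
    for site, preds in by_site.items():
        methylation_freq = sum(preds) / len(preds)
        site_call = 1 if methylation_freq >= threshold else 0
        correct += int(site_call == truth[site])
    return correct / len(by_site)


# Hypothetical toy data: two sites, three reads each.
calls = [
    ("chr1:10", 1), ("chr1:10", 1), ("chr1:10", 0),
    ("chr1:20", 0), ("chr1:20", 0), ("chr1:20", 1),
]
truth = {"chr1:10": 1, "chr1:20": 0}
```

On this toy data, four of the six per-read calls are correct, but both sites are called correctly after aggregation, which is why the two levels should not be compared directly.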

As for improving the training performance, I'd suggest taking more care with the training data (such as balancing it). Different k-mer lengths may also be worth trying.
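One common way to balance a training set is to downsample the larger class. The sketch below is a generic illustration and not a ccsmeth utility; the sample layout (`(features, label)` tuples) and the function name are assumptions.

```python
import random


def balance_by_downsampling(samples, seed=0):
    """Return a shuffled list with equal numbers of label-1 and label-0 samples.

    samples: list of (features, label) tuples with label 0 or 1.
    The larger class is randomly downsampled to the size of the smaller one.
    """
    positives = [s for s in samples if s[1] == 1]
    negatives = [s for s in samples if s[1] == 0]
    n = min(len(positives), len(negatives))
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    balanced = rng.sample(positives, n) + rng.sample(negatives, n)
    rng.shuffle(balanced)
    return balanced
```

Downsampling discards data, so with a strongly skewed set it may be worth comparing against class weighting in the loss instead.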

gaushi commented 2 years ago

Hi @PengNi, I did not quite get your question. The following is part of the output.

Epoch [11/50], Step [4254/4254]; TrainLoss: 0.2267; ValidLoss: 0.2134, Acc: 0.9137, Prec: 0.9156, Reca: 0.9087, CurrE_best_acc: 0.9157, Best_acc: 0.9162; Time: 70.95s
early stop!
[main]train costs 12423.677060365677 seconds, best accuracy: 0.9162272727272728 (epoch 10)
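For readers unfamiliar with the columns, Acc, Prec, and Reca in the log above presumably stand for accuracy, precision, and recall on the validation samples. The sketch below shows the standard definitions for binary labels; it is an assumption that ccsmeth computes them this way, and the function name is hypothetical.

```python
def binary_metrics(y_true, y_pred):
    """Return (accuracy, precision, recall) for binary labels (0/1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or absent.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall
```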

I have balanced the dataset as you suggested on DeepSignal-plant. I had 2.2 million features each for the methylated and unmethylated datasets (sites with 100% support at a minimum of 8x coverage). I extracted features with different k-mer lengths (15 and 25), but it seems that 21 (the default) performs best so far. Thanks, Best,
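The site selection described above could look roughly like the following sketch. It assumes that "100% support" means every bisulfite read covering a site agrees (all methylated or all unmethylated); the record layout `(chrom, pos, strand, n_methylated, coverage)` and the function name are hypothetical and not taken from any specific bisulfite caller.

```python
def select_ground_truth(records, min_coverage=8):
    """Split bisulfite sites into fully methylated and fully unmethylated sets.

    records: iterable of (chrom, pos, strand, n_methylated, coverage).
    Sites below min_coverage, or with mixed support, are discarded.
    """
    methylated, unmethylated = [], []
    for chrom, pos, strand, n_methylated, coverage in records:
        if coverage < min_coverage:
            continue
        if n_methylated == coverage:
            methylated.append((chrom, pos, strand))
        elif n_methylated == 0:
            unmethylated.append((chrom, pos, strand))
    return methylated, unmethylated


# Hypothetical toy records.
records = [
    ("Chr1", 100, "+", 10, 10),  # kept as methylated
    ("Chr1", 200, "-", 0, 12),   # kept as unmethylated
    ("Chr1", 300, "+", 5, 10),   # mixed support, dropped
    ("Chr1", 400, "+", 6, 6),    # coverage below 8, dropped
]
```

Keeping only unanimous sites gives clean labels but may bias the training set toward easy cases, so accuracy on intermediately methylated sites could differ.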

PengNi commented 2 years ago

Thanks @gaushi, now I see. A validation-set accuracy of 0.916 during training seems pretty high for CHG using CCS reads. For human CpG, I can only get something like 0.87 accuracy during training for now. As long as the model is not overfitted (it may need testing on more independent data), I think this performance is good enough.

Best, Peng

gaushi commented 2 years ago

@PengNi Thanks for the quick reply. I tested with another accession (ground truth: bisulfite) and I get ~0.9 accuracy on the validation set for CHG. I will cross-check the overall accuracy of the two separate models (one from each accession). Best, Gautam

AlineMuyle commented 2 years ago

The results on Arabidopsis CHG are super promising! Peng Ni, could you include a model for joint CG and CHG inferences please?

PengNi commented 2 years ago

Hi @AlineMuyle, thank you very much for your interest in ccsmeth. A model for joint CG and CHG inference is definitely an awesome idea. Unfortunately, I don't have any plant CCS data on hand right now. @gaushi, would it be possible for you to share some of your CCS data? You can email me at nipeng@csu.edu.cn for further discussion if you would like to collaborate. Thank you very much!

Best, Peng

PengNi commented 1 year ago

See #32 .