Closed baozg closed 1 year ago
@baozg, thanks for using ccsmeth. Honestly, right now I don't think CCS data from Sequel II kit 2.0 will perform well for non-CpG methylation detection, though I haven't tested that.
Why do you think it will perform poorly with Sequel II Kit 2.0? Is there any library or sequencing bias for non-CpG methylation?
I don't know much about the details of kinetics-based detection. If I want to detect non-CpG methylation, would ONT be a better choice? DeepSignal-plant seems to perform pretty well on non-CpG methylation.
The signals in CCS reads for 5mC detection are subtle. For CpG, we have to exploit its symmetric methylation and combine the signals from the forward and reverse strands for higher accuracy. As for non-CpG, I don't think it has symmetric patterns (though I don't know for sure). However, I don't have CCS plant data now, so I haven't done any tests.
CHG should have symmetric patterns, but CHH may not. We do have Arabidopsis thaliana HiFi data. The basic idea is to use WGBS methylation sites for training. Can ccsmeth train be opened up for training on CHG sites?
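To make the symmetry point concrete, here is a minimal sketch (not part of ccsmeth; the function names are hypothetical) that classifies a cytosine as CG, CHG, or CHH and looks up the context of the cytosine on the opposite strand that pairs with the G of a forward-strand CHG. It assumes plain A/C/G/T sequence with no ambiguity codes.

```python
# Hypothetical helpers for illustration only; not ccsmeth functions.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of an A/C/G/T sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def cytosine_context(seq, i):
    """Return 'CG', 'CHG', or 'CHH' for the cytosine at index i (5'->3').

    Returns None if seq[i] is not C or the context runs off the end.
    """
    if i < 0 or i >= len(seq) or seq[i] != "C":
        return None
    nxt = seq[i + 1:i + 3]
    if len(nxt) >= 1 and nxt[0] == "G":
        return "CG"
    if len(nxt) < 2:
        return None
    return "CHG" if nxt[1] == "G" else "CHH"


def opposite_strand_context(seq, i):
    """For a forward-strand CHG whose C is at index i, return the context of
    the reverse-strand cytosine that pairs with the G at index i + 2."""
    j = len(seq) - 1 - (i + 2)  # same position, in reverse-strand coordinates
    return cytosine_context(revcomp(seq), j)


print(cytosine_context("TTCAGTT", 2), opposite_strand_context("TTCAGTT", 2))
print(cytosine_context("TTCCGTT", 2), opposite_strand_context("TTCCGTT", 2))
```

For CAG (and likewise CTG) the paired cytosine on the other strand is also in a CHG context, so the two strands can be paired. For CCG the paired cytosine sits in a CG context instead, so strand pairing for CHG would need to treat that case separately.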
I am not sure whether the current version of ccsmeth is suitable for CHG methylation. Maybe it is worth a try.
Hi @PengNi, with Arabidopsis I am getting a best accuracy of ~0.92 for CHG using default parameters. I used bisulfite data as my ground truth. Do you think I can improve this further by tuning the parameters? Your help is much appreciated. Thanks! Best,
Hi @gaushi, I am very surprised that you can get 0.92 accuracy on 5mCHG detection. How did you calculate the accuracy? Did you evaluate at the read level or at the genome site level?
As for improving the training performance, I'd suggest taking more care with the training data (such as balancing). Different k-mer lengths may also be worth trying.
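The read-level versus site-level distinction can change the reported number quite a bit. The sketch below is illustrative only: the input format, the function names, and the 0.5 frequency threshold are assumptions, and the ground truth is simplified to fully methylated (1) or fully unmethylated (0) sites.

```python
from collections import defaultdict


def read_level_accuracy(calls, truth):
    """calls: list of (site, predicted_label) pairs, one per read.
    truth: dict mapping site -> 0 or 1."""
    correct = sum(1 for site, pred in calls if pred == truth[site])
    return correct / len(calls)


def site_level_accuracy(calls, truth, threshold=0.5):
    """Aggregate per-read calls into a per-site methylation frequency,
    binarize it at `threshold` (an assumed cutoff), and score per site."""
    per_site = defaultdict(list)
    for site, pred in calls:
        per_site[site].append(pred)
    correct = 0
    for site, preds in per_site.items():
        freq = sum(preds) / len(preds)
        site_call = 1 if freq >= threshold else 0
        correct += int(site_call == truth[site])
    return correct / len(per_site)


# Toy data: two sites, three reads each, one wrong read per site.
calls = [("chr1:10", 1), ("chr1:10", 1), ("chr1:10", 0),
         ("chr1:20", 0), ("chr1:20", 0), ("chr1:20", 1)]
truth = {"chr1:10": 1, "chr1:20": 0}

print(read_level_accuracy(calls, truth))  # 4 of 6 reads correct
print(site_level_accuracy(calls, truth))  # both sites correct after aggregation
```

Aggregating over reads averages out individual errors, which is why a site-level figure is usually higher than the read-level figure on the same calls.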
Hi @PengNi, I did not quite get your question. The following is part of the output.
Epoch [11/50], Step [4254/4254]; TrainLoss: 0.2267; ValidLoss: 0.2134, Acc: 0.9137, Prec: 0.9156, Reca: 0.9087, CurrE_best_acc: 0.9157, Best_acc: 0.9162; Time: 70.95s
early stop!
[main]train costs 12423.677060365677 seconds, best accuracy: 0.9162272727272728 (epoch 10)
I have balanced the dataset as you suggested on DeepSignal-plant. I had 2.2 million features each for the methylated and unmethylated datasets (with 100% support at a minimum of 8x coverage). I extracted features with different k-mer lengths (15, 25), but it seems that 21 (the default) performs best so far. Thanks, Best,
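A minimal sketch of the site selection and balancing described above, under stated assumptions: the bisulfite calls are given as (site, coverage, methylated read count) tuples, "100% support" is read as every covering read agreeing (all methylated or all unmethylated), and the classes are balanced by random downsampling. The function names are hypothetical and this is not the ccsmeth workflow itself.

```python
import random


def select_confident_sites(sites, min_cov=8):
    """Split bisulfite sites into fully methylated and fully unmethylated.

    sites: iterable of (site_id, coverage, methylated_count).
    Sites below min_cov or with mixed support are dropped.
    """
    methylated, unmethylated = [], []
    for site_id, cov, meth in sites:
        if cov < min_cov:
            continue
        if meth == cov:
            methylated.append(site_id)
        elif meth == 0:
            unmethylated.append(site_id)
    return methylated, unmethylated


def balance(pos, neg, seed=0):
    """Downsample the larger class so both classes have equal size."""
    rng = random.Random(seed)
    n = min(len(pos), len(neg))
    return rng.sample(pos, n), rng.sample(neg, n)


# Toy bisulfite table: (site, coverage, methylated reads).
sites = [("s1", 10, 10), ("s2", 12, 12), ("s3", 9, 9),
         ("s4", 8, 0), ("s5", 20, 0),
         ("s6", 5, 5),    # coverage too low
         ("s7", 15, 7)]   # mixed support
pos, neg = select_confident_sites(sites)
pos_bal, neg_bal = balance(pos, neg)
print(len(pos), len(neg), len(pos_bal), len(neg_bal))
```

Downsampling is only one way to balance. Balancing on features extracted from reads rather than on sites may give slightly different class sizes, since coverage varies between sites.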
Thanks @gaushi, now I see. An accuracy of 0.916 on the validation set during training seems pretty high for CHG using CCS reads. For human CpG, I can only get something like 0.87 accuracy during training for now. As long as the model is not overfitted (which needs testing on more independent data), I think this performance is good enough.
Best, Peng
@PengNi Thanks for the quick reply. I tested with another accession (ground truth: bisulfite) and I get ~0.9 accuracy on the validation set for CHG. I will cross-check the overall accuracy using the two separate models (from the two accessions). Best, Gautam
The results on Arabidopsis CHG are super promising! Peng Ni, could you include a model for joint CG and CHG inferences please?
Hi @AlineMuyle, thank you very much for your interest in ccsmeth. A model for joint CG and CHG inference is definitely an awesome idea. Unfortunately, I don't have any plant CCS data on hand right now. @gaushi, would it be possible for you to share some of your CCS data? You can email me at nipeng@csu.edu.cn for further discussion if you want to cooperate. Thank you very much!
Best, Peng
See #32 .
Hi, @PengNi
Thanks for the great tool!! It is running very smoothly. But can we predict the CHG or CHH signal with HiFi reads, in the same way as training a 5mCpG model? It's very important for plant people. Are there some limitations in the subreads? Or do we just need some ground truth, such as WGBS or ONT?