About the sequencing chemistry in Sequel

ZhangBio commented 2 years ago

Hi! Happy to see a Sequel version of SMALR. The Sequel chemistries seem to be described as "sequencing kit v xxx" rather than by a model name. How should I set "--model" if I only know the version of the sequencing kit? Do you know the relationship between the sequencing kit version and SP2-C2 or SP3-C3?

ZhangBio commented 2 years ago

For example, I know there are Sequel II sequencing kit 1.0 and Sequel II sequencing kit 2.0. They sound different, but could they both use SP3-C3?

GDelevoye commented 2 years ago

Hi,

It was never 100% clear to me.

My understanding is that:

RS II = P6-C4
Sequel I = SP2-C2
Sequel II v1 : No model
Sequel II v2 : SP2-C2 works OK, SP3-C3 works better

Apart from the difference between Sequel II v1 and Sequel II v2, it seems that, for a given sequencer, different sequencing kits can be used interchangeably without switching the in-silico control model.
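The summary above can be written down as a small lookup. This is only a sketch of my current understanding: the platform labels, `MODEL_BY_PLATFORM` and `suggest_model` are hypothetical names made up for illustration, not part of SMSN or of any PacBio tool.

```python
# Hypothetical lookup table: it only restates the summary given in this thread.
MODEL_BY_PLATFORM = {
    "RS II": "P6-C4",
    "Sequel I": "SP2-C2",
    "Sequel II v1": None,      # no suitable model known
    "Sequel II v2": "SP3-C3",  # SP2-C2 reportedly works OK as well
}


def suggest_model(platform):
    """Return the model name suggested in this thread for a platform label."""
    if platform not in MODEL_BY_PLATFORM:
        raise KeyError(f"Unknown platform label: {platform!r}")
    model = MODEL_BY_PLATFORM[platform]
    if model is None:
        # Sequel II chemistry v1: no good in-silico control model is known.
        raise ValueError(f"No suitable model is known for {platform}")
    return model


print(suggest_model("Sequel I"))
```

The returned string is what would be passed to "--model". Failing loudly for Sequel II v1 seems safer than silently falling back to a model that was not trained for that chemistry.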

I opened an issue on the PacBio KineticsTools repository two years ago about a similar subject; see here

There, a PacBio developer says (I quote):

rhallPB commented on 28 Jan 2020

Note also, S2-P2 model has been shown to be effective with Sequel II chemistry version 2.0. We don't have a good model for Sequel II chemistry version 1.0.

This is consistent with the summary I made above.

GDelevoye commented 2 years ago

I never thought anyone would be interested in my software. Please let me know if I can help in any way with your analysis.

ZhangBio commented 2 years ago

Thank you very much for your reply! The kinetics features in the P6-C4, P5-C3 and P4-C2 models provided in SMRT Analysis v2.3 seem to be quite different from one another. It would be much easier if the Sequel chemistries could share the same model. There is so little information on the internet that your reply is really precious.

ZhangBio commented 2 years ago

Actually, I would expect tools like SMALR to get more attention, since methylation heterogeneity in prokaryotes could be very important. This software will definitely be helpful when people want to analyze today's Sequel data!

GDelevoye commented 2 years ago

Yes, information on the subject is indeed very hard to find.

As you say, P6-C4, P5-C3 and P4-C2 are very different, but I presumed (maybe wrongly) that these were just incremental upgrades, with P6-C4 being the "best" (?).

I will do my best to find the right sources in the near future and compile them in the README.

Until then, what I can say with certainty is that the SP2-C2 model worked great on our Sequel I E. coli data.

I never tested it on Sequel II data. If you have SMSN data produced with a Sequel II sequencer that can be used as a benchmark, I would be glad to help.

I'm leaving this issue open for the moment and will close it when I find more information on models versus sequencing kits. Maybe I could even just parse the header to match it automatically; I just have not done it yet because I thought no one else would use it.
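A minimal sketch of what parsing the header could look like, assuming the PacBio BAM convention where the `DS` field of each `@RG` line holds semicolon-separated `KEY=VALUE` entries such as `BINDINGKIT` and `SEQUENCINGKIT`. The function name and the part numbers in the example line are placeholders, and the table that would map part numbers to a model is deliberately not filled in, since that mapping is exactly what is still missing.

```python
def parse_rg_description(rg_line):
    """Extract KEY=VALUE pairs from the DS field of a SAM @RG header line."""
    for field in rg_line.rstrip("\n").split("\t"):
        if field.startswith("DS:"):
            items = field[3:].split(";")
            # Keep only well-formed KEY=VALUE entries.
            pairs = (item.split("=", 1) for item in items if "=" in item)
            return {key: value for key, value in pairs}
    return {}  # no DS field on this line


# Example @RG line with placeholder (not real) kit part numbers.
example = (
    "@RG\tID:example\tPL:PACBIO\t"
    "DS:READTYPE=SUBREAD;BINDINGKIT=000-000-000;SEQUENCINGKIT=111-111-111\t"
    "PU:example_movie"
)
info = parse_rg_description(example)
print(info.get("BINDINGKIT"), info.get("SEQUENCINGKIT"))
```

The header text itself can be obtained with `samtools view -H` on the subreads BAM; once the kit part numbers are known, they could be looked up in a table to pick the model automatically.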

You can also have a look at this repo, where I did some reverse engineering of the in silico control:

https://github.com/EMeyerLab/ipdtools

I did this at a time when SP3-C3 was not yet released, and before the model formats changed, but I'm reasonably confident that the repo is still valid as of June 2022.

ZhangBio commented 2 years ago

One method is to compare "tMean" and "modelPrediction" on WGA data: if the two values are well correlated, the correct model is being used. Maybe I'll try this later when I have time. But a recent paper shows that the correlation between observed IPD and predicted IPD in WGA data is not good, and I don't know what is wrong. https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-022-08471-2
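A rough sketch of that check, assuming the ipdSummary CSV output has "tMean", "modelPrediction" and "coverage" columns. The function names and the `min_coverage` filter are hypothetical choices for illustration, and the tiny table at the end is invented just to show the call.

```python
import csv
import io
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def tmean_vs_prediction(csv_handle, min_coverage=0):
    """Correlate observed tMean with modelPrediction over the CSV rows."""
    observed, predicted = [], []
    for row in csv.DictReader(csv_handle):
        if int(row["coverage"]) < min_coverage:
            continue  # skip poorly covered positions
        observed.append(float(row["tMean"]))
        predicted.append(float(row["modelPrediction"]))
    return pearson(observed, predicted)


# Invented rows, only to show how the function is called.
demo = io.StringIO(
    "tMean,modelPrediction,coverage\n"
    "0.8,0.7,30\n"
    "1.5,1.2,28\n"
    "2.1,1.9,35\n"
)
print(tmean_vs_prediction(demo, min_coverage=25))
```

Note that a correlation coefficient is insensitive to a constant offset or scale factor, so a high value would not rule out a systematic shift between observed and predicted IPDs.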

GDelevoye commented 1 year ago

Hi, sorry for the late answer.

In my experience there is a systematic bias between the observed IPDs and the modelPrediction which, as you mention, kind of prevents this kind of check...

My own verification on my own data was that, using the SP2-C2 model on my E. coli data, almost all the DNA modifications that I could detect were located in either GATC or EcoK sites, which are indeed known for their abundance of 6mA. But I do not have access to more recent SMSN Sequel II data. I would be glad if someone could provide me with some.

I have a bit of time to take care of this issue at the moment... Are you still interested in using the software? Do you have any data that may help?
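This kind of sanity check can be sketched as follows, here for GATC only. The function names are hypothetical, positions are taken as 0-based coordinates on the reference, and the toy sequence is invented. GATC is palindromic, so scanning the forward strand covers both strands; a non-palindromic motif such as the EcoK site would also need its reverse complement.

```python
def motif_positions(sequence, motif):
    """Return the set of 0-based positions covered by any motif occurrence."""
    covered = set()
    start = sequence.find(motif)
    while start != -1:
        covered.update(range(start, start + len(motif)))
        start = sequence.find(motif, start + 1)  # allow overlapping hits
    return covered


def fraction_in_motif(sequence, modified_positions, motif="GATC"):
    """Fraction of detected modifications that fall inside a motif occurrence."""
    positions = list(modified_positions)
    if not positions:
        raise ValueError("no modified positions given")
    covered = motif_positions(sequence.upper(), motif.upper())
    hits = sum(1 for position in positions if position in covered)
    return hits / len(positions)


# Toy reference with two GATC sites; positions 3 and 9 are the A of each site.
toy_reference = "TTGATCAAGATCTT"
print(fraction_in_motif(toy_reference, [3, 9, 0]))
```

A fraction close to 1 on a well-characterized genome suggests the chosen model gives sensible calls; it does not prove that the model is the best one available.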

Guillaume

GDelevoye commented 1 year ago

Perhaps this repo, which I made a few years ago, could help to test your suggested solution.

ZhangBio commented 1 year ago

Sorry for the late reply. I used some artificially prepared data from previous studies, where the "positive" samples are treated with an MTase and the corresponding controls are WGA data. But "A" sites in the MTase-treated group do not always seem to be predicted as methylated. I can't tell whether the problem is ipdsummary or the efficiency of the enzyme treatment. https://www.ncbi.nlm.nih.gov/sra/SRX12017172[accn] https://www.ncbi.nlm.nih.gov/sra/SRX9611878[accn]
