What are these kmers? - Githubissues

gurbonaite commented 1 year ago

I have been using m6Anet a lot and I get out that table with position, prob, kmer, etc. I have recently pulled out information on just one transcript (randomly chosen) and when I look at the positions in that transcript that are methylated they do not correspond to the kmer that is output by the table. I have done this for multiple transcripts. Additionally, if I sum up all the kmers over one transcript and get a count on how often each one appears, this does not match up with the number of times the kmer actually appears in the gene/transcript. Am I missing something?

chrishendra93 commented 1 year ago

hi @gurbonaite, thanks for using m6Anet and posting this issue. Can I check with you how you perform the comparison because I am not sure I am getting your question correctly? So currently the kmer that you observe in the table is assigned by nanopolish, and so I am guessing there is an additional preprocessing done on top of the bam files to re-assign some of the positions found in the bam files (based on my personal experience it seems that nanopolish eventalign can both chop / lengthen the positions covered by each read).

gurbonaite commented 1 year ago

Chris, thanks for responding. I have since figured out the issue and it was something on our end and not your end. I do have a kmer follow-up question: is there some inherent bias for a kmer to come up as modified? I sequenced in vitro transcribed RNA and something that was 0% modified was called as modified just as often as something that was 50% modified. The specific kmer I am worried about is GAACT.

chrishendra93 commented 1 year ago

hi @gurbonaite, good to hear that you have figured out the issue on your end. As for the follow-up question, can I check with you how you quantify the modified sites and which IVT RNA datasets are you using? Currently, m6Anet provides the probability of modification as well as our estimate of modification stoichiometry, and based on my experience with the curlcake datasets from Liu et al (the EpiNano paper), our model seems to be able to distinguish the IVT RNA rather well

gurbonaite commented 1 year ago

@chrishendra93, the IVT RNA that we are using I generated in-house, so it is not a publicly available dataset. I get the output of m6ANET and then I take the mean probability modified of every kmer over the entire set. It appears that the 0% m6A and the 50% m6A sets BOTH have very high levels of probability modified average for the kmer GAACT. More specifically, the average of the prob modified for that kmer for both the 0% and 50% is over 0.8. For 50% it could make sense, but for 0% it does not make sense.

chrishendra93 commented 1 year ago

@gurbonaite I see that seems interesting. What about the m6a stoichiometry for those positions? Do you observe this just for GAACT? How about the other 5-mer motifs? So far we have not observed this behavior with IVT data from the EpiNano paper as well as the spike-in transcripts that our lab generates (which do not contain m6A modifications). What is the average length of your IVT transcripts, and do you think their signal values are comparable to typical GAACT from wild type data?

gurbonaite commented 1 year ago

My transcripts have a mean length of around 850 bp, but the longest transcripts were about 3 kb. The m6A stoichiometry of these positions is pretty much equal between the 0 and 50% IVT samples, and range from 0.2 to 0.7. GAACT is the most prevalent abnormality but this can also be seen for AGACC. The signal values are not comparable to biological data. There does not seem to be a skew like this in biological data (which I also ran in-house)

chrishendra93 commented 1 year ago

I'm afraid that if the data is not really comparable to biological data m6Anet might produce spurious predictions since it was trained on raw signals from human cell lines. Also, it is hard to tell right now from my end since I do not have access to these datasets, but will be more than happy to take a look closer into it should the dataset is available. If you have IVT samples and modified samples maybe you can try to run xPore as well to see if there are differences in the signal distributions between your IVT samples and the modified samples? Usually if xPore can detect differences in the signal values, m6Anet will likely predict such positions to be modified as well

GoekeLab / m6anet

What are these kmers? #50