m6anet queries - Githubissues

Vadim-Fed commented 3 years ago

Hi Chris,

Thanks for releasing the new model for m6anet.

So far we managed to install and run (m6anet-dataprep & m6anet-run_inference) this tool on 4 Nanopore sequenced datasets obtained for 2 WT and 2 ime4-knockout yeast strains (the latter of which should be devoid of any m6AS). The data we are employing is the one derived in this paper by the Maria-Novoa lab. I have 3 queries :

Does this tool allow directly comparing a WT sample to a KO counterpart?
For some reason we do not get a quantification of the prediction quality - The “probability_modified” column at the final output file contains only NA’s for all predicted m6A locations. How can this be resolved?

3.From our preliminary results it seems, surprisingly, that the predicted m6A locations are very similar between the WT and the KO. We obtain a similar number of predicted sites for both groups, with few unique m6A sites in the WT with regard to the KO.

We wondered if changing some of the default arguments may help us perform a more refined prediction. Some of the arguments which might be relevant (but which we haven't yet played around with, given that we could not find documentation on how to best modulate them) could include the --n_neighbors argument (in m6A-net-dataprep) or potentially also --model_config and --model_state_dict in m6anet-run_inference. If you can provide a short explanation for these arguments it will be great.

Thanks,

Vadim

chrishendra93 commented 3 years ago

Hi @Vadim-Fed , thanks for raising the issue.

Currently the only way to compare the two samples together is by checking the probability modified of each sites of interest between the two samples.
The probability modified columns should give the probability that a particular site is modified. If it is NA, I fear that there must be something wrong in the preprocessing steps.
The number of rows do not correspond to the number of modified sites, but rather each site that has minimum read number of 20 will be displayed. The modification status of each site should be reflected by the probability modified column

To fix your problem, can I trouble you with printing the first 5 rows of the eventalign.txt file and perhaps the first five entries from data.json?

thank you!

Rgards

Christopher Hendra

Vadim-Fed commented 3 years ago

Sure, no trouble at all. WT2 eventalign.txt: contig position reference_kmer read_index strand event_index event_level_mean event_stdv event_length model_kmer model_mean model_stdv standardized_level chr01 1220 ATTGT 1 t 1535 110.50 3.360 0.01029 NNNNN 0.00 0.00 inf chr01 1253 CACTT 1 t 1534 112.71 1.968 0.00398 AAGTG 117.29 4.03 -1.03 chr01 1254 ACTTT 1 t 1533 110.15 1.686 0.00398 AAAGT 108.73 4.06 0.32 chr01 1254 ACTTT 1 t 1532 108.27 1.348 0.00232 AAAGT 108.73 4.06 -0.10 chr01 1254 ACTTT 1 t 1531 109.90 1.617 0.00531 AAAGT 108.73 4.06 0.26 chr01 1254 ACTTT 1 t 1530 111.42 1.583 0.00863 AAAGT 108.73 4.06 0.60 chr01 1254 ACTTT 1 t 1529 105.97 1.604 0.00432 AAAGT 108.73 4.06 -0.62 chr01 1254 ACTTT 1 t 1528 112.65 1.532 0.00432 AAAGT 108.73 4.06 0.87 chr01 1254 ACTTT 1 t 1527 110.31 1.712 0.00398 AAAGT 108.73 4.06 0.35

data.json- {"chr01":{"7589":{"TTAACAA":[[NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN]]}}} {"chr01":{"7646":{"TGAACTA":[[NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN]]}}} {"chr01":{"7669":{"GAAACAT":[[NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN]]}}} {"chr01":{"7763":{"AGGACTT":[[NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN]]}}} {"chr01":{"7781":{"CTAACTG":[[NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN]]}}} {"chr01":{"7919":{"TAAACTT":[[NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN]]}}} {"chr01":{"8168":{"GAGACAA":[[NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN]]}}} {"chr01":{"8180":{"AGAACAT":[[NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN]]}}} {"chr01":{"8197":{"TAAACCC":[[NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN]]}}} {"chr01":{"8235":{"CAAACCA":[[NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN]]}}}

Thanks,

Vadim

chrishendra93 commented 3 years ago

Hi @Vadim-Fed , I think I know the problem. In order for m6anet to be able to extract the features effectively, it needs the segmentation information from nanopolish eventalign.txt which is missing from your eventalign.txt. In order to get this information, you have to run nanopolish eventalign with the option --signal-index. So in short, please run nanopolish again with the following command:

nanopolish eventalign --reads reads.fastq --bam alignment.sorted.bam --genome genome.fa --scale-events --signal-index --summary summary.txt --threads n_threads > eventalign.txt

Vadim-Fed commented 3 years ago

Ok, ill try and update...

Thanks a lot!

Vadim-Fed commented 3 years ago

Now it works and we do get the probability column. Thanks!

Just a couple more questions regarding the 3rd query if I may: You responded: "The number of rows do not correspond to the number of modified sites, but rather each site that has minimum read number of 20 will be displayed. The modification status of each site should be reflected by the probability modified column"

I checked and all predicted modified sites (all rows) are centered at the m6a consensus sequence DRACH, which is good. From your experience which probability threshold would you use to decide if the site is indeed mofidied or not?

Also which arguments can I attenuate in m6anet to get more predictions? I already noticed that doing -dataprep with the default n_neighbors is more permissive than n_neighbors set to 10 as i got ~half of total predictions with n_neighbors 10. Which additional arguments would you suggest to play around with and what are their value range?

Thanks,

Vadim

chrishendra93 commented 3 years ago

Hi Vadim, sorry for the late reply

By default m6Anet will only predict modification for DRACH consensus sequence
If you really want to be stringent with the false positives, setting a threshold of 0.8 is good, 0.9 is even better to my experience
One way to increase the number of prediction is perhaps to reduce the number of reads required for each site. Currently I have not implemented this functionality since the model itself is not trained to handle this scenario, I will try to explore this in future releases

thanks!

Vadim-Fed commented 3 years ago

Thanks @chrishendra93 for the reply.

We are using 4 Nanopore sequenced datasets obtained for 2 WT and 2 ime4-knockout yeast strains (the latter of which should be devoid of any m6AS), and when we perform the analysis we get a very similar number of predicted m6A locations between the WT (869 on average) and the KO (877 on average) with few unique WT m6A locations.

It seems using the modification probability parameter for thresholding is not an option (The numbers in brackets correspond to number of predicted locations shared within each group and filtered to overlap with known m6A sites:

m6A prediction

Apart from reducing the number of reads required for each site (is it the --readcount_min parameter? because it states the default is 1), are there any additional arguments we can change so to obtain more reasonable results, or maybe even perform some training for these tool…

chrishendra93 commented 3 years ago

Hi Vadim, thanks for trying this out in this particular dataset. As I have mentioned before, currently the model can only predict with a minimum of 20 reads per site. The --readcount_min parameter for dataprep is the minimum read for each gene not for each segment that will be predicted by m6anet. That being said, there can be a variety of reason for this error, but first let me clarify a couple of things with you

Have you checked if the known sites really match the ones you observe from m6anet prediction files? It might be that the coordinates are misaligned, can you check if the known sites have the DRACH motifs so that we're sure they are aligned?
Can you check if among all the shared sites, m6anet generally have higher predicted probability in wild type compared to KO samples? The thing is from our own experiment we find that m6anet tends to find sites that have elevated level of signal intensity in wild type compared to knock out, and these sites sometimes are missed by other experimental protocols that are used to label these "known" m6a sites
Have you tried running other tols on this dataset? I.e, epinano?
Furthermore, is it possible for you to send me the data.json and data.readcount files so that I can take a look at these?

Meanwhile you can also try out another datasets used in this paper https://www.biorxiv.org/content/10.1101/2020.06.18.160010v1.full.pdf. The dataset provided there have more number of reads and will provide a better overview of m6anet general performance

Vadim-Fed commented 3 years ago

Hi @chrishendra93, thanks for the response. The Known sites are defined as windows with a median of 337 bp across all windows. I defined a predicted m6A coordinate as valid if it was overlapping within a window, and I was quite loose as I enabled even overlap at the edge of the window with the predicted coordinate. Ok so I did the same plot without intersecting with the reference, just looking at shared predicted sites among the 2 datasets of each group: Erase We are currently also trying to run Epinano, I will keep you posted. You can find the files at this link - https://drive.google.com/drive/u/0/folders/1ZVkd_TmdAGccmvB1DRB7V3fAQIGLqoAJ

Thanks for all the help,

Vadim

chrishendra93 commented 3 years ago

Hi @Vadim-Fed , I have requested an access to your google drive to take a look at the data. Regardless can I clarify with you again regarding the known sites that you have classified? Is it possible that you have multiple predicted sites in a single window? It might be the case that there are very few modified positions within that window but by aggregating all those sites in your boxplots you end up with a lot of unmodified probability. Can you check what will the boxplots look if you only take the maximum probability per window?

GoekeLab / m6anet

m6anet queries #9