GoekeLab / m6anet

Detection of m6A from direct RNA-Seq data
https://m6anet.readthedocs.io/
MIT License

Supplementary Table 5 #51

Closed kwonej0617 closed 1 year ago

kwonej0617 commented 1 year ago

Dear developer,

Hi, I am running the m6anet tools to detect m6A modification sites. After generating the output, I want to compare my results with your data to make sure I ran everything correctly. In the ENA database I saw that there are three replicates each of the HEK293T WT and KO samples; however, Supplementary Table 5 only has probability_modified_WT and probability_modified_KO columns. To produce Supplementary Table 5, did you first run `m6anet-dataprep` on all six samples (three replicates per condition) and then combine the three replicates of each condition (WT and KO) with `m6anet-run_inference --input_dir demo_data_1 demo_data_2 ... --out_dir demo_data --infer_mod_rate --n_processes 4`?
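(For later readers: assuming the v1-style m6anet CLI described above, a per-condition run combining three replicates would look roughly like the sketch below. The `wt_rep*` paths are placeholders, not the actual data paths, and the exact flags may differ between m6anet versions, so check `m6anet-dataprep --help` on your install.)

```shell
# Hypothetical sketch, v1-style m6anet CLI; wt_rep*/ paths are placeholders.
# Step 1: run dataprep once per replicate (each needs its own nanopolish
# eventalign output).
m6anet-dataprep --eventalign wt_rep1/eventalign.txt --out_dir wt_rep1_dataprep --n_processes 4
m6anet-dataprep --eventalign wt_rep2/eventalign.txt --out_dir wt_rep2_dataprep --n_processes 4
m6anet-dataprep --eventalign wt_rep3/eventalign.txt --out_dir wt_rep3_dataprep --n_processes 4

# Step 2: a single inference run over all three dataprep outputs.
m6anet-run_inference \
    --input_dir wt_rep1_dataprep wt_rep2_dataprep wt_rep3_dataprep \
    --out_dir wt_combined \
    --infer_mod_rate \
    --n_processes 4
```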

Also, when I ran m6Anet on the demo data, I got a different mod_ratio in each run even though I did not change the command line. Is it expected to get different mod_ratio values across runs?

I am looking forward to hearing from you. Thank you for your help!

chrishendra93 commented 1 year ago

hi @kwonej0617 ,

The inference in Supplementary Table 5 was run on only one of the replicates (I think it was WT replicate 1 in the ENA database). Regarding the mod_ratio, you should not be getting different values across runs unless you are using a different dataset. Let me check on that.

kwonej0617 commented 1 year ago

Thanks for your reply!

kwonej0617 commented 1 year ago

This shows the first few lines of the Supplementary Table 5 data: [screenshot]

I was wondering if you ran m6anet separately on the HEK293-rep1-WT and HEK293-rep1-KO data and then combined the results by the overlapping positions in the transcripts. (I actually generated my results with this approach, but I got almost double the number of m6A sites compared to your data.) Also, several probability_modified_KO values are very high (~0.9 or higher), even though they should not be, given the absence of m6A sites in the KO. Do you know what the reason might be?

Thank you!

chrishendra93 commented 1 year ago

In the supplementary table in the manuscript, I averaged the probabilities of all positions that fall in the same genomic position, partly because the labels that we use to validate the results are at the genome level. I also observed some positions in the KO datasets with very high probability_modified. These might be genuine m6A sites that are not dependent on METTL3, and yes, perhaps some are errors as well.
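(For later readers: the per-genomic-position averaging described here can be sketched with a short awk command. The CSV layout below, with columns gene_id, genomic_position, and probability_modified, is a made-up stand-in for the actual table format, not the real column names.)

```shell
# Hypothetical input: predictions.csv with header
#   gene_id,genomic_position,probability_modified
# One row per transcript-level prediction; several rows can share a genomic
# position. Average probability_modified per (gene, genomic position):
awk -F',' 'NR > 1 {
    key = $1 "," $2          # group key: gene_id + genomic_position
    sum[key] += $3           # accumulate probabilities per group
    n[key]++                 # count rows per group
} END {
    for (k in sum) printf "%s,%.4f\n", k, sum[k] / n[k]
}' predictions.csv
```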

kwonej0617 commented 1 year ago

@chrishendra93 Thank you for your answer!

[screenshot] To me, it looks like there is more than one m6A site at the same genomic position (ENSG00000184009 / 81511529). It looks like you listed the m6A sites that fall at the same transcript positions, not the same genomic position. If you had averaged the sites that fall in the same genomic position, I think there should be only one row per genomic position in the data. If I have misunderstood, please correct me. Thank you!!

Also, I was wondering whether you ran m6anet on all three replicates (reps 1, 2, and 3) of each condition (WT and KO) and then averaged the probabilities within each condition by transcript position to get the result.

Also, you mentioned nf-core/nanoseq (https://doi.org/10.5281/zenodo.3697960) in your paper, but I am not quite sure how you used it. Did you use nf-core to run guppy and minimap2 on the raw fast5 data? Could you explain this in more detail?

Thank you so much!

chrishendra93 commented 1 year ago

hi @kwonej0617, my bad, you understood it correctly: Supplementary Table 5 contains all predictions at the transcriptomic level, and the averaging happens later during evaluation. Regardless, we might be using different versions of the software, which would result in a different number of supporting reads for some of the positions mentioned.

I did not run m6Anet on all three replicates in the paper, but you can technically do that and average the probabilities to get more accurate results (or, with the latest version of m6Anet, you can even pool the reads from all three replicates). The nanoseq pipeline contains the annotations we used for basecalling and minimap2 alignment.
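(For later readers: in newer m6anet releases the CLI moved to subcommands, and read pooling across replicates is done by passing several dataprep output directories to a single inference run. The sketch below assumes that subcommand-style CLI and placeholder directory names; check `m6anet inference --help` on your version for the exact flags.)

```shell
# Hypothetical sketch, newer subcommand-style m6anet CLI; rep*_dataprep/
# are placeholder dataprep output directories. Passing all three input
# directories pools the reads from the replicates into one inference run.
m6anet inference \
    --input_dir wt_rep1_dataprep wt_rep2_dataprep wt_rep3_dataprep \
    --out_dir wt_pooled \
    --n_processes 4
```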

kwonej0617 commented 1 year ago

@chrishendra93 I really appreciate your reply! Could you tell me which version of m6anet you used to generate the data? Also, how can I check which version of m6anet I have installed?

Thank you so much for your help!!

chrishendra93 commented 1 year ago

hi @kwonej0617, I think the version of m6anet used to generate the data was the pre-release version (the binary is available here). There might be slight differences in the number of sites because the pre-release is slightly more stringent in filtering positions, but it should not make a large difference. The model weights are the same, though, so the predictions should be more or less the same (with slight differences due to sampling).

Checking the version depends on how you installed m6anet. If you installed it from GitHub, you can check the version in setup.py.
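(For a pip-based install, there are also generic ways to check the installed version; these are standard pip/Python mechanisms, not anything m6anet-specific.)

```shell
# If m6anet was installed with pip (or pip inside a conda env),
# pip can report the installed version directly:
pip show m6anet | grep -i '^Version'

# Or query Python's installed-package metadata (Python 3.8+):
python3 -c "from importlib.metadata import version; print(version('m6anet'))"
```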

kwonej0617 commented 1 year ago

Thanks for your reply. @chrishendra93

Is there a way to apply the more stringent filtering used in the pre-release version with the current version, v1.1.0?

Thanks for your help.

chrishendra93 commented 1 year ago

hi @kwonej0617, you can just download the binary for the pre-release version. Otherwise, I don't think this makes much difference; I suspect the guppy (basecaller) version and nanopolish version you run make the biggest difference in the preprocessed results.

kwonej0617 commented 1 year ago

Hi @chrishendra93, could you please tell me which version of guppy you used to generate your supplementary data?

Thank you!