GoekeLab / m6anet

Detection of m6A from direct RNA-Seq data
https://m6anet.readthedocs.io/
MIT License

data.result & biological replicates #25

Closed YCCHEN23 closed 2 years ago

YCCHEN23 commented 2 years ago

Hi @chrishendra93

I ran the latest version of m6anet on all my samples, and the results are very useful for designing further experiments and for comparing against other software.

However, I have some questions:

  1. minimum read count threshold (in m6anet-run_inference):

We sequenced 6 Nanopore DRS libraries and obtained 2 to 2.5 million aligned reads per library. The median number of aligned reads per gene is about 25 in our samples. So, in our case, about half of the expressed genes would be excluded from the final results under the default criteria, simply because they have fewer than 20 aligned reads. This could bias the interpretation of transcriptome-wide m6A sites, since only sites in abundant genes can pass the threshold. (The problem might be solved by improved throughput in the future.)

Further, the number of aligned reads depends heavily on library throughput. Suppose gene_A1 has 21 reads in Replicate.1, 18 reads in Replicate.2, and 19 reads in Replicate.3. Only the sites in Replicate.1 would pass the threshold, while all the m6A sites in Replicate.2 and Replicate.3 would be lost from the final results. (We have encountered this issue for some critical genes.)

I have read issue #13 and understand that this setting is hard to change because the model was trained with a minimum read count threshold of 20. So, is it possible (and is it appropriate?) to take all the biological replicates into account at the same time? (e.g., all reads in gene_A1 = 21 + 18 + 19, then use these 58 reads for the analysis)

  2. DRACH motif

In mammals, DRACH is the most conserved consensus sequence for m6A sites; in plants, however, the "RRACH" motif is reported to be the most common.

So it would be great for plant biologists (like me) if there were a column recording the motif type (GGACA, AAACT, etc.) in data.result.csv.
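
To make the motif request concrete, here is a minimal Python sketch of how a 5-mer could be labelled as DRACH or RRACH. It is not m6anet code: the function name `classify_motif` and the labels it returns are assumptions for illustration only. The patterns follow the standard IUPAC codes written in DNA letters (D = A/G/T, R = A/G, H = A/C/T).

```python
import re

# IUPAC codes in DNA letters: D = A/G/T, R = A/G, H = A/C/T.
# RRACH is a subset of DRACH (R is narrower than D at the first position).
DRACH = re.compile(r"[AGT][AG]AC[ACT]")
RRACH = re.compile(r"[AG][AG]AC[ACT]")


def classify_motif(kmer):
    """Label a 5-mer as 'RRACH', 'DRACH' (DRACH but not RRACH), or 'other'.

    Hypothetical helper for illustration; not part of m6anet.
    """
    # Accept RNA or DNA letters, in any case.
    kmer = kmer.upper().replace("U", "T")
    if RRACH.fullmatch(kmer):
        return "RRACH"
    if DRACH.fullmatch(kmer):
        return "DRACH"
    return "other"


for kmer in ["GGACA", "AAACT", "TGACT", "GGTCA"]:
    print(kmer, classify_motif(kmer))
```

Both examples above (GGACA, AAACT) fall under RRACH; a 5-mer such as TGACT matches DRACH only, because its first base is not a purine.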

Feel free to let me know if the questions above are not reasonable.

Many thanks

YCCHEN

chrishendra93 commented 2 years ago

hi @YCCHEN23

  1. I think it is a good feature to have in the future. I have experimented with this a bit, but the code has not been tested yet. I hope to have this feature in the next release so that people have the option of pooling reads for genes across different biological replicates.
  2. I think this should also be an easy feature to add. Previously I was not sure how informative it would be, since you can technically access the information from data.json, so I removed the motif column to reduce the output file size. I can add this feature if you think it is helpful.
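
For readers following point 1, the arithmetic behind pooling can be sketched in a few lines of Python. This is only an illustration of the counts in the question (21, 18, and 19 reads against a threshold of 20), not the untested m6anet code mentioned above; the function names and the dictionary layout are assumptions.

```python
from collections import Counter

MIN_READS = 20  # default minimum read count discussed in this thread


def pool_read_counts(replicates):
    """Sum per-gene read counts across replicates (hypothetical helper)."""
    pooled = Counter()
    for counts in replicates:
        pooled.update(counts)  # Counter.update adds counts per key
    return dict(pooled)


def genes_passing(counts, min_reads=MIN_READS):
    """Return the genes whose read count reaches the threshold."""
    return sorted(gene for gene, n in counts.items() if n >= min_reads)


# Read counts from the example in the question.
replicates = [
    {"gene_A1": 21},  # Replicate.1
    {"gene_A1": 18},  # Replicate.2
    {"gene_A1": 19},  # Replicate.3
]

for i, counts in enumerate(replicates, start=1):
    print(f"Replicate.{i} passing:", genes_passing(counts))

pooled = pool_read_counts(replicates)
print("Pooled counts:", pooled)
print("Pooled passing:", genes_passing(pooled))
```

Analysed separately, only Replicate.1 passes for gene_A1; pooled, the gene has 58 reads and clears the threshold. Whether pooled reads are appropriate for the trained model is a separate question that this sketch does not address.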

Thank you for your suggestion

Regards

Christopher Hendra