Issues with Read Selection

GoekeLab / m6anet

Detection of m6A from direct RNA-Seq data

https://m6anet.readthedocs.io/

MIT License


Issues with Read Selection #163

Open AKjhu opened 5 months ago

AKjhu commented 5 months ago

Hello, I think I reached out via email but wanted to post this publicly as well. I am currently facing an issue with my m6anet data analysis. I have run m6anet on two conditions and am comparing them to identify differential expression. Both conditions are based on the same genome of ~17000 base pairs, and the experimental condition has about 6 times as many reads as the wild type condition.

When analyzing the site_proba.csv files output by m6anet, which contain the aggregated modification probabilities at the site level (as opposed to the read level), we see that for base pairs 0-9000, about 6 times as many reads are selected by m6anet for the experimental group (inh_data.site_proba.csv) as for the control group (data.site_proba.csv). That makes sense given the total number of reads there are to select from. However, at some point around base pair 9000 this relationship inverts, and about 6 times FEWER reads are selected by m6anet to generate the aggregated site probability. This does not make sense, since we know that at any position around 6 times more reads in the experimental group align with a site than in the wild type group.

Looking at the other m6anet outputs, it seems likely that this happens somewhere in the m6anet dataprep function or even before. So I am curious what criteria m6anet uses to select reads to analyze for modification probabilities, and why this flip might be occurring in the data.

Attachments: data.site_proba.csv, inh_data.site_proba.csv

yuukiiwa commented 5 months ago

Hi @AKjhu,

Sorry for the delayed reply!

Please do not send any emails to @chrishendra93 or me because we can only reply to issues here.

m6anet supports only transcriptome alignment, so please check whether you are aligning your samples to a transcriptome reference (cDNA.fasta/transcriptome.fasta).

The data.indiv_proba.csv contains the per-read probability for all of the reads, while the data.site_proba.csv contains the site probability based on those reads. For each site, m6anet randomly picks 20 reads and calculates a site probability from them.
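To make the sampling step concrete, here is a minimal sketch of drawing 20 reads at random for one site and reducing them to a single number. It is not m6anet's code: the function name, the fallback for sites with fewer than 20 reads, and the plain mean used for aggregation are all assumptions for illustration, since the thread does not describe how the sampled reads are actually pooled.

```python
import random
from statistics import mean


def sample_site_probability(read_probs, n_sample=20, seed=None):
    """Illustrative only: pick up to n_sample reads at random for one site
    and aggregate their per-read probabilities.

    Assumptions (not taken from m6anet itself):
      - a site with fewer than n_sample reads uses every read it has
      - the aggregate is a plain mean, a placeholder for the real pooling
    """
    if not read_probs:
        return None  # nothing to sample from
    rng = random.Random(seed)
    k = min(n_sample, len(read_probs))
    picked = rng.sample(list(read_probs), k)  # sampling without replacement
    return mean(picked)
```

Under this sketch the number of reads sampled per site is capped at 20, so a site with 120 reads and a site with 20 reads contribute the same number of reads to the site probability.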

If you are looking into comparing an experimental and a control group, it will be great if you can check out xpore, which does differential modification analysis of two or more conditions.

Thanks!

Best wishes, Yuk Kei

AKjhu commented 4 months ago

I guess my question was more related to the fact that at some position in the genome, we see m6anet rapidly decrease the number of reads it selects in the data.indiv_proba.csv files. If this file matches the total number of reads, we should be seeing this decrease in our alignment files, which we confirmed we don't see at all. I was curious as to how m6anet actually iterates through these reads and picks them to be added to the file, because something seems to be causing some of the reads after a certain position to fail to be selected.
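One way to pin down where the drop starts is to count the rows per position in each condition's data.indiv_proba.csv and take the ratio. The sketch below assumes the file has one row per read per site, with columns named transcript_id and transcript_position; those column names and both function names are assumptions and should be checked against the actual file header.

```python
import csv
from collections import Counter


def reads_per_position(csv_file):
    """Count rows (reads) per (transcript_id, transcript_position).

    csv_file is any open text file or file-like object. The column names
    are assumed, not confirmed by this thread.
    """
    counts = Counter()
    for row in csv.DictReader(csv_file):
        key = (row["transcript_id"], int(row["transcript_position"]))
        counts[key] += 1
    return counts


def count_ratio(exp_counts, ctrl_counts):
    """Experimental/control read-count ratio at positions found in both."""
    shared = sorted(set(exp_counts) & set(ctrl_counts))
    return {key: exp_counts[key] / ctrl_counts[key] for key in shared}
```

Opening both files with open(path, newline="") and passing them to reads_per_position, then sorting the output of count_ratio by position, should show whether the ratio changes abruptly near position 9000 or drifts gradually. The same counts can then be compared against per-position coverage from the alignment files.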