Further question about "p-value in the presence of NAs"

ericsu6 commented 1 month ago

First off, thank you for this amazing tool!

We are high school students from the Broad Summer Scholars Program using rMATS. We are excited to learn from your RMATS software whether changes in exon usage are shared across different species when they are exposed to high temperature—or whether species instead tend to have their own unique responses. To be sure we understand, we had a couple of questions regarding: how this program dealt with datasets including NAs (and subsequent calculations of the p-value and FDR), and the calculation and use of the correlation parameter when data for a given exon are not complete.

Our project entails finding the proteins that certain species such as camels use to withstand extreme temperatures. The data for 12 species was run through rMATS and we are using R to analyze datasets of different splicings for each of the 12 species. Our goal is to find the exons that are either increased or decreased in inclusion in the transcripts at a higher temperature. After finding these exons, we will deduce what the function of that protein connected to the exon is.

We have noticed that for certain exons, there are no paired individuals among the five in our data set that have full data (each pair has an NA), yet the p-value calculated is significant. We are concerned that this may give us exons that are significant and have a significantly different inclusion at higher temperatures when in reality we simply lacked data from which to make confident conclusions.

The ultimate goal of this project is to hopefully identify the exons that have altered inclusion at a higher temperature and see if we can draw similarities for those exons in humans.

(The example in the file linked is an exon from the exon skipping data of the Bactrian Camel. There is only one complete pair, yet the FDR and p-value are 0)

We wonder if, perhaps, when RMATS is set to do paired analysis but finds no complete pairs, it instead defaults to unpaired analysis, and then produces a significant FDR if the difference between the group-level means is significant?

Thank you in advance, Eric Su and Georgia Allaire

EricKutschera commented 1 month ago

When rmats is run with --paired-stats PAIRADISE is used. It will use the values for pairs where each of the individuals in that pair have at least 1 read (inclusion or skipping) for the event: https://github.com/Xinglab/PAIRADISE/blob/master/pairadise/src/pairadise_model/R/pairadise.R#L140

For an event where each pair has at least 1 NA IncLevel the PValue and FDR from --paired-stats should be NA. In the example you posted the last pair has IncLevel1=0.035 and IncLevel2=1.0 and --paired-stats will use the read counts for that pair to calculate a PValue and FDR. Depending on the read counts PValue=0 and FDR=0 could be reasonable. The unpaired model outputs PValue=1 if any replicate has NA IncLevel: https://github.com/Xinglab/rmats-turbo/issues/318

If you have example events can you post the full row from the MATS output file and also the version of rMATS and your command line?

ericsu6 commented 1 month ago

Hi Eric,

Thank you for your response. We ran rMATs with paired stats and it worked. For genes with no paired data, the p-value and FDR was NA.

Xinglab / rmats-turbo

Further question about "p-value in the presence of NAs" #417