RNA-Seq by example: the concept of FDR should be elaborated

nirjharaloy commented 2 years ago

I was going through the theories in chapter 6 and I found that the concept of FDR could be clearer. This is what's written in the book:

"FDR - the False Discovery Rate - this column represents the fraction of false discoveries for all the rows above the row where the value is listed. For example, if in row number 300 the FDR is 0.05, it means that if you were cut the table at this row and accept all genes at and above it as differentially expressed then, 300 * 0.05 = 15 genes out of the 300 are likely to be false positives. The values in this column are also called q-values."

Now, although the concept of FDR was derived from the concept of false positives, mathematically the FDR (or q-value) is not simply the number of false positives. We should rather say that the FDR is the expected proportion of rejected null hypotheses that were rejected falsely. To put that succinctly: FDR = E(Q) with Q = V/(V+S), where V = number of null hypotheses falsely rejected and S = number of null hypotheses correctly rejected.
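The arithmetic in the quoted book passage can be sketched in a few lines of Python. This only illustrates the book's reading (row number times the listed FDR gives an expected count of false positives); the function name `expected_false_positives` is made up for this example.

```python
def expected_false_positives(row_number, fdr):
    """Expected number of false discoveries if the table is cut at row_number.

    Assumes the table is sorted by significance and that fdr is the value
    listed in the FDR column at that row (the book's reading).
    """
    if row_number < 0 or not 0.0 <= fdr <= 1.0:
        raise ValueError("row_number must be >= 0 and fdr must be in [0, 1]")
    return row_number * fdr


# The book's example: cut the table at row 300, where the FDR column reads 0.05.
print(round(expected_false_positives(300, 0.05), 6))
```

The result is a count derived from the FDR, while the FDR itself stays a proportion, which is the distinction the comment is pointing at.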

This was the original equation proposed by Benjamini and Hochberg in 1995. https://www.jstor.org/stable/2346101?seq=1#metadata_info_tab_contents
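For readers wondering where such an FDR column can come from, here is a minimal sketch of the step-up adjustment associated with that paper. It assumes the column holds Benjamini-Hochberg adjusted p-values, which this thread does not confirm for any particular tool; `bh_adjust` is a hypothetical helper name.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    # Indices of the p-values, from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down to the smallest so that the
    # adjusted values never increase as the raw p-values decrease.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Toy p-values, not real data.
print(bh_adjust([0.01, 0.04, 0.03, 0.005]))
```

Each adjusted value is the smallest FDR level at which that row would still be called significant under this procedure.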

However, they defined Q=0 when V+S=0, which I never understood. Mathematically, Q=0 when V=0 and S>0. If V+S=0, the ratio is 0/0, which is undefined! How can the expectation of an undefined quantity be zero?
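One way to read that convention: when nothing is rejected, no false rejection can have been made, so the proportion is set to 0 by definition. That keeps Q defined for every outcome, so an expectation over repeated experiments can be taken. A minimal sketch, with made-up function names and toy random counts standing in for real experiments:

```python
import random


def false_discovery_proportion(v, s):
    """Q = V / (V + S), with the convention Q = 0 when nothing is rejected."""
    rejected = v + s
    if rejected == 0:
        return 0.0  # no discoveries, hence no false discoveries
    return v / rejected


def simulate_fdr(n_experiments=10_000, seed=0):
    """Average Q over many toy experiments as a rough stand-in for E(Q)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_experiments):
        v = rng.randint(0, 2)  # toy count of false rejections (assumption)
        s = rng.randint(0, 5)  # toy count of correct rejections (assumption)
        total += false_discovery_proportion(v, s)
    return total / n_experiments


print(false_discovery_proportion(0, 0))
print(simulate_fdr())
```

Without the special case, any experiment with V+S=0 would raise a division error and the average could not be computed at all.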

ialbert commented 2 years ago

The problem with trying to pin down what the words actually mean is that when we try to be precise, the definitions become very convoluted and we end up needing to continuously explain every single term - to the point where we are not even sure what the definition says.

In general, in bioinformatics, we fundamentally use a handwaving argument: the p-values and FDRs are little more than rough estimates and sanity checks that allow us to identify certain properties of the results.

As I mentioned in the book, I have observed that nobody uses statistics correctly - including statisticians - they just misuse it in a less obvious manner than others.

Long story short, I used the definition that I did because I felt it was good enough to convey the main interpretation; making it more "proper" might make it more convoluted and less useful.

But I do welcome it if people find and identify the actual, correct, proper definition.

nirjharaloy commented 2 years ago

I completely agree with you. I tried to put the equation in words, which can be "convoluted" to many. You are absolutely right that the use of statistics is just wild in the literature. We have an RNA-seq dataset where the fold changes were described as negative numbers to indicate downregulation. Many of our faculty members have such datasets. However, mathematically, fold changes can never be negative numbers: they are ratios, so they are always positive before the log2 transformation, and downregulation shows up as a ratio between 0 and 1.
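A small sketch of the point about fold changes. It assumes the negative numbers in such datasets follow the signed fold-change convention, where -2 stands for a twofold decrease (a ratio of 0.5); that is a guess about those datasets, not something stated here, and both function names are made up.

```python
import math


def log2_fold_change(treated, control):
    """log2 of the expression ratio; negative values mean downregulation."""
    if treated <= 0 or control <= 0:
        raise ValueError("expression values must be positive")
    return math.log2(treated / control)


def signed_to_ratio(signed_fc):
    """Convert a signed fold change (assumed convention) to a plain ratio."""
    if -1 < signed_fc < 1:
        raise ValueError("signed fold changes are not defined between -1 and 1")
    # Positive values are already ratios; -2 means the ratio 1/2.
    return signed_fc if signed_fc > 0 else -1.0 / signed_fc


print(signed_to_ratio(-2))       # ratio for a twofold decrease
print(log2_fold_change(5, 10))   # log2 of the same twofold decrease
```

The ratio itself is always positive; only its logarithm, or the signed shorthand, carries a minus sign.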

I think at least mentioning the equation, with a link to the original paper, might help some people. That said, I could never make sense of how the authors defined Q=0 if V+S=0. That's a mystery to me and I have always wanted to discuss it in a forum.

Thank you.

biostars / biostar-handbook #174