fslaborg / FSharp.Stats

statistical testing, linear algebra, machine learning, fitting and signal processing in F#
https://fslab.org/FSharp.Stats/

[BUG] Qvalue calculation is too conservative #171

Closed bvenn closed 2 years ago

bvenn commented 2 years ago

Describe the bug

In FSharp.Stats.Testing.Multiple.Qvalues, local FDRs are calculated and afterwards smoothed so that the q value of p_i is the minimal FDR of all p values greater than or equal to p_i.

While the local FDR calculation is correct, the smoothing does not take the minimal FDR of the p values greater than p_i, but the maximal FDR of the p values lower than p_i, which makes the computation more conservative than it needs to be.

image
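The difference between the two smoothing directions can be written down as a sketch. The notation here is an assumption and does not come from the library: m tests, ordered p values p_(1) <= ... <= p_(m), and an estimated proportion of true null hypotheses, with the raw FDR estimate taken in the usual Storey form.

```latex
% Assumed notation: m tests, ordered p values p_{(1)} \le \dots \le p_{(m)},
% \hat{\pi}_0 = estimated proportion of true null hypotheses.
\begin{align*}
  \widehat{\mathrm{FDR}}\bigl(p_{(i)}\bigr)
    &= \frac{\hat{\pi}_0 \, m \, p_{(i)}}{i}
    && \text{raw estimate at rank } i \\
  \hat{q}\bigl(p_{(i)}\bigr)
    &= \min_{j \ge i} \widehat{\mathrm{FDR}}\bigl(p_{(j)}\bigr)
    && \text{intended smoothing} \\
  \tilde{q}\bigl(p_{(i)}\bigr)
    &= \max_{j \le i} \widehat{\mathrm{FDR}}\bigl(p_{(j)}\bigr)
    && \text{roughly the behaviour described above}
\end{align*}
% Both extrema include j = i, hence
% \tilde{q}(p_{(i)}) \ge \widehat{\mathrm{FDR}}(p_{(i)}) \ge \hat{q}(p_{(i)}).
```

Under these assumptions the running maximum can never fall below the intended q value, which matches the overly conservative results reported here.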

Solution

Modify the bindBy function accordingly.

bvenn commented 2 years ago

The issue is more complex than I thought. The strategy works for monotonic p values, but if many identical p values exist, the sorting corrupts the q value smoothing: when many identical keys (p values) exist, it is not clear which index to choose.

Reproduce


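#r "nuget: Plotly.NET, 2.0.0-preview.16"
open Plotly.NET

// Original positions 0..9999; Array.Sort below permutes them alongside the keys.
let index = Array.init 10000 id

// 10000 keys in ascending order with a long run of ties at 5000.
let testValues =
    [|
        [|1. .. 5000.|]
        Array.init 2000 (fun _ -> 5000.)
        [|5001. .. 8000.|] // floats, to match the other segments
    |]
    |> Array.concat

testValues |> Array.indexed |> Chart.Point |> Chart.show
// Array.Sort is not a stable sort, so the indices of tied keys may be reordered.
System.Array.Sort(testValues, index)
index |> Array.indexed |> Chart.Point |> Chart.show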

image

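Edit: When Seq.sort or List.sort is used instead of Array.Sort, the problem seems to be solved, presumably because both are stable sorts that keep tied keys in their original order, whereas Array.Sort is not stable.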
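One way to make the index order independent of the sort's stability is to break ties explicitly by original index. The following is a minimal sketch; `sortedIndices` is a hypothetical helper written for illustration and is not part of FSharp.Stats.

```fsharp
// Hypothetical helper for illustration; not part of FSharp.Stats.
// Returns the original indices in ascending key order, ties ordered by index.
let sortedIndices (keys: float[]) : int[] =
    keys
    |> Array.indexed                        // (original index, key)
    |> Array.sortBy (fun (i, k) -> (k, i))  // compare key first, then index
    |> Array.map fst

// Small analogue of the reproduction above: ascending keys with a block of ties.
let keys =
    Array.concat [ [| 1. .. 5. |]; Array.create 4 5.; [| 6. .. 8. |] ]

let order = sortedIndices keys
printfn "%A" order
```

Because the keys are already ascending and ties are resolved by index, `order` should come out as the identity permutation, whichever sorting algorithm runs underneath.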

bvenn commented 2 years ago

The standard q value implementation is fixed. I decided to omit the bindBy function, since it reduces readability and causes harm when the p value collection is too large. The monotonization of the q values is now packed within the respective function. Unit tests must be corrected, and Qvalues.ofPvaluesRobust requires further inspection of its validity as well as proper documentation.
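Folding the monotonization into the q value function itself could look roughly like the sketch below. This is not the FSharp.Stats implementation: `qValuesSketch` is a hypothetical name, pi0 is assumed to be supplied by the caller, and the raw FDR estimate is assumed to be pi0 * m * p / rank.

```fsharp
// Hypothetical sketch; not the FSharp.Stats implementation.
// pi0 (estimated proportion of true nulls) is assumed to be given.
let qValuesSketch (pi0: float) (pValues: float[]) : float[] =
    let m = float pValues.Length
    // Sort by (p value, original index) so that ties are ordered deterministically.
    let sorted =
        pValues
        |> Array.indexed
        |> Array.sortBy (fun (i, p) -> (p, i))
    // Raw FDR estimate at each ordered p value: pi0 * m * p / rank.
    let fdr =
        sorted
        |> Array.mapi (fun rank (_, p) -> pi0 * m * p / float (rank + 1))
    // Running minimum from the right: q at rank i is the minimum of fdr.[i..].
    // scanBack returns m + 1 states; the last one is the initial value.
    let qSorted = Array.scanBack min fdr infinity
    // Write the q values back in the original input order.
    let result : float[] = Array.zeroCreate pValues.Length
    sorted |> Array.iteri (fun rank (i, _) -> result.[i] <- qSorted.[rank])
    result

let q = qValuesSketch 1.0 [| 0.01; 0.04; 0.03; 0.5 |]
printfn "%A" q
```

Tied p values receive different ranks in this sketch, but the running minimum from the right assigns all of them the q value of the highest-ranked tie, so the order within a tie does not change the result.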

bvenn commented 2 years ago

The robust q value version has an additional term that corrects small p values, especially when the number of tests is low. It is described in Storey, J.D. (2002), A direct approach to false discovery rates. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 64: 479-498. https://doi.org/10.1111/1467-9868.00346 (equation 9).

image
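For orientation, the robust estimate is commonly written in the form below. This is a sketch under assumed notation (m tests, ordered p values, estimated null proportion); check it against the cited paper rather than treating it as the exact expression used in Qvalues.ofPvaluesRobust.

```latex
% Assumed notation: m tests, ordered p values p_{(1)} \le \dots \le p_{(m)},
% \hat{\pi}_0 = estimated proportion of true null hypotheses.
\[
  \widehat{\mathrm{FDR}}_{\mathrm{robust}}\bigl(p_{(i)}\bigr)
    = \frac{\hat{\pi}_0 \, m \, p_{(i)}}
           {i \,\bigl(1 - (1 - p_{(i)})^{m}\bigr)}
\]
% The extra factor 1 - (1 - p)^m is the probability that at least one of m
% independent null p values falls at or below p. It is close to 1 unless
% m * p is small, so it mainly affects small p values and small m.
```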