CSBiology / BioFSharp

Open source bioinformatics and computational biology toolbox written in F#.
https://csbiology.github.io/BioFSharp/
MIT License

[BUG] Include all possible bin sizes within GSEA (0 and 1) #102

Open bvenn opened 4 years ago

bvenn commented 4 years ago

Describe the bug

For gene set enrichment analysis (GSEA), Fisher's exact test is applied to identify over- or underrepresented groups. The method is based on multiple hypergeometric distribution tests, one per bin.

In BioFSharp.Stats.OntologyEnrichment.CalcHyperGeoPvalue, two cases are treated specially: when the number of differentially expressed genes in a given bin is 0 or 1, the p value is reported as nan. While these cases might not be of interest, a true p value can be calculated for them.
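To illustrate that both cases have a well-defined p value, here is a minimal sketch in Python (not the BioFSharp implementation). It assumes a one-sided enrichment test, P(X >= k); the function and parameter names are hypothetical and chosen only for this example.

```python
from math import comb

def hypergeom_pmf(k, total, bin_size, n_de):
    """P(X = k): probability that exactly k of the n_de differentially
    expressed genes fall into a bin of bin_size genes (total genes overall)."""
    # Outside the support of the distribution the probability is zero.
    if k < max(0, n_de - (total - bin_size)) or k > min(bin_size, n_de):
        return 0.0
    return comb(bin_size, k) * comb(total - bin_size, n_de - k) / comb(total, n_de)

def enrichment_p_value(k, total, bin_size, n_de):
    """Upper-tail p value P(X >= k) (assumed one-sided enrichment test)."""
    upper = min(bin_size, n_de)
    return sum(hypergeom_pmf(i, total, bin_size, n_de) for i in range(k, upper + 1))

# Hypothetical numbers: 1000 genes, 100 of them differentially expressed,
# and a bin of 20 genes containing 0 or 1 differentially expressed genes.
p_zero = enrichment_p_value(0, total=1000, bin_size=20, n_de=100)
p_one = enrichment_p_value(1, total=1000, bin_size=20, n_de=100)
```

With k = 0 the sum covers the whole distribution, so the p value is 1 (up to floating-point error); with k = 1 it is 1 minus the probability of observing no hit. Neither case needs to be reported as nan.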

For further analysis, a multiple testing correction can be performed. The Benjamini-Hochberg method calculates false discovery rates (FDR) for every p value. nan values cannot be processed, so they get filtered out. Filtering out p values that could have been calculated distorts the FDR calculation, because the correction is then based on fewer tests than were actually performed, and it keeps the p values flatter than expected.
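The effect of dropping tests before the correction can be shown with a small sketch of the Benjamini-Hochberg step-up procedure in Python. This is an illustration under stated assumptions, not the BioFSharp code: the function name is hypothetical and the p values are made-up toy numbers, with the last three standing in for bins that currently yield nan.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p values in the input order."""
    m = len(p_values)
    # Indices of the p values sorted in ascending order.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Toy p values (hypothetical): three informative bins, followed by three bins
# with 0 or 1 hits whose true p values are large.
all_bins = [0.001, 0.01, 0.04, 1.0, 1.0, 0.9]
informative_only = all_bins[:3]  # what remains once the nan bins are dropped

fdr_all = benjamini_hochberg(all_bins)
fdr_filtered = benjamini_hochberg(informative_only)

for p, full, filtered in zip(informative_only, fdr_all, fdr_filtered):
    print(f"p={p}: adjusted with all bins={full:.4f}, after dropping bins={filtered:.4f}")
```

In this toy example every adjusted p value is smaller when the three bins are silently removed, because the correction assumes 3 tests instead of 6.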

Often, bins with fewer than 5 members are not of interest and are rejected anyway. However, this filtering should be controlled by the operator and has to be performed after the enrichment analysis and prior to the multiple testing correction.

The current filter inside the GSEA function is applied implicitly and leads to results that cannot be easily interpreted.

Solution