Describe the bug
For gene set enrichment analysis (GSEA), Fisher's exact test is applied to identify over- or underrepresented groups. The method is based on multiple hypergeometric tests.
In `BioFSharp.Stats.OntologyEnrichment.CalcHyperGeoPvalue`, two cases are treated separately: when the number of differentially expressed genes in a bin is 0 or 1, the p-value is reported as `nan`. While these cases might not be of interest, a true p-value can be calculated for them.
For further analysis, a multiple testing correction can be performed. The Benjamini-Hochberg method calculates a false discovery rate (FDR) for every p-value. `nan`s cannot be processed, so they get filtered out. Filtering out p-values that could have been calculated distorts the FDR calculation: the correction sees fewer tests than were actually carried out, which keeps the adjusted p-values flatter than expected.
Bins with fewer than 5 members are often not of interest and are rejected anyway. However, this filtering should be controlled by the user, and it has to be performed after the enrichment analysis and before the multiple testing correction.
The current filter within the GSEA leads to results that cannot be easily interpreted.
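To illustrate that a p-value exists for bins with 0 or 1 differentially expressed genes, here is a minimal sketch in Python (chosen only for illustration, not the BioFSharp implementation). It assumes the over-representation p-value is the upper tail P(X >= k) of the hypergeometric distribution. The function name, parameter names and example numbers are hypothetical.

```python
from math import comb

def hypergeo_upper_tail(k, n_bin, n_de, n_total):
    """P(X >= k) for X ~ Hypergeometric.

    k       : differentially expressed (DE) genes observed in the bin
    n_bin   : number of genes in the bin
    n_de    : total number of DE genes
    n_total : total number of genes (the universe)
    All names are hypothetical and chosen for this sketch.
    """
    upper = min(n_bin, n_de)          # largest possible number of DE genes in the bin
    denom = comb(n_total, n_bin)      # number of ways to draw the bin from the universe
    numer = sum(
        comb(n_de, i) * comb(n_total - n_de, n_bin - i)
        for i in range(k, upper + 1)
    )
    return numer / denom

# Hypothetical example: 1000 genes, 100 of them DE, a bin of 20 genes.
p_zero = hypergeo_upper_tail(0, 20, 100, 1000)  # exactly 1.0: at least 0 hits is certain
p_one = hypergeo_upper_tail(1, 20, 100, 1000)   # roughly 0.88, a valid (if uninteresting) p-value
```

Both values are ordinary probabilities, so nothing forces these two cases to be reported as `nan`.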
Solution
- Remove the if expression within `CalcHyperGeoPvalue`
- Add additional context for filtering procedures to the documentation
- Consider renaming the functions to lower case
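As possible context for the documentation, the following Python sketch shows how silently dropping tests changes Benjamini-Hochberg adjusted p-values. It is a plain textbook implementation of the step-up adjustment, not the BioFSharp code, and the p-values are made up for illustration. The last three stand for bins with 0 or 1 differentially expressed genes.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices from smallest to largest p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity and capping at 1.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical p-values; the last three belong to bins with 0 or 1 DE genes.
p_all = [0.001, 0.01, 0.03, 0.88, 1.0, 1.0]
p_filtered = p_all[:3]  # what remains if those bins are reported as nan and dropped

adj_all = benjamini_hochberg(p_all)            # first three: about 0.006, 0.03, 0.06
adj_filtered = benjamini_hochberg(p_filtered)  # about 0.003, 0.015, 0.03
```

In this example, dropping the three tests halves the adjusted p-values of the remaining bins, because the correction assumes 3 tests instead of 6. Whether bins are excluded should therefore be an explicit, documented decision of the user.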