pathways for which P-values were not calculated properly due to unbalanced gene-level statistic values

shangguandong1996 commented 4 years ago

Hi, dear developer. When using fgsea, I got the warning shown in the issue title.

This warning makes the corresponding P-value for the affected pathways become NA, so I am curious about its cause. Does it mean that the number of positive statistic values is not the same as the number of negative ones?

> table(vals > 0)

FALSE  TRUE 
24237 13099

> sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS:   /opt/sysoft/R-3.6.1/lib64/R/lib/libRblas.so
LAPACK: /opt/sysoft/R-3.6.1/lib64/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] fgsea_1.12.0    Rcpp_1.0.3      forcats_0.4.0   stringr_1.4.0  
 [5] dplyr_1.0.0     purrr_0.3.3     readr_1.3.1     tidyr_1.0.0    
 [9] tibble_2.1.3    ggplot2_3.2.1   tidyverse_1.3.0

loaded via a namespace (and not attached):
 [1] tidyselect_1.1.0    haven_2.2.0         lattice_0.20-38    
 [4] colorspace_1.4-1    vctrs_0.3.2         generics_0.0.2     
 [7] rlang_0.4.7         pillar_1.4.3        glue_1.4.1         
[10] withr_2.1.2         DBI_1.1.0           BiocParallel_1.19.6
[13] dbplyr_1.4.2        modelr_0.1.5        readxl_1.3.1       
[16] lifecycle_0.2.0     munsell_0.5.0       gtable_0.3.0       
[19] cellranger_1.1.0    rvest_0.3.5         parallel_3.6.1     
[22] fansi_0.4.0         broom_0.5.3         scales_1.1.0       
[25] backports_1.1.5     jsonlite_1.6        fs_1.3.1           
[28] gridExtra_2.3       fastmatch_1.1-0     hms_0.5.2          
[31] packrat_0.5.0       stringi_1.4.3       grid_3.6.1         
[34] cli_2.0.0           tools_3.6.1         magrittr_1.5       
[37] lazyeval_0.2.2      crayon_1.3.4        pkgconfig_2.0.3    
[40] ellipsis_0.3.0      Matrix_1.2-18       data.table_1.12.6  
[43] xml2_1.2.2          reprex_0.3.0        lubridate_1.7.4    
[46] assertthat_0.2.1    httr_1.4.1          rstudioapi_0.10    
[49] R6_2.4.1            nlme_3.1-143        compiler_3.6.1

Best wishes

Guandong Shang

assaron commented 4 years ago

That means that the statistic values are not symmetric and the probability of getting a positive (or negative) enrichment score is far from 0.5. In such cases it can be very hard to estimate P-values. It can be a sign that the gene expression data were not properly normalized before the differential expression analysis.

shangguandong1996 commented 4 years ago

But if this is caused by improper normalization, why can P-values be estimated for some pathways but not for others?

assaron commented 4 years ago

That probability depends on pathway size. For some sizes it is closer to 0.5, so the estimate is not affected too badly.

shangguandong1996 commented 4 years ago

Please forgive me if I have misunderstood something. So fgsea assumes that the numbers of up- and down-regulated genes are approximately the same? But that assumption may fail for some samples.

assaron commented 4 years ago

The GSEA P-value is calculated as P(ES >= x)/P(ES > 0), where x is the enrichment score of the tested pathway (assuming it to be positive) and ES is the enrichment score of a random gene set of the same size. On some datasets and pathways the denominator probability P(ES > 0) can be very low and hard to estimate properly. If the ranking is balanced, it will be around 0.5 and can be estimated easily. So, no, there is no explicit requirement of balance, but if the ranking is unbalanced, there will be warnings.
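
To make the ratio above concrete, here is a minimal base-R sketch. It is not fgsea's implementation: the ranking `stats`, the helpers `enrichment_score()` and `null_es()`, the pathway sizes, and the number of permutations are all assumptions made up for illustration, and the null distribution is estimated by plain random sampling.

```r
set.seed(1)

# Hypothetical, deliberately unbalanced ranking with more negative than
# positive values, sorted in decreasing order.
stats <- sort(c(rnorm(300, mean = 1), rnorm(700, mean = -1)), decreasing = TRUE)
n <- length(stats)

# Signed running-sum enrichment score for the genes at positions `idx`
# of the sorted ranking (hits weighted by |statistic|, exponent 1).
enrichment_score <- function(stats, idx) {
  hit <- seq_along(stats) %in% idx
  step <- ifelse(hit,
                 abs(stats) / sum(abs(stats[hit])),
                 -1 / (length(stats) - length(idx)))
  running <- cumsum(step)
  # Return the deviation from zero with the largest magnitude, keeping its sign.
  if (max(running) >= -min(running)) max(running) else min(running)
}

# Null distribution: ES of random gene sets of a given size.
null_es <- function(size, nperm = 1000) {
  replicate(nperm, enrichment_score(stats, sample(n, size)))
}

# Estimate the denominator P(ES > 0) for a few pathway sizes.
sizes <- c(15, 50, 200)
p_pos <- sapply(sizes, function(k) mean(null_es(k) > 0))
names(p_pos) <- sizes
print(p_pos)

# Empirical P-value for one hypothetical pathway of 50 genes drawn from
# the top 300 positions, so that its ES is positive.
pathway <- sample(300, 50)
x <- enrichment_score(stats, pathway)
null50 <- null_es(50)
p_value <- mean(null50 >= x) / mean(null50 > 0)
print(c(ES = x, p_value = p_value))
```

If `mean(null50 > 0)` is close to zero, only a handful of permutations contribute to the ratio, which is the situation the warning is about.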

Still, GSEA makes much more sense if the ranking is more or less balanced. If it is far from balanced, a one-tailed test can be considered, which does not normalize by P(ES > 0). This can be controlled with the scoreType parameter.
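
A hedged sketch of a one-tailed run follows. It assumes an installed fgsea release that supports `scoreType` (older releases may not have it) and uses the `examplePathways` and `exampleRanks` datasets shipped with the package in place of real data. The `minSize` and `maxSize` values are arbitrary choices for the example.

```r
# "pos" tests only for positive enrichment; "neg" and the default "std"
# are the other accepted values.
score_type <- "pos"

if (requireNamespace("fgsea", quietly = TRUE)) {
  # Example data bundled with fgsea, standing in for the user's own ranking.
  data(examplePathways, package = "fgsea")
  data(exampleRanks, package = "fgsea")

  res <- fgsea::fgsea(pathways  = examplePathways,
                      stats     = exampleRanks,
                      minSize   = 15,
                      maxSize   = 500,
                      scoreType = score_type)

  # Show the pathways with the smallest P-values.
  print(head(res[order(res$pval), c("pathway", "pval", "padj", "ES")]))
} else {
  message("fgsea is not installed; skipping the example run")
}
```

With a ranking that is mostly negative, as in the `table(vals > 0)` output above, `scoreType = "neg"` might be the more natural one-tailed choice, depending on which direction of enrichment is of interest.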

shangguandong1996 commented 4 years ago

Thanks, Alexey, I get it.

ctlab / fgsea

pathways for which P-values were not calculated properly due to unbalanced gene-level statistic values #75