Open therealgenna opened 3 years ago
@therealgenna Thank you for your suggestion. Could please provide repeatable code and data so that we can modify this problem more conveniently.
@huerqiang I'll try to send you a code with an example, but I've already pointed to the problem, which is easy to see.
From the code of gseaScores
(https://github.com/YuLab-SMU/DOSE/blob/master/R/gsea.R), lines 470-475:
Phit <- Pmiss <- numeric(N)
hits <- names(geneList) %in% geneSet ## logical
Phit[hits] <- abs(geneList[hits])^exponent
NR <- sum(Phit)
Phit <- cumsum(Phit/NR)
I have a case that all geneList[hits]
equal 0
. Then clearly NR=0
and in the last line above Phit
becomes NaN
s.
I think maybe a reasonable fix would be to do the following - add the indicated line:
Phit <- Pmiss <- numeric(N)
hits <- names(geneList) %in% geneSet ## logical
Phit[hits] <- abs(geneList[hits])^exponent
NR <- sum(Phit)
# possible fix - add this line here (maybe also issue some warning):
if(NR==0) {Phit[hits] <- 1; NR <- sum(Phit)} # NR should then equal Nh
Phit <- cumsum(Phit/NR)
I can create an example simply by picking one of the gene sets and setting scores for those genes to 0 (so that geneList[hits]
will all be zero. I will send this later if you think it's useful.
Hi @huerqiang , here is the example and the data is attached in error_exampleData.RData.zip
(I had to zip it because I cannot upload RData files)
error_exampleData.RData.zip
:
> gsea = GSEA(geneList, TERM2GENE=wpid2gene, TERM2NAME=wpid2name,pAdjustMethod="BH", pvalueCutoff=1)
preparing geneSet collections...
GSEA analysis...
Error in if (abs(max.ES) > abs(min.ES)) { :
missing value where TRUE/FALSE needed
In addition: Warning message:
In preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam, :
There are ties in the preranked stats (0.51% of the list).
The order of those tied genes will be arbitrary, which may produce unexpected results.
I've noticed that I also apparently need pvalueCutoff=1
to get the error, but this is a legitimate value, if I want to see everything.
> sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.7
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets methods
[9] base
other attached packages:
[1] org.Dr.eg.db_3.12.0 AnnotationDbi_1.52.0 IRanges_2.24.1
[4] S4Vectors_0.28.1 Biobase_2.50.0 BiocGenerics_0.36.0
[7] clusterProfiler_3.18.0 ggplot2_3.3.3 tidyr_1.1.2
loaded via a namespace (and not attached):
[1] viridis_0.5.1 tidygraph_1.2.0 bit64_4.0.5 viridisLite_0.3.0
[5] splines_4.0.3 ggraph_2.0.4 assertthat_0.2.1 shadowtext_0.0.7
[9] DO.db_2.9 rvcheck_0.1.8 BiocManager_1.30.10 blob_1.2.1
[13] ggrepel_0.9.1 pillar_1.4.7 RSQLite_2.2.3 lattice_0.20-41
[17] glue_1.4.2 downloader_0.4 digest_0.6.27 RColorBrewer_1.1-2
[21] polyclip_1.10-0 qvalue_2.22.0 colorspace_2.0-0 cowplot_1.1.1
[25] Matrix_1.3-2 plyr_1.8.6 pkgconfig_2.0.3 purrr_0.3.4
[29] GO.db_3.12.1 scales_1.1.1 tweenr_1.0.1 enrichplot_1.10.2
[33] BiocParallel_1.24.1 ggforce_0.3.2 tibble_3.0.6 generics_0.1.0
[37] farver_2.0.3 ellipsis_0.3.1 cachem_1.0.1 withr_2.4.1
[41] cli_2.3.0 magrittr_2.0.1 crayon_1.4.0 memoise_2.0.0
[45] DOSE_3.16.0 fansi_0.4.2 MASS_7.3-53 tools_4.0.3
[49] data.table_1.13.6 lifecycle_0.2.0 stringr_1.4.0 munsell_0.5.0
[53] compiler_4.0.3 rlang_0.4.10 grid_4.0.3 rstudioapi_0.13
[57] igraph_1.2.6 labeling_0.4.2 gtable_0.3.0 DBI_1.1.1
[61] reshape2_1.4.4 graphlayouts_0.7.1 R6_2.5.0 gridExtra_2.3
[65] dplyr_1.0.3 fastmap_1.1.0 bit_4.0.4 utf8_1.1.4
[69] fastmatch_1.1-0 fgsea_1.16.0 GOSemSim_2.16.1 stringi_1.5.3
[73] Rcpp_1.0.6 vctrs_0.3.6 tidyselect_1.1.0 scatterpie_0.1.5
Thank you.
@therealgenna I looked at your data and found that there are 23119 genes in your geneList, of which 17456 have a value of 0. If you got these values through foldchange, it is likely that your expression profile has not been preprocessed, and the expression values of many genes in a certain group are all 0..
@huerqiang Yes, I have many zeros, but this is my scoring, which is based on multiple factors, including fold changes and p.adjust values in multiple comparisons. I'd assume the function would/should still work as I don't have to have ranks based on a single set of fold changes alone. Am I wrong?
You should persuade us that your data is truly valid. Otherwise the only reason I can think of is that more than half of your data are missing and GSEA is not designed for it.
@GuangchuangYu As I wrote above, I derived my scoring based on multiple pairwise comparisons, each one having fold changes and (adjusted) p-values. Either when p.adjust is 1 or reaches some high threshold (which I can choose) in multiple comparisons for a given gene, its score will be zero - this is how I decided to do that and there is nothing wrong about it and I will know how to interpret the result.
I have opened this issue simply because your function does not behave gracefully - it just throws an error, fails and no result is produced. This should not happen in a well-polished code. That's my point basically. I am just saying that if it comes to the situation when NR
is zero the function should know how not to crash. I am not claiming that my suggested fix is good, but something should be done in that case, like skipping the gene set for which NR is zero (it does not happen for all gene sets in the list of sets), returning zero enrichment score etc., but going through all gene sets provided and returning the result to the user. Wouldn't you agree that a hard crash of the gseaScores function is not the best way to handle it?
Returning NAN instead of throw error is not always 'well-polished'. In case the NAN was generated due to a wrong input, it would be better to throw error.
How do we rank a gene list for half of them are zero?
pls answer my question directly instead of teach me how to polish my code.
If you can't persuade us, we can't help.
@GuangchuangYu In case of the NAN was generated it would be better to give some informative warning (and proceed with other gene sets.)
If your question is "How do we rank a gene list for half of them are zero?" - it's the user who provides you with the ranked list, so it's user's responsibility. I still might have thousands of genes with positive and negative score and the idea of GSEA to try and pick weak signal is still valid.
One can replace zeros by tiny randomly generated values around zero; for example I can modify my scoring to give some values like that (e.g., not return 0 when padj=1; either do it randomly or deterministically) but really there would be no good way to say that some of these "zero" genes are higher ranked than the others.
If it is important for the program to know the "true" order of "zero" genes and the results would be substantially different if I permute this "zero" subset, then, first, it is not clear to me that this is necessarily the case [is it?], and if it is then it might make sense to do some permutation reorderings of this subset and get some average.
I am not sure why you are so reluctant to make some fixes to avoid the hard crash. We are talking about two separate issues here: (i) can I use a gene list with many genes having score zero, and (ii) code crashing. My focus is on (ii). You are worried about (i) but as I said I think the idea of GSEA is still valid. You seem to say no and I don't know why. But even if you stick to your opinion, for example because the actual order of the "zero" genes affects results a lot for many/all gene sets, at least giving an indication to the user of why the function fails would make sense (still, making it not fail would be preferable.)
Hi,
I am using R v4 and the latest clusterProfiler and DOSE package (DOSE version 3.16.0) and I get the following error:
As far as I can tell this comes from line 23 of
DOSE:::gseaScores
. The error happens when scores of all the genes in the considered geneSet are 0. In this case NR=0 and division by NR inPhit <- cumsum(Phit/NR)
leads to allPhit
values set to NaN. This eventually leads to the above-mentioned error.It might be unlikely to expect zero scores for some genes, but I do get them, depending on how I get these scores. One dirty way to fix it would be to simulate tiny random deviations from zero for zero scores, but I wonder if there is a cleaner way to do it.
Thank you