YuLab-SMU / clusterProfiler

A universal enrichment tool for interpreting omics data
https://yulab-smu.top/biomedical-knowledge-mining-book/

gseKEGG function running indefinitely with a gene list of over 3000 genes #709

Open Sophia409 opened 1 month ago

Sophia409 commented 1 month ago

Hello,

I am encountering an issue while using the gseKEGG function from the clusterProfiler package for GSEA enrichment analysis. I have provided a gene list containing just over 3000 genes, but the function has been running for two hours without completing. I manually stopped the process and attempted to modify the function parameters, but after several tries, the function still hangs.

Here are the details of my setup:

clusterProfiler version: 4.12.0 (latest version)

I would appreciate any insights into why this might be happening and how I can resolve this issue. Thank you for your help!

Best regards,

> genelist <- genelist[names(genelist) %in% entrez[,1]]
> names(genelist) <- entrez[match(names(genelist),entrez[,1]),2]
> genelist <- sort(genelist, decreasing = TRUE) # sort by log2FC, highest first
> length(genelist)
[1] 3786
> head(genelist)
  20304   20306   20296   16175   14825  117167 
1112.53 1059.17 1018.66  651.99  608.66  603.55 
> # 2) GSEA enrichment based on KEGG gene sets
> set.seed(123)
> KEGG_ges <- gseKEGG(
+   geneList = genelist,
+   organism = "mmu",
+   minGSSize = 10,
+   maxGSSize = 500,
+   pvalueCutoff = 0.05,
+   pAdjustMethod = "BH",
+   verbose = FALSE,
+   eps = 0)
Reading KEGG annotation online: "https://rest.kegg.jp/link/mmu/pathway"...
Reading KEGG annotation online: "https://rest.kegg.jp/list/pathway/mmu"...
Warning messages:
1: In preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam,  :
  There are ties in the preranked stats (80.9% of the list).
The order of those tied genes will be arbitrary, which may produce unexpected results.
2: In fgseaMultilevel(pathways = pathways, stats = stats, minSize = minSize,  :
  There were 1 pathways for which P-values were not calculated properly due to unbalanced (positive and negative) gene-level statistic values. For such pathways pval, padj, NES, log2err are set to NA. You can try to increase the value of the argument nPermSimple (for example set it nPermSimple = 10000)
> 
> KEGG_ges <- gseKEGG(
+   geneList = genelist,
+   organism = "mmu")
preparing geneSet collections...
GSEA analysis...
Warning messages:
1: In preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam,  :
  There are ties in the preranked stats (80.9% of the list).
The order of those tied genes will be arbitrary, which may produce unexpected results.
2: In fgseaMultilevel(pathways = pathways, stats = stats, minSize = minSize,  :
  There were 1 pathways for which P-values were not calculated properly due to unbalanced (positive and negative) gene-level statistic values. For such pathways pval, padj, NES, log2err are set to NA. You can try to increase the value of the argument nPermSimple (for example set it nPermSimple = 10000)
> KEGG_ges <- gseKEGG(
+   geneList = genelist,
+   organism = "mmu",
+   nPermSimple = 10000)
preparing geneSet collections...
GSEA analysis...

guidohooiveld commented 1 month ago

Did you carefully read the messages that were returned?

This is the key remark: "There are ties in the preranked stats (80.9% of the list)."

In other words, about 81% of the genes in your input share their ranking metric with another gene! Why? This cannot be correct...

Anyway, this results in behavior reported before: https://github.com/ctlab/fgsea/issues/151
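
To see how many ties a ranked list contains before passing it to gseKEGG, a quick base-R check along these lines may help. This is only a sketch: `genelist_toy` and its values are made up for illustration and stand in for your own named, sorted `genelist`, and the fraction computed here is a rough analogue of the percentage in the warning, which fgsea may calculate somewhat differently.

```r
# Hypothetical toy data standing in for the real `genelist`
# (a named numeric vector, sorted in decreasing order).
genelist_toy <- c(5.2, 3.1, 3.1, 3.1, 0.7, 0.7, -1.4, -2.0, -2.0, -2.0)
names(genelist_toy) <- paste0("gene", seq_along(genelist_toy))
genelist_toy <- sort(genelist_toy, decreasing = TRUE)

# Fraction of entries whose value repeats an earlier entry.
# Assumption: this only approximates the figure reported in the warning.
tie_fraction <- sum(duplicated(genelist_toy)) / length(genelist_toy)

# Number of distinct ranking values; a small number relative to
# length(genelist_toy) means the ranking carries little information.
n_distinct <- length(unique(genelist_toy))

# The most frequently repeated values, to see where the ties come from.
tie_counts <- sort(table(genelist_toy), decreasing = TRUE)

print(tie_fraction)
print(n_distinct)
print(head(tie_counts))
```

If a handful of values account for most of the list, it is worth checking how the ranking metric was computed (for example, whether it was rounded or taken from the wrong column) before running the enrichment again.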