YuLab-SMU / clusterProfiler

A universal enrichment tool for interpreting omics data
https://yulab-smu.top/biomedical-knowledge-mining-book/

gson format fails in the latest version of enrichKEGG #685

Open Shuixin-Li opened 2 months ago

Shuixin-Li commented 2 months ago

I have downloaded and installed the GitHub versions of clusterProfiler, DOSE, and HDO.db.

When running enrichKEGG, I get the following error:

> kk <- gson_KEGG('mmu')
Reading KEGG annotation online: "https://rest.kegg.jp/link/mmu/pathway"...
Reading KEGG annotation online: "https://rest.kegg.jp/list/pathway/mmu"...
>   KEGG_enrich = enrichKEGG(gene = id_transform[,1],
+                            organism=kk,
+                            use_internal_data = TRUE # the error occurs with or without this line
+ )

Error in (function (cl, name, valueClass)  : 
  assignment of an object of class “NULL” is not valid for @‘keytype’ in an object of class “enrichResult”; is(value, "character") is not TRUE

# My input gene list looks like this
#         id_transform
#SETBP1         240427
#CITED1          12705
#RIMS4          241770
#MUC12       102633301
#SMO            319757
#CALCB          116903
#TMEM158         72309

Going back to inspect the object:

> kk@keytype
NULL

In principle, the gson object created by gson_KEGG should carry the keytype ENTREZID (?)

> gson_KEGG
function (species, KEGG_Type = "KEGG", keyType = "kegg") 
{
    x <- download_KEGG(species, KEGG_Type, keyType)
    gsid2gene <- setNames(x[[1]], c("gsid", "gene"))
    gsid2name <- setNames(x[[2]], c("gsid", "name"))
    version <- kegg_release(species)
    gson(gsid2gene = gsid2gene, gsid2name = gsid2name, species = species, 
        gsname = "KEGG", version = version, accessed_date = as.character(Sys.Date(), 
            keytype = "ENTREZID"))
}
<bytecode: 0x1f8dcc08>
<environment: namespace:clusterProfiler>

Could the author explain why this happens?

guidohooiveld commented 2 months ago

Hi, a couple of things:

First of all, why do you generate a GSON object with all mouse pathways?

Related to this, please check the help pages on how to call the enrichKEGG function, because you made some mistakes. Note that the argument organism should be the KEGG abbreviation of the organism you are analyzing; in your case thus mmu (and it should NOT be the GSON object!)

The argument gene should be a (character) vector of entrezids.

It is also recommended to leave the argument use_internal_data at its default setting FALSE (so up-to-date information is being downloaded from the KEGG website).

Thus the code below, in which the 7 ids are used that you listed, will do what you intended to do!

> library(clusterProfiler)
> 
> id_transform <- c("240427","12705","241770","102633301","319757","116903","72309")
> class(id_transform)
[1] "character"
> 
> KEGG_enrich = enrichKEGG(gene = id_transform,
+                          organism="mmu",
+                          use_internal_data = FALSE
+  )
> 
> 
> KEGG_enrich
#
# over-representation test
#
#...@organism    mmu 
#...@ontology    KEGG 
#...@keytype     kegg 
#...@gene        chr [1:7] "240427" "12705" "241770" "102633301" "319757" "116903" "72309"
#...pvalues adjusted by 'BH' with cutoff <0.05 
#...5 enriched terms found
'data.frame':   5 obs. of  11 variables:
 $ category   : chr  "Environmental Information Processing" "Human Diseases" "Organismal Systems" "Organismal Systems" ...
 $ subcategory: chr  "Signal transduction" "Cancer: specific types" "Circulatory system" "Development and regeneration" ...
 $ ID         : chr  "mmu04340" "mmu05217" "mmu04270" "mmu04360" ...
 $ Description: chr  "Hedgehog signaling pathway - Mus musculus (house mouse)" "Basal cell carcinoma - Mus musculus (house mouse)" "Vascular smooth muscle contraction - Mus musculus (house mouse)" "Axon guidance - Mus musculus (house mouse)" ...
 $ GeneRatio  : chr  "1/2" "1/2" "1/2" "1/2" ...
 $ BgRatio    : chr  "58/9710" "63/9710" "144/9710" "181/9710" ...
 $ pvalue     : num  0.0119 0.0129 0.0294 0.0369 0.0416
 $ p.adjust   : num  0.0388 0.0388 0.0499 0.0499 0.0499
 $ qvalue     : num  0.00681 0.00681 0.00875 0.00875 0.00875
 $ geneID     : chr  "319757" "319757" "116903" "319757" ...
 $ Count      : int  1 1 1 1 1
#...Citation
 T Wu, E Hu, S Xu, M Chen, P Guo, Z Dai, T Feng, L Zhou, W Tang, L Zhan, X Fu, S Liu, X Bo, and G Yu.
 clusterProfiler 4.0: A universal enrichment tool for interpreting omics data.
 The Innovation. 2021, 2(3):100141 

> 
> as.data.frame(KEGG_enrich)[1:3,]
                                     category            subcategory       ID
mmu04340 Environmental Information Processing    Signal transduction mmu04340
mmu05217                       Human Diseases Cancer: specific types mmu05217
mmu04270                   Organismal Systems     Circulatory system mmu04270
                                                             Description
mmu04340         Hedgehog signaling pathway - Mus musculus (house mouse)
mmu05217               Basal cell carcinoma - Mus musculus (house mouse)
mmu04270 Vascular smooth muscle contraction - Mus musculus (house mouse)
         GeneRatio  BgRatio     pvalue   p.adjust      qvalue geneID Count
mmu04340       1/2  58/9710 0.01191138 0.03880464 0.006807832 319757     1
mmu05217       1/2  63/9710 0.01293488 0.03880464 0.006807832 319757     1
mmu04270       1/2 144/9710 0.02944172 0.04989512 0.008753530 116903     1
> 
Shuixin-Li commented 2 months ago

Thank you for your detailed reply, and sorry for not writing in English before. Do you know why enrichKEGG does not support a gson object? I am confused because I saw the code below. @guidohooiveld https://github.com/YuLab-SMU/clusterProfiler/blob/2ab30a92f1791dce75f71ea29b71c33fc443d4a0/R/enrichKEGG.R#L45-L58

I added gson_file@keytype <- 'ENTREZID' before running enrichKEGG(), and the error disappeared. But I am not sure whether the results are correct by doing this.


Regarding the input data, sorry for the confusion: I printed id_transform earlier, but what I actually passed to enrichKEGG was id_transform[,1], which is a character vector. Thank you for pointing that out.

> head(id_transform)
        id_transform
SP140         434484
SPATA32       328019
SAMD15        238333
FER1L6        631797
RERGL         632971
PHEX           18675
> head(id_transform[,1])
[1] "434484" "328019" "238333" "631797" "632971" "18675" 
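
The point about the input format can be checked in base R alone. The sketch below is only an illustration: the data frame is a hypothetical stand-in built from three of the IDs printed above, not the poster's real object. It shows that indexing a single column with `[, 1]` returns a plain, unnamed character vector, which is the form described earlier in the thread for the gene argument.

```r
# Hypothetical stand-in for id_transform: one character column,
# with gene symbols as row names (values copied from the printout above)
id_transform <- data.frame(
  id_transform = c("434484", "328019", "238333"),
  row.names = c("SP140", "SPATA32", "SAMD15"),
  stringsAsFactors = FALSE
)

# Selecting a single column with [, 1] drops the data frame structure
gene_ids <- id_transform[, 1]

is.data.frame(id_transform)  # the full object is a data frame
is.character(gene_ids)       # the extracted column is a character vector
is.null(names(gene_ids))     # the row names (gene symbols) are not carried over
```

Passing the whole data frame instead of the extracted column would hand enrichKEGG a different kind of object, so checking the input with is.character() first is a cheap safeguard.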
guidohooiveld commented 2 months ago

Sorry for my delayed reply!

Thanks for highlighting the relevant section in the source code of enrichKEGG. I now understand what you were trying to achieve, and I agree with you that the keytype slot of the GSON object kk is somehow left empty (NULL).

Indeed, when manually adding it (like you did) enrichKEGG works as expected. See code below.

> ## load library
> library(clusterProfiler)
> 
> ## some ids
> id_transform <- c("240427","12705","241770","102633301","319757","116903","72309")
> 
> ## generate GSON-object with pathway information
> kk <- gson_KEGG('mmu')
> 
> ## use GSON as input: FAILS!
> KEGG_enrich = enrichKEGG(gene = id_transform,
+                          organism=kk,
+                          use_internal_data = FALSE)
Error in (function (cl, name, valueClass)  : 
  assignment of an object of class “NULL” is not valid for @‘keytype’ in an object of class “enrichResult”; is(value, "character") is not TRUE
> 
> 
> ## check GSON-object
> kk
>> Gene Set: KEGG
>> 9710 genes annotated by 355 gene sets.
>> Species: mmu
>> Version: Release 110.0+/04-27, Apr 24
> 
> ## note that slot keytype is NULL!
> str(kk)
Formal class 'GSON' [package "gson"] with 9 slots
  ..@ gsid2gene    :'data.frame':       38640 obs. of  2 variables:
  .. ..$ gsid: chr [1:38640] "mmu00010" "mmu00010" "mmu00010" "mmu00010" ...
  .. ..$ gene: chr [1:38640] "103988" "106557" "110695" "11522" ...
  ..@ gsid2name    :'data.frame':       355 obs. of  2 variables:
  .. ..$ gsid: chr [1:355] "mmu01100" "mmu01200" "mmu01210" "mmu01212" ...
  .. ..$ name: chr [1:355] "Metabolic pathways - Mus musculus (house mouse)" "Carbon metabolism - Mus musculus (house mouse)" "2-Oxocarboxylic acid metabolism - Mus musculus (house mouse)" "Fatty acid metabolism - Mus musculus (house mouse)" ...
  ..@ gene2name    : NULL
  ..@ species      : chr "mmu"
  ..@ gsname       : chr "KEGG"
  ..@ version      : chr "Release 110.0+/04-27, Apr 24"
  ..@ accessed_date: chr "2024-04-30"
  ..@ keytype      : NULL
  ..@ info         : NULL
> 
> ## Fix, and check
> kk@keytype="kegg"
> 
> str(kk)
Formal class 'GSON' [package "gson"] with 9 slots
  ..@ gsid2gene    :'data.frame':       38640 obs. of  2 variables:
  .. ..$ gsid: chr [1:38640] "mmu00010" "mmu00010" "mmu00010" "mmu00010" ...
  .. ..$ gene: chr [1:38640] "103988" "106557" "110695" "11522" ...
  ..@ gsid2name    :'data.frame':       355 obs. of  2 variables:
  .. ..$ gsid: chr [1:355] "mmu01100" "mmu01200" "mmu01210" "mmu01212" ...
  .. ..$ name: chr [1:355] "Metabolic pathways - Mus musculus (house mouse)" "Carbon metabolism - Mus musculus (house mouse)" "2-Oxocarboxylic acid metabolism - Mus musculus (house mouse)" "Fatty acid metabolism - Mus musculus (house mouse)" ...
  ..@ gene2name    : NULL
  ..@ species      : chr "mmu"
  ..@ gsname       : chr "KEGG"
  ..@ version      : chr "Release 110.0+/04-27, Apr 24"
  ..@ accessed_date: chr "2024-04-30"
  ..@ keytype      : chr "kegg"
  ..@ info         : NULL
> 
> 
> ## enrichKEGG now works!
> KEGG_enrich = enrichKEGG(gene = id_transform,
+                          organism=kk,
+                          use_internal_data = FALSE)
> 
> KEGG_enrich
#
# over-representation test
#
#...@organism    mmu 
#...@ontology    KEGG 
#...@keytype     kegg 
#...@gene        chr [1:7] "240427" "12705" "241770" "102633301" "319757" "116903" "72309"
#...pvalues adjusted by 'BH' with cutoff <0.05 
#...5 enriched terms found
'data.frame':   5 obs. of  11 variables:
 $ category   : chr  "Environmental Information Processing" "Human Diseases" "Organismal Systems" "Organismal Systems" ...
 $ subcategory: chr  "Signal transduction" "Cancer: specific types" "Circulatory system" "Development and regeneration" ...
 $ ID         : chr  "mmu04340" "mmu05217" "mmu04270" "mmu04360" ...
 $ Description: chr  "Hedgehog signaling pathway - Mus musculus (house mouse)" "Basal cell carcinoma - Mus musculus (house mouse)" "Vascular smooth muscle contraction - Mus musculus (house mouse)" "Axon guidance - Mus musculus (house mouse)" ...
 $ GeneRatio  : chr  "1/2" "1/2" "1/2" "1/2" ...
 $ BgRatio    : chr  "58/9710" "63/9710" "144/9710" "181/9710" ...
 $ pvalue     : num  0.0119 0.0129 0.0294 0.0369 0.0416
 $ p.adjust   : num  0.0388 0.0388 0.0499 0.0499 0.0499
 $ qvalue     : num  0.00681 0.00681 0.00875 0.00875 0.00875
 $ geneID     : chr  "319757" "319757" "116903" "319757" ...
 $ Count      : int  1 1 1 1 1
#...Citation
 T Wu, E Hu, S Xu, M Chen, P Guo, Z Dai, T Feng, L Zhou, W Tang, L Zhan, X Fu, S Liu, X Bo, and G Yu.
 clusterProfiler 4.0: A universal enrichment tool for interpreting omics data.
 The Innovation. 2021, 2(3):100141 

> 
guidohooiveld commented 2 months ago

As you can see above, I opened an issue on the GitHub repository of the gson package: https://github.com/YuLab-SMU/gson/issues/9
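
A possible explanation for the NULL keytype, judging only from the gson_KEGG source printed in the first comment: the closing parenthesis of as.character() appears to sit after keytype = "ENTREZID", so that argument would be passed to as.character() instead of to gson(). The base R sketch below reproduces this pattern with a toy constructor. make_toy_gson is a hypothetical stand-in written for this illustration and is not part of the gson package.

```r
# Hypothetical stand-in for a constructor with an optional keytype argument;
# this is NOT the real gson() function, only an illustration of the pattern
make_toy_gson <- function(species, accessed_date = NULL, keytype = NULL) {
  list(species = species, accessed_date = accessed_date, keytype = keytype)
}

# Pattern as printed in the thread: keytype ends up inside as.character(),
# where it is silently ignored, so the constructor never receives it
buggy <- make_toy_gson(
  species = "mmu",
  accessed_date = as.character(Sys.Date(), keytype = "ENTREZID")
)

# Same call with the parenthesis closed after Sys.Date():
# keytype is now an argument of the constructor itself
fixed <- make_toy_gson(
  species = "mmu",
  accessed_date = as.character(Sys.Date()),
  keytype = "ENTREZID"
)

is.null(buggy$keytype)  # keytype was never set
fixed$keytype           # keytype reached the constructor
```

If this reading is right, it would also explain why accessed_date is filled in correctly in the str(kk) output while keytype stays NULL. Whether the intended value is "ENTREZID" or "kegg" is a question for the package authors.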