YuLab-SMU / clusterProfiler

:bar_chart: A universal enrichment tool for interpreting omics data
https://yulab-smu.top/biomedical-knowledge-mining-book/
1.03k stars 256 forks source link

A strange non-given GO comment appears #697

Open Chewxway opened 5 months ago

Chewxway commented 5 months ago

When clusterProfiler was enriched, the genes in the R package did not have this GO annotation. After enrichment, this gene appeared in the GO entry in the result. I want to know how the corresponding relationship between this GO and gene was obtained. For example, the mapping between GO:0045332 and gene-CORT_0A00350 does not exist in the database, right? But it did appear in result: GO:0045332 gene-CORT_0E06570/gene-CORT_0B05830/gene-CORT_0B05840/gene-CORT_0E06580/gene-CORT_0C01240/gene-CORT_0A00350

guidohooiveld commented 5 months ago

What is the outcome of: select(org.Corthopsilosis.eg.db, keys = gene_id, columns = "GOALL", keytype = "GID")

Thus GOALL versusGO.

Does GO:0045332 then appear?

Chewxway commented 5 months ago

Hope to get your answer: select(org.Corthopsilosis.eg.db, keys = gene_id, columns = "GOALL", keytype = "GID") The result is

GID GO 1 gene-CORT_0A00350 GO:0000138 2 gene-CORT_0A00350 GO:0000322 3 gene-CORT_0A00350 GO:0000323 4 gene-CORT_0A00350 GO:0000324 5 gene-CORT_0A00350 GO:0001558 6 gene-CORT_0A00350 GO:0003674 7 gene-CORT_0A00350 GO:0005215 8 gene-CORT_0A00350 GO:0005575 9 gene-CORT_0A00350 GO:0005622 10 gene-CORT_0A00350 GO:0005737 11 gene-CORT_0A00350 GO:0005773 12 gene-CORT_0A00350 GO:0005777 13 gene-CORT_0A00350 GO:0005794 14 gene-CORT_0A00350 GO:0005795 15 gene-CORT_0A00350 GO:0005886 16 gene-CORT_0A00350 GO:0005887 17 gene-CORT_0A00350 GO:0006810 18 gene-CORT_0A00350 GO:0006857 19 gene-CORT_0A00350 GO:0006996 20 gene-CORT_0A00350 GO:0007033 21 gene-CORT_0A00350 GO:0008150 22 gene-CORT_0A00350 GO:0009987 23 gene-CORT_0A00350 GO:0012505 24 gene-CORT_0A00350 GO:0015833 25 gene-CORT_0A00350 GO:0016020 26 gene-CORT_0A00350 GO:0016021 27 gene-CORT_0A00350 GO:0016043 28 gene-CORT_0A00350 GO:0019725 29 gene-CORT_0A00350 GO:0022603 30 gene-CORT_0A00350 GO:0022604 31 gene-CORT_0A00350 GO:0022857 32 gene-CORT_0A00350 GO:0030307 33 gene-CORT_0A00350 GO:0031224 34 gene-CORT_0A00350 GO:0031226 35 gene-CORT_0A00350 GO:0031984 36 gene-CORT_0A00350 GO:0031985 37 gene-CORT_0A00350 GO:0032368 38 gene-CORT_0A00350 GO:0032879 39 gene-CORT_0A00350 GO:0033043 40 gene-CORT_0A00350 GO:0035672 41 gene-CORT_0A00350 GO:0035673 42 gene-CORT_0A00350 GO:0040008 43 gene-CORT_0A00350 GO:0042144 44 gene-CORT_0A00350 GO:0042579 45 gene-CORT_0A00350 GO:0042592 46 gene-CORT_0A00350 GO:0042886 47 gene-CORT_0A00350 GO:0042887 48 gene-CORT_0A00350 GO:0043226 49 gene-CORT_0A00350 GO:0043227 50 gene-CORT_0A00350 GO:0043229 51 gene-CORT_0A00350 GO:0043231 52 gene-CORT_0A00350 GO:0043269 53 gene-CORT_0A00350 GO:0044070 54 gene-CORT_0A00350 GO:0044088 55 gene-CORT_0A00350 GO:0045454 56 gene-CORT_0A00350 GO:0045927 57 gene-CORT_0A00350 GO:0048284 58 gene-CORT_0A00350 GO:0048518 59 gene-CORT_0A00350 GO:0048522 60 gene-CORT_0A00350 GO:0048638 61 gene-CORT_0A00350 GO:0048639 62 gene-CORT_0A00350 GO:0050789 63 gene-CORT_0A00350 GO:0050793 64 gene-CORT_0A00350 GO:0050794 65 gene-CORT_0A00350 GO:0051049 66 gene-CORT_0A00350 GO:0051094 67 gene-CORT_0A00350 GO:0051128 68 gene-CORT_0A00350 GO:0051130 69 gene-CORT_0A00350 GO:0051179 70 gene-CORT_0A00350 GO:0051234 71 gene-CORT_0A00350 GO:0051510 72 gene-CORT_0A00350 GO:0051512 73 gene-CORT_0A00350 GO:0051513 74 gene-CORT_0A00350 GO:0051515 75 gene-CORT_0A00350 GO:0055085 76 gene-CORT_0A00350 GO:0061024 77 gene-CORT_0A00350 GO:0061025 78 gene-CORT_0A00350 GO:0061091 79 gene-CORT_0A00350 GO:0065007 80 gene-CORT_0A00350 GO:0065008 81 gene-CORT_0A00350 GO:0071702 82 gene-CORT_0A00350 GO:0071705 83 gene-CORT_0A00350 GO:0071840 84 gene-CORT_0A00350 GO:0071944 85 gene-CORT_0A00350 GO:0097035 86 gene-CORT_0A00350 GO:0097576 87 gene-CORT_0A00350 GO:0098791 88 gene-CORT_0A00350 GO:1904680 89 gene-CORT_0A00350 GO:1905952 90 gene-CORT_0A00350 GO:2001138

Therefore, we can determine that the correspondence between gene-CORT_0A00350 and GO:0045332 is not in the database correspondence. In order to ensure the accuracy and interpretability of the results, we need to know how this correspondence is generated, thank you!

We now provide background and genename files for enrichment !

genes.txt Gene2GOs.map.txt

Chewxway commented 5 months ago

org.Corthopsilosis.eg.db was build by these code:

    library(AnnotationForge)
    rm(list=ls())
    setwd(".")
    geneID2GO <- read.table(file = 'Gene2GOs.map.txt', header = FALSE)
    colnames(geneID2GO) <- c("GID", "GO")
    fSym <- transform(geneID2GO, SYMBOL = GID, GENENAME = GID)[, c("GID","SYMBOL","GENENAME")]
    fSym <- fSym[fSym[,2]!="-",]
    fSym <- fSym[fSym[,3]!="-",]
    fSym <- fSym[!duplicated(fSym),]
    dim(fSym)
    fGO <- transform(geneID2GO, EVIDENCE=rep("", length(GID)))
    fGO <- fGO[!duplicated(fGO),]
    dim(fGO)
    makeOrgPackage(gene_info=fSym, go=fGO,
                   version="0.2",
                   maintainer="Chewxway <Chewxway@163.com>",   
                   author="Chewxway <Chewxway@163.com>",
                   outputDir = ".",
                   tax_id="0000",
                   genus="Candida",
                   species="orthopsilosis",
                   goTable="go")

    install.packages("./org.Corthopsilosis.eg.db", type="source", repos=NULL)
guidohooiveld commented 5 months ago

I generated the OrgDb according to your input file and code.

After having a quick look I noticed that I do get a different number of GO-terms annotated to gene-CORT_0A00350 for both go and goall.... ?? GO: 86 (me) vs 90 (you) GOALL: 114 (me) vs 90 (again, you)

Also, the file genes.txt is empty / not available. Could you please reupload?

> dim( select(org.Corthopsilosis.eg.db, keys = c("gene-CORT_0A00350"), columns = "GO", keytype = "GID") )
'select()' returned 1:many mapping between keys and columns
[1] 86  2
> 
> dim( select(org.Corthopsilosis.eg.db, keys = c("gene-CORT_0A00350"), columns = "GOALL", keytype = "GID") )
'select()' returned 1:many mapping between keys and columns
[1] 114   2
> 
Chewxway commented 5 months ago

Dear guidohooiveld,

Thank you for your attention to detail; indeed, I didn't notice the inconsistency in the quantity relationships after building the data. Although the inconsistency in these relationships post-database construction is also a problem, I personally find it understandable if some GO terms were discarded due to missing descriptive information. In fact, in the map file, the gene gene-CORT_0A00350 actually has 99 annotations (refer to PUMC-chewxway_2024-06-20_8_59_52.log). Naturally, we hope to achieve completely consistent results, but under the current circumstances, we are more interested in resolving the seemingly fabricated correspondence issue. This time, I will upload the relevant files, including the log information (PUMC-chewxway_2024-06-20_8_59_52.log), packaged to ensure the upload process goes smoothly, hoping this helps you reproduce the problem.

Please note that this time, we have also packaged the constructed database. You can use the following command to extract it:

tar zxvf org.Corthopsilosis.eg.db.tar.gz

After extracting, you should be able to install it directly in R using:

install.packages("./org.Corthopsilosis.eg.db", type="source", repos=NULL) Following these steps, you should obtain a consistent database.

Here is the compressed file containing all the files. toguidohooiveld.tar.gz toguidohooiveld.tar.gz.md5.txt

With best wishes

Chewxway commented 5 months ago

I generated the OrgDb according to your input file and code.

After having a quick look I noticed that I do get a different number of GO-terms annotated to gene-CORT_0A00350 for both go and goall.... ?? GO: 86 (me) vs 90 (you) GOALL: 114 (me) vs 90 (again, you)

Also, the file genes.txt is empty / not available. Could you please reupload?

> dim( select(org.Corthopsilosis.eg.db, keys = c("gene-CORT_0A00350"), columns = "GO", keytype = "GID") )
'select()' returned 1:many mapping between keys and columns
[1] 86  2
> 
> dim( select(org.Corthopsilosis.eg.db, keys = c("gene-CORT_0A00350"), columns = "GOALL", keytype = "GID") )
'select()' returned 1:many mapping between keys and columns
[1] 114   2
> 

Dear guidohooiveld,

I just rechecked the GOALL results:

dim( select(org.Corthopsilosis.eg.db, keys = c("gene-CORT_0A00350"), columns = "GO", keytype = "GID") ) 'select()' returned 1:many mapping between keys and columns [1] 90 2 dim( select(org.Corthopsilosis.eg.db, keys = c("gene-CORT_0A00350"), columns = "GOALL", keytype = "GID") ) 'select()' returned 1:many mapping between keys and columns [1] 110 2

Now I find that there is a relationship between GO:0045332 and gene-CORT_0A00350 in the GOALL result, but this relationship does not exist in the map file I provided (Only 99 GO associations exist in the map file).

Can I assume that this is an error in the process of building the database? Or maybe this correspondence is correct, but I think we need to get evidence of this new correspondence.

If the evidence is insufficient, then I will want to use the function enrichGO () to compute only the word "GO" and not "GOALL". Is that possible?

With best wishes

guidohooiveld commented 5 months ago

Sorry for my delayed reply.

** The fact that some GO terms are not in the OrgDb is because under the hood the OrgDb is populated by GO terms that should be present in Bioconductor's GO.db annotation package. If an input GO terms is not present in that database, it will be discarded. See: https://github.com/Bioconductor/AnnotationForge/blob/007e1ad51972462742b99868cc68e5cad598a5e2/R/makeOrgPackage.R#L100-L102

** The building of the database is correct, because it takes into account the parent-child relationships of the GO annotation. Recall that "... The structure of GO can be described in terms of a graph, where each GO term is a node, and the relationships between the terms are edges between the nodes. GO is loosely hierarchical, with ‘child’ terms being more specialized than their ‘parent’ terms, but unlike a strict hierarchy, a term may have more than one parent term". Source.

** You cannot use the function enrichGO with only the direct GO (GO) annotations; the function only runs taking into account the direct annotation as well as its ancestors (GOALL). See: https://github.com/YuLab-SMU/clusterProfiler/blob/335235254697dee30e17ac1be1a2d0dd824f46ca/R/enrichGO.R#L160-L162

AFAIK it is common practice to perform GO-based overrepresentation analysis including the ancestors (also called parent-child approach) because using only direct annotation (also called term-for-term approach) result in flawed outcomes. See e.g. https://doi.org/10.1093/bioinformatics/btm440

Chewxway commented 5 months ago

Dear guidohooiveld,

We agree with the approach of performing GO-based overrepresentation analysis using the parent-child method rather than relying solely on direct annotations. This holistic approach, as highlighted in your references, mitigates potential biases and provides more robust outcomes.

resolved #697

Thank you for your time and valuable insights. Best regards, Department of Laboratory Medicine Team, PUMCH