YuLab-SMU / GOSemSim

:golf: GO-terms Semantic Similarity Measures
https://yulab-smu.top/biomedical-knowledge-mining-book/
58 stars 26 forks source link

not all genes/proteins will show up in the similarity matrix #19

Closed benedettafrida closed 6 years ago

benedettafrida commented 6 years ago

Hi, I'm having some issues with the output of mgeneSim (but also mclusterSim). I'm trying to make a GO similarity analysis on some proteins in an all against all fashion (which for my understanding of the documentation is done by using either mgeneSim or mclusterSim. Problem is, not all the proteins are present in the output matrix.

Here is my code:

require(GOSemSim)
require(org.Hs.eg.db)

uniprotIDs<-c("Q8NF91", "Q8WXH0", "Q63HN8", "P21817", "Q92736", "A2VEC9", "Q8WZ42", "Q2LD37", "Q8NEZ4", "P58107", "Q5CZC0")
length(uniprotIDs)

hsGO2 <- godata('org.Hs.eg.db', keytype = "UNIPROT", ont="MF")

sim_matrix<-mgeneSim(uniprotIDs,uniprotIDs, semData=hsGO2, measure="Wang", combine="BMA", verbose=FALSE)
sqrt(length(sim_matrix))

the length of uniprotIDs =11 and using this code sqrt(length(sim_matrix))=10
I've also tried using:

sim_matrix<-mgeneSim(uniprotIDs, semData=hsGO2, measure="Wang", combine="BMA", verbose=FALSE)

But in this case sqrt(length(sim_matrix))=9

To double check that the problem didn't lie in the use of UniprotIDs I also tried using GeneIDs using the following code:

geneIds<-c("23345",  "23224",  "57674",  "6261",   "6262" ,  "23145",  "7273",   "84162",  "58508",  "83481",  "401024")

hsGO <- godata('org.Hs.eg.db', ont="MF")

sim_matrix2<-mclusterSim(geneIds,geneIds, semData = hsGO, measure = "Wang",combine="BMA")
sqrt(length(sim_matrix2))

With the same results. Can you tell me if I'm doing something wrong?

GuangchuangYu commented 6 years ago

not all genes/proteins have GO annotation.

benedettafrida commented 6 years ago

Hi, thanks for responding so promptly.

Yes sure, not all proteins are annotated. But everyone in the example does. The one that is left out from the matrix(Q5CZC0) has only one term but still has it: GO:0005739.

I also checked with other proteins with a single GO term, but the error is not reproducible. Here's the full list of protein I was testing my pipeline:

Q8NF91" "Q8WXH0" "Q63HN8" "P21817" "Q92736" "A2VEC9" "Q8WZ42" "Q2LD37" "Q8NEZ4" "P58107" "Q5CZC0" "Q6V0I7" "Q9Y6R7" "Q96RW7" "Q8NDA2" "Q4G0P3" "Q8WXG9" "Q9UPN3" "O14686" "Q03001" "P0DPF2" "Q9NU22" "Q5VST9" "Q9Y6V0" "P98088" "Q9HC84" "Q8WXI7" "Q7Z5P9" "Q02817" "Q9UKN1" "P20929" "O75445" "Q5T4S7" "Q8IVF2" "Q09666" "Q86UQ4"

They all have GO terms. Only Q5CZC0 Q9Y6R7 and P0DPF2 has a single GO entry.

by using all of them (36) as in sim_matrix<-mgeneSim(b,b, semData=hsGO2, measure="Wang", combine="BMA", verbose=FALSE) where b is the the full list above, I get a 30by30 matrix.

GuangchuangYu commented 6 years ago

for Q5CZC0 Q9Y6R7 and P0DPF2, only the second one has 1 GO term annotated.

> geneSim(gene[1], gene[1], hsGO2)
[1] NA
> geneSim(gene[2], gene[2], hsGO2)
$geneSim
[1] 1

$GO1
[1] "GO:0005515"

$GO2
[1] "GO:0005515"

> geneSim(gene[3], gene[3], hsGO2)
[1] NA
> gene
[1] "Q5CZC0" "Q9Y6R7" "P0DPF2"
benedettafrida commented 6 years ago

I see. Then there's a discrepancy between what is retrieved from UniProtKB/SwissProt and method you're using for fetching GO terms.

Here's what I downloaded from Uniprot. fromUniprot.txt

Is there a way of making multiple genes comparisons directly by inputting the GO terms? In the documentation, I only saw mgoSim which allows pairwise comparison (if I understood correctly).

GuangchuangYu commented 6 years ago

Yes, you can use mgoSim