YuLab-SMU / GOSemSim

:golf: GO-terms Semantic Similarity Measures
https://yulab-smu.top/biomedical-knowledge-mining-book/
58 stars 26 forks source link

missing genes in output matrix #20

Closed benedettafrida closed 6 years ago

benedettafrida commented 6 years ago

Sorry for opening a new issue, but you closed it before I had the time to reply. I'm copying the previous text and my follow up below.

I'm having some issues with the output of mgeneSim (but also mclusterSim). I'm trying to make a GO similarity analysis on some proteins in an all against all fashion (which for my understanding of the documentation is done by using either mgeneSim or mclusterSim. Problem is, not all the proteins are present in the output matrix.

Here is my code:

require(GOSemSim)
require(org.Hs.eg.db)

uniprotIDs<-c("Q8NF91", "Q8WXH0", "Q63HN8", "P21817", "Q92736", "A2VEC9", "Q8WZ42", "Q2LD37", "Q8NEZ4", "P58107", "Q5CZC0")
length(uniprotIDs)

hsGO2 <- godata('org.Hs.eg.db', keytype = "UNIPROT", ont="MF")

sim_matrix<-mgeneSim(uniprotIDs,uniprotIDs, semData=hsGO2, measure="Wang", combine="BMA", verbose=FALSE)
sqrt(length(sim_matrix))

the length of uniprotIDs =11 and using this code sqrt(length(sim_matrix))=10
I've also tried using:

sim_matrix<-mgeneSim(uniprotIDs, semData=hsGO2, measure="Wang", combine="BMA", verbose=FALSE)

But in this case sqrt(length(sim_matrix))=9

To double check that the problem didn't lie in the use of UniprotIDs I also tried using GeneIDs using the following code:

geneIds<-c("23345",  "23224",  "57674",  "6261",   "6262" ,  "23145",  "7273",   "84162",  "58508",  "83481",  "401024")

hsGO <- godata('org.Hs.eg.db', ont="MF")

sim_matrix2<-mclusterSim(geneIds,geneIds, semData = hsGO, measure = "Wang",combine="BMA")
sqrt(length(sim_matrix2))

With the same results. Can you tell me if I'm doing something wrong?

To which you replied :

not all genes/proteins have GO annotation.

Yes sure, not all proteins are annotated. But everyone in the example does. The one that is left out from the matrix(Q5CZC0) has only one term but still has it: GO:0005739.

I also checked with other proteins with a single GO term, but the error is not reproducible. Here's the full list of protein I was testing my pipeline:

Q8NF91" "Q8WXH0" "Q63HN8" "P21817" "Q92736" "A2VEC9" "Q8WZ42" "Q2LD37" "Q8NEZ4" "P58107" "Q5CZC0" "Q6V0I7" "Q9Y6R7" "Q96RW7" "Q8NDA2" "Q4G0P3" "Q8WXG9" "Q9UPN3" "O14686" "Q03001" "P0DPF2" "Q9NU22" "Q5VST9" "Q9Y6V0" "P98088" "Q9HC84" "Q8WXI7" "Q7Z5P9" "Q02817" "Q9UKN1" "P20929" "O75445" "Q5T4S7" "Q8IVF2" "Q09666" "Q86UQ4"

They all have GO terms. Only Q5CZC0 Q9Y6R7 and P0DPF2 has a single GO entry.

by using all of them (36) as in sim_matrix<-mgeneSim(b,b, semData=hsGO2, measure="Wang", combine="BMA", verbose=FALSE) where b is the the full list above, I get a 30by30 matrix.

GuangchuangYu commented 6 years ago

https://github.com/GuangchuangYu/GOSemSim/issues/19