YuLab-SMU / GOSemSim

:golf: GO-terms Semantic Similarity Measures
https://guangchuangyu.github.io/software/GOSemSim

results from function `mgoSim` based on "Wang" method #30

Open DeepColin opened 3 years ago

DeepColin commented 3 years ago

I have found that the ontology used to build hsGO (BP, MF, or CC) has no effect on the similarity results. R code below:

library(GOSemSim)

go1 <- c("GO:0000005","GO:0000007") # MF
go2 <- c("GO:0005385", "GO:0004553") # MF
go3 <- c("GO:0000017", "GO:0000014") # BP + MF

hsGO <- godata('org.Hs.eg.db', ont="BP", computeIC=FALSE)
mgoSim(go1,go2,semData=hsGO,measure="Wang",combine="BMA") # 0.473
mgoSim(go1,go3,semData=hsGO,measure="Wang",combine="BMA") # 0.016
hsGO <- godata('org.Hs.eg.db', ont="MF", computeIC=FALSE)
mgoSim(go1,go2,semData=hsGO,measure="Wang",combine="BMA") # 0.473
mgoSim(go1,go3,semData=hsGO,measure="Wang",combine="BMA") # 0.016
hsGO <- godata('org.Hs.eg.db', ont="CC", computeIC=FALSE)
mgoSim(go1,go2,semData=hsGO,measure="Wang",combine="BMA") # 0.473
mgoSim(go1,go3,semData=hsGO,measure="Wang",combine="BMA") # 0.016

Are these results reasonable?

QianqianLiang commented 3 years ago

I have been having similar issues with the Wang method. When I input the same terms with different ontologies, I always get back exactly the same results. I went through the functions in WangMethod.R and found the following in the getSV function. Lines 58-61:

  if( exists(ID, envir=.SemSimCache) ) {
    sv <- get(ID, envir=.SemSimCache)
    return(sv)
  }

Lines 108-112:

  if( ! exists(ID, envir=.SemSimCache) ) {
    assign(ID,
           sv,
           envir=.SemSimCache)
  }

getSV stores the semantic value of an ID in the .SemSimCache environment the first time it is computed. The next time you ask for the semantic value of the same ID, it is retrieved from that environment rather than recomputed. The problem is that the cache key is the ID alone: if you request the semantic value of the same ID under a different ontology, you always get back the value computed on the first run. A quick workaround is to clear the .SemSimCache environment before running the second calculation with the same ID. In your case you can do:

remove(list = ls(envir = .SemSimCache), envir = .SemSimCache)
hsGO <- godata('org.Hs.eg.db', ont="BP", computeIC=FALSE)
mgoSim(go1,go2,semData=hsGO,measure="Wang",combine="BMA") # 0
mgoSim(go1,go3,semData=hsGO,measure="Wang",combine="BMA") # 0
remove(list = ls(envir = .SemSimCache), envir = .SemSimCache)
hsGO <- godata('org.Hs.eg.db', ont="MF", computeIC=FALSE)
mgoSim(go1,go2,semData=hsGO,measure="Wang",combine="BMA") # 0.473
mgoSim(go1,go3,semData=hsGO,measure="Wang",combine="BMA") # 0.016
remove(list = ls(envir = .SemSimCache), envir = .SemSimCache)
hsGO <- godata('org.Hs.eg.db', ont="CC", computeIC=FALSE)
mgoSim(go1,go2,semData=hsGO,measure="Wang",combine="BMA") # 0
mgoSim(go1,go3,semData=hsGO,measure="Wang",combine="BMA") # 0
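The stale-cache behaviour described above can be reproduced without GOSemSim. The following is a minimal sketch, not the package's code: `toy_cache`, `toy_values`, and `toy_sv_by_id` are hypothetical stand-ins, and the numbers are made up for illustration rather than real semantic values.

```r
# Toy cache standing in for .SemSimCache (hypothetical name).
toy_cache <- new.env()

# Made-up per-ontology values for a single ID; not real semantic values.
toy_values <- list(BP = c(a = 1, b = 0.8), MF = c(a = 1, c = 0.6))

toy_sv_by_id <- function(id, ont) {
  # The lookup ignores `ont`, mirroring the ID-only cache key.
  if (exists(id, envir = toy_cache, inherits = FALSE)) {
    return(get(id, envir = toy_cache))
  }
  sv <- toy_values[[ont]]
  assign(id, sv, envir = toy_cache)
  sv
}

first  <- toy_sv_by_id("GO:X", "BP")
second <- toy_sv_by_id("GO:X", "MF")  # served from the cache
stale  <- identical(first, second)    # the MF request got the BP value

# Clearing the cache between ontologies forces a recomputation.
rm(list = ls(envir = toy_cache, all.names = TRUE), envir = toy_cache)
fresh <- toy_sv_by_id("GO:X", "MF")
```

Here `stale` is `TRUE` because the second call never looks at `ont`, while `fresh` holds the MF value once the cache has been emptied.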

I have also attached a script with a modified getSV function (lightly tested). It stores each semantic value under a key that combines the ID and the ontology, so a cached value is returned only when the input has both the same ID and the same ontology.

getSV <- function(ID, ont, rel_df, weight=NULL) {
  # Cache key combines the ID and the ontology
  ID_ont <- paste(ID, ont, sep = ":")
  if (!exists(".SemSimCache")) .initial()
  .SemSimCache <- get(".SemSimCache", envir=.GlobalEnv)
  # Return the cached value only if both ID and ontology match
  if( exists(ID_ont, envir=.SemSimCache) ) {
    sv <- get(ID_ont, envir=.SemSimCache)
    return(sv)
  }
  if (ont == "DO") {
    topNode <- "DOID:4"
  } else {
    topNode <- "all"
  }
  if (ID == topNode) {
    sv <- 1
    names(sv) <- topNode
    return (sv)
  }
  if (is.null(weight)) {
    weight <- c(0.8, 0.6, 0.7)
    names(weight) <- c("is_a", "part_of", "other")
  }
  rel_df <- rel_df[rel_df$Ontology == ont,]
  if (! 'relationship' %in% colnames(rel_df))
    rel_df$relationship <- "other"
  rel_df$relationship[!rel_df$relationship %in% c("is_a", "part_of")] <- "other"
  sv <- 1
  names(sv) <- ID
  allid <- ID
  idx <- which(rel_df[,1] %in% ID)
  # Walk up the ancestors, scaling each contribution by the edge weight
  while (length(idx) != 0) {
    p <- rel_df[idx,]
    pid <- p$parent
    allid <- c(allid, pid)
    sv <- c(sv, weight[p$relationship]*sv[p[,1]])
    names(sv) <- allid
    idx <- which(rel_df[,1] %in% pid)
  }
  sv <- sv[!is.na(names(sv))]
  sv <- sv[!duplicated(names(sv))]
  if(ont != "DO")
    sv[topNode] <- 0
  # Store the result under the combined ID:ontology key
  if( ! exists(ID_ont, envir=.SemSimCache) ) {
    assign(ID_ont,
           sv,
           envir=.SemSimCache)
  }
  return(sv)
}
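
The effect of the combined key can be checked in isolation. This is a minimal sketch under the same toy assumptions as before: `keyed_cache`, `toy_values`, and `toy_sv_by_key` are hypothetical names, and the values are illustrative only. It shows just the `paste(ID, ont, sep = ":")` keying idea, not the semantic value computation itself.

```r
# Toy cache keyed by "ID:ontology" (hypothetical name).
keyed_cache <- new.env()

# Made-up per-ontology values for a single ID; not real semantic values.
toy_values <- list(BP = c(a = 1, b = 0.8), MF = c(a = 1, c = 0.6))

toy_sv_by_key <- function(id, ont) {
  key <- paste(id, ont, sep = ":")
  # A hit requires both the ID and the ontology to match.
  if (exists(key, envir = keyed_cache, inherits = FALSE)) {
    return(get(key, envir = keyed_cache))
  }
  sv <- toy_values[[ont]]
  assign(key, sv, envir = keyed_cache)
  sv
}

bp <- toy_sv_by_key("GO:X", "BP")
mf <- toy_sv_by_key("GO:X", "MF")  # computed separately, not reused from BP
cached_keys <- sort(ls(envir = keyed_cache))
```

With this keying, the same ID ends up with one cache entry per ontology, so the cache does not need to be cleared when switching between BP, MF, and CC.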