YuLab-SMU / GOSemSim

:golf: GO-terms Semantic Similarity Measures
https://guangchuangyu.github.io/software/GOSemSim
58 stars 26 forks source link

mgeneSim with Resnik measure produces incorrect results #23

Open deniseduma opened 5 years ago

deniseduma commented 5 years ago

Hi,

I'm trying to use GOSemSim to compute pairwise similarity scores for a large set of genes and it seems that the Resnik measure at least is incorrect because the self-similarities on the diagonal which should all be 1 are not 1 actually!

Here is the command I run

d = godata('org.Hs.eg.db', ont="BP", computeIC=TRUE) gsimmat3 = mgeneSim(my_gene_list, semData=d, drop="NULL", measure="Resnik", combine="BMA") print(gsimmat3)

and here is the output

[1] "gsimmat3" 150 151 152 1814 1815 1816 150 0.615 0.502 0.503 0.407 0.388 0.364 151 0.502 0.623 0.594 0.360 0.359 0.379 152 0.503 0.594 0.626 0.357 0.353 0.364 1814 0.407 0.360 0.357 0.678 0.509 0.451 1815 0.388 0.359 0.353 0.509 0.669 0.398 1816 0.364 0.379 0.364 0.451 0.398 0.680

As you can see the diagonal is not 1 as it should be!

Could you please fix this?

Thank you, Denise

llrs commented 5 years ago

I can reproduce this:

library("GOSemSim")
#> 
#> GOSemSim v2.10.0  For help: https://guangchuangyu.github.io/GOSemSim
#> 
#> If you use GOSemSim in published research, please cite:
#> Guangchuang Yu, Fei Li, Yide Qin, Xiaochen Bo, Yibo Wu, Shengqi Wang. GOSemSim: an R package for measuring semantic similarity among GO terms and gene products Bioinformatics 2010, 26(7):976-978
my_gene_list <- c("150", "151", "152", "1814", "1815", "1816")
d = godata('org.Hs.eg.db', ont="BP", computeIC=TRUE)
#> Loading required package: org.Hs.eg.db
#> Loading required package: AnnotationDbi
#> Loading required package: stats4
#> Loading required package: BiocGenerics
#> Loading required package: parallel
#> 
#> Attaching package: 'BiocGenerics'
#> The following objects are masked from 'package:parallel':
#> 
#>     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
#>     clusterExport, clusterMap, parApply, parCapply, parLapply,
#>     parLapplyLB, parRapply, parSapply, parSapplyLB
#> The following objects are masked from 'package:stats':
#> 
#>     IQR, mad, sd, var, xtabs
#> The following objects are masked from 'package:base':
#> 
#>     anyDuplicated, append, as.data.frame, basename, cbind,
#>     colnames, dirname, do.call, duplicated, eval, evalq, Filter,
#>     Find, get, grep, grepl, intersect, is.unsorted, lapply, Map,
#>     mapply, match, mget, order, paste, pmax, pmax.int, pmin,
#>     pmin.int, Position, rank, rbind, Reduce, rownames, sapply,
#>     setdiff, sort, table, tapply, union, unique, unsplit, which,
#>     which.max, which.min
#> Loading required package: Biobase
#> Welcome to Bioconductor
#> 
#>     Vignettes contain introductory material; view with
#>     'browseVignettes()'. To cite Bioconductor, see
#>     'citation("Biobase")', and for packages 'citation("pkgname")'.
#> Loading required package: IRanges
#> Loading required package: S4Vectors
#> 
#> Attaching package: 'S4Vectors'
#> The following object is masked from 'package:base':
#> 
#>     expand.grid
#> 
#> preparing gene to GO mapping data...
#> preparing IC data...
gsimmat3 = mgeneSim(my_gene_list, semData=d, drop="NULL", measure="Resnik", combine="BMA")
#> |=================================================================| 100%
gsimmat3
#>        150   151   152  1814  1815  1816
#> 150  0.638 0.502 0.502 0.376 0.379 0.352
#> 151  0.502 0.641 0.605 0.311 0.322 0.349
#> 152  0.502 0.605 0.645 0.311 0.320 0.335
#> 1814 0.376 0.311 0.311 0.667 0.507 0.424
#> 1815 0.379 0.322 0.320 0.507 0.673 0.408
#> 1816 0.352 0.349 0.335 0.424 0.408 0.701

Created on 2019-06-07 by the reprex package (v0.3.0)

I though it could be an error on mgeneSim but it turns that it happens the same on geneSim:

lr <- geneSim("150", "150", semData = d, measure = "Resnik", combine = "BMA")
lr$geneSim
#> [1] 0.639

The problem seems to come from the Resnik similarity as calculating the similarity between the same GOs with Resnik reports a value different from 0:

mgoSim(lr$GO1[1:5], lr$GO2[1:5], semData = d, measure = "Resnik", combine = NULL)
#>            GO:0001819 GO:0007186 GO:0007189 GO:0007193 GO:0007265
#> GO:0001819      0.415      0.075      0.075      0.075      0.075
#> GO:0007186      0.075      0.367      0.367      0.367      0.181
#> GO:0007189      0.075      0.367      0.594      0.542      0.276
#> GO:0007193      0.075      0.367      0.542      0.627      0.181
#> GO:0007265      0.075      0.181      0.276      0.181      0.495

However I'm not sure if the Resnik similarity between the same element must be 1. I couldn't understand what happens in that case from the original article.

serbulent commented 4 years ago

I also can reproduce this with Rel also.

hsGO <- godata('org.Hs.eg.db', ont="BP", keytype = "UNIPROT") mgeneSim(genes=c("A1E959") , semData=hsGO, measure="Rel",verbose=TRUE, drop = "IEA" )

|======================================================================| 100%
              A1E959
A1E959  0.983
Pmaj7 commented 11 months ago

I'd like to confirm that I'm having a similar issue with Resnik scores not being calculated correctly.

library(GOSemSim)
library(org.Hs.eg.db)

GO.org <- org.Hs.eg.db
GO.anno <- godata(GO.org, ont = "BP")

# Term 1 vs. Term 2
> goSim("GO:0042060", "GO:0061041", semData = GO.anno, measure = "Resnik")
[1] 0.469

# Term 1 vs. Term 1
> goSim("GO:0042060", "GO:0042060", semData = GO.anno, measure = "Resnik")
[1] 0.469

# Term 2 vs. Term 2
> goSim("GO:0061041", "GO:0061041", semData = GO.anno, measure = "Resnik")
[1] 0.569

As you can see, these values make no sense, and should be much greater, especially if one term is being compared to itself. When testing Resnik similarity using NaviGO's web portal, it accurately determines Term 1 (wound healing) and Term 2 (regulation of wound healing) are highly similar with a Resnik score of 2.4.

I'm surprised there aren't more people raising concerns about this, but perhaps Resnik isn't used by many and most just use the default Wang method. This just goes to show, never blindly believe the results you get from one tool/package. Always make sure you verify results using tools from multiple devs to be sure what you get from any one tool is accurate.