Open tivdnbos opened 4 years ago
According to the authors, the simRel method is aimed at comparing individual gene products rather than functional profiles. It therefore penalizes generic terms: “Generic terms do not have a high relevance for the comparison of the exact function of different gene products.” In my opinion, this penalty does not make sense when comparing profiles. Without the penalty, simRel reduces to simLin.
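To make the relation between the two metrics concrete, here is a minimal sketch, not the MegaGO implementation. It assumes the usual textbook definitions: simLin is twice the log-probability of the most informative common ancestor (MICA) divided by the summed log-probabilities of the two terms, and simRel multiplies that by `1 - p(MICA)`. The function names and the probability values are made up for illustration.

```python
import math


def sim_lin(p_a: float, p_b: float, p_mica: float) -> float:
    """Lin similarity from annotation probabilities, each in (0, 1]."""
    denom = math.log(p_a) + math.log(p_b)
    if denom == 0.0:
        # Both terms are the root (p = 1), so the ratio is 0/0.
        # Returning 1.0 is a convention chosen for this sketch only.
        return 1.0
    return 2.0 * math.log(p_mica) / denom


def sim_rel(p_a: float, p_b: float, p_mica: float) -> float:
    """simRel = simLin scaled by the relevance factor (1 - p(MICA))."""
    return sim_lin(p_a, p_b, p_mica) * (1.0 - p_mica)


# Hypothetical generic term with probability 0.45, compared with itself:
p = 0.45
print(round(sim_lin(p, p, p), 2))  # no penalty for generic terms
print(round(sim_rel(p, p, p), 2))  # scaled down by (1 - p)
```

Under these definitions the only difference is the factor `1 - p(MICA)`, which is the penalty for generic terms discussed above.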
I suggest creating a separate branch where we test it with simLin. What do you think @rababerladuseladim @pverscha ?
I redid the analysis with the simLin metric; the results are here: https://github.com/MEGA-GO/manuscript-data-analysis/tree/use_lin_metric Sample clustering is not affected, and the similarity ranges shift slightly towards higher values.
When identical high-level terms are compared, a low score is returned, e.g.:
GO:0030170 (pyridoxal phosphate binding) vs GO:0030170 gives 99% similarity
GO:0043167 (ion binding) vs GO:0043167 gives 55% similarity
GO:0003674 (molecular function) vs GO:0003674 gives 0% similarity
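A possible reading of these numbers, assuming they were produced with simRel: when a term is compared with itself, the Lin part equals 1, so the score is just `1 - p(term)`. The sketch below back-calculates the annotation probability that each self-similarity would imply. The helper name is hypothetical, and the resulting probabilities are derived from the percentages above, not looked up in any annotation database.

```python
def implied_probability(self_similarity: float) -> float:
    """Probability p(term) implied by a simRel self-similarity of 1 - p."""
    if not 0.0 <= self_similarity <= 1.0:
        raise ValueError("self-similarity must lie in [0, 1]")
    return 1.0 - self_similarity


# Self-similarities reported in this thread, as fractions.
reported = {
    "GO:0030170": 0.99,
    "GO:0043167": 0.55,
    "GO:0003674": 0.00,
}

for term, score in reported.items():
    print(term, round(implied_probability(score), 2))
```

Read this way, the root term GO:0003674 scores 0% because it annotates everything (p = 1), which is the generic-term penalty at its extreme.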
I also tested what happens if a term appears multiple times in the list (e.g. 10x GO:0043167 vs 1x GO:0043167), but this gives the same result: 55% in this case.