Any datasets or projects of similarity scores for pairs of EFO terms?

dhimmel commented 3 years ago

At @related-sciences, we created a Python library called nxontology for computing intrinsic semantic similarity between pairs of nodes in an ontology. "Intrinsic semantic similarity" refers to measures that derive only from the structure of the ontology graph itself, without using outside information. We'd like to benchmark the various metrics we compute against other similarity measures between EFO terms.
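To make "intrinsic" concrete, here is a minimal sketch that scores term pairs using only the graph structure. It does not use nxontology's API. The toy hierarchy, the term names, and the choice of information content (negative log of the fraction of terms that are descendants) are assumptions made for illustration and are not necessarily the metrics nxontology implements.

```python
import math

# Toy is-a hierarchy (child -> parents). Names are illustrative, not real EFO IDs.
PARENTS = {
    "disease": [],
    "cancer": ["disease"],
    "immune disease": ["disease"],
    "lung cancer": ["cancer"],
    "breast cancer": ["cancer"],
    "asthma": ["immune disease"],
}

def ancestors(term):
    """Return the term itself plus all of its ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def descendant_count(term):
    """Count terms that have `term` among their ancestors (including itself)."""
    return sum(1 for t in PARENTS if term in ancestors(t))

def intrinsic_ic(term):
    """Information content from structure alone: -log(descendants / total terms)."""
    return -math.log(descendant_count(term) / len(PARENTS))

def lin_similarity(a, b):
    """Lin similarity: 2 * IC(most informative common ancestor) / (IC(a) + IC(b))."""
    common = ancestors(a) & ancestors(b)
    mica_ic = max(intrinsic_ic(t) for t in common)
    denom = intrinsic_ic(a) + intrinsic_ic(b)
    return 2 * mica_ic / denom if denom else 1.0

print(lin_similarity("lung cancer", "breast cancer"))
print(lin_similarity("lung cancer", "asthma"))
```

In this toy graph the two cancers share a specific ancestor and score higher than the cancer/asthma pair, whose only common ancestor is the root.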

Hence my question: is anyone aware of projects that have computed similarity scores between EFO terms (primarily interested in traits/diseases)?

These similarity scores could be derived from:

- lexical similarity, for example whether descriptions in the ontology are similar
- corpus-based co-occurrence, for example whether the terms are mentioned together in the literature, perhaps using Zooma
- genetics-based similarity, perhaps something from the GWAS Catalog realm
- target-based similarity, like the Open Targets disease similarity score (manuscript)
- any other type of similarity, preferably for pairs of nodes, but clusters might be okay

Thanks for pointing me in the right direction. Hoping to avoid reinventing the wheel if these methods and datasets exist. CC @d0choa @matentzn @zoependlington

d0choa commented 3 years ago

We are very interested in this problem as we are about to deprecate the Open Targets Platform disease similarity score based on targets.

On top of the bullet points above, we have also previously explored phenotypic similarity between disease terms, based on their linked phenotypes and their specificity. Although it was a short exploratory project, it produced very promising results for pairs of terms that each have a minimum number of annotated phenotypes. We used disease-phenotype links from Monarch and the OntologyX/OntologySimilarity algorithm.
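As a rough illustration of the idea (not the OntologyX/OntologySimilarity algorithm itself), the sketch below weights shared phenotypes by their specificity, so that rare phenotypes count for more than common ones. The annotations are made-up toy data rather than Monarch links, and the function names and the `min_phenotypes` cutoff are hypothetical.

```python
import math

# Hypothetical disease -> phenotype annotations (toy data, not from Monarch).
ANNOTATIONS = {
    "disease A": {"fever", "cough", "wheezing"},
    "disease B": {"fever", "cough", "chest pain"},
    "disease C": {"fever", "rash"},
}

def phenotype_ic(phenotype):
    """Specificity of a phenotype: the fewer diseases annotated, the higher the value."""
    n = sum(1 for phenos in ANNOTATIONS.values() if phenotype in phenos)
    return -math.log(n / len(ANNOTATIONS))

def phenotype_similarity(d1, d2, min_phenotypes=2):
    """Specificity-weighted Jaccard index; None if either disease is under-annotated."""
    p1, p2 = ANNOTATIONS[d1], ANNOTATIONS[d2]
    if len(p1) < min_phenotypes or len(p2) < min_phenotypes:
        return None
    total = sum(phenotype_ic(p) for p in p1 | p2)
    shared = sum(phenotype_ic(p) for p in p1 & p2)
    return shared / total if total else 0.0

print(phenotype_similarity("disease A", "disease B"))
print(phenotype_similarity("disease A", "disease C"))
```

Here "fever" is annotated to every disease and so contributes nothing, which is why diseases A and C score zero despite sharing it.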

dhimmel commented 3 years ago

> we are about to deprecate the Open Targets Platform disease similarity score based on targets

Do you intend to replace the target-derived scores with the phenotype-derived similarity scores? Or just deprecate them because they are unreliable?

> we have also previously explored phenotypic similarity between disease terms, based on their linked phenotypes and their specificity

Is any of this data available? For our initial use cases, it'd be okay if it only covered a subset of terms (like those with many phenotypes).

One approach I applied in the past looked at genetic similarity between Disease Ontology terms. It worked very well, especially after using a random walk to diffuse similarity scores into robust proximity scores. The method is online at https://github.com/dhimmel/hodgkins and was based on GWAS Catalog data. The nice thing was that we overlapped loci between diseases based on genomic coordinates, avoiding the hard problem of variant-to-gene conversion. A more sophisticated approach would probably be possible with OTG's fine-mapping.
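A minimal sketch of the coordinate-overlap step only, assuming each disease comes with a list of loci given as (chromosome, start, end). This is not the code from the hodgkins repository: the data, the function names, and the normalization by the smaller locus count are all assumptions, and the random-walk step is not covered.

```python
# Hypothetical loci per disease as (chromosome, start, end); toy coordinates.
LOCI = {
    "disease X": [("1", 1000, 2000), ("2", 5000, 6000)],
    "disease Y": [("1", 1500, 2500), ("3", 100, 900)],
    "disease Z": [("4", 10, 20)],
}

def overlaps(a, b):
    """True if two loci are on the same chromosome and their intervals intersect."""
    return a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]

def shared_loci(d1, d2):
    """Number of loci of d1 that overlap at least one locus of d2."""
    return sum(1 for a in LOCI[d1] if any(overlaps(a, b) for b in LOCI[d2]))

def locus_similarity(d1, d2):
    """Symmetric score in [0, 1]: shared loci divided by the smaller locus count."""
    smaller = min(len(LOCI[d1]), len(LOCI[d2]))
    if smaller == 0:
        return 0.0
    shared = min(shared_loci(d1, d2), shared_loci(d2, d1))
    return shared / smaller

print(locus_similarity("disease X", "disease Y"))
print(locus_similarity("disease X", "disease Z"))
```

Comparing intervals directly means no gene has to be assigned to any variant, which is the point of the coordinate-based approach.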

I've also been impressed with the performance of MeSH topic co-occurrence for computing disease similarity. But this approach is a bit harder with EFO, since not all terms will have one-to-one correspondence with MeSH terms.
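For illustration, topic co-occurrence can be scored as the observed number of articles tagged with both terms divided by the number expected if the two terms were independent. The sketch below assumes that scoring and uses made-up article annotations, not real MeSH data. The function names are hypothetical.

```python
# Hypothetical articles, each tagged with a set of topic terms (toy data).
ARTICLES = [
    {"asthma", "eczema"},
    {"asthma", "eczema", "rhinitis"},
    {"asthma", "rhinitis"},
    {"lung cancer"},
    {"lung cancer", "asthma"},
    {"eczema"},
]

def article_count(*terms):
    """Number of articles tagged with every one of the given terms."""
    return sum(1 for topics in ARTICLES if all(t in topics for t in terms))

def cooccurrence_enrichment(a, b):
    """Observed co-occurrence divided by the count expected under independence."""
    expected = article_count(a) * article_count(b) / len(ARTICLES)
    return article_count(a, b) / expected if expected else 0.0

print(cooccurrence_enrichment("asthma", "rhinitis"))
print(cooccurrence_enrichment("eczema", "lung cancer"))
```

Applying this to EFO would first require mapping each EFO term to a topic term, which is where the lack of one-to-one correspondence becomes a problem.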

matentzn commented 3 years ago

The Monarch Initiative does a lot of work with phenotype and disease similarity (https://monarchinitiative.org/). There are various libraries and tools floating around, such as owlsim, and I think some key pieces of the Monarch API are driven by it (https://api.monarchinitiative.org/api).

Phenodigm is widely used by sources in Open Targets and Monarch (it is also the main algorithm in Exomiser). Worth looking at!

But yeah. Nothing comes to mind to address your question, but maybe @LLTommy has an idea?

EBISPOT / efo

Any datasets or projects of similarity scores for pairs of EFO terms? #912