INCATools / ontology-access-kit

Ontology Access Kit: A python library and command line application for working with ontologies
https://incatools.github.io/ontology-access-kit/
Apache License 2.0

Design of semantic similarity data model #169

Open matentzn opened 2 years ago

matentzn commented 2 years ago

Talking about: https://github.com/INCATools/ontology-access-kit/blob/main/src/oaklib/datamodels/similarity.yaml

It looks like we will define a new slot for every similarity algorithm we want to implement.

Wouldn't it be better to have an enum for the similarity algorithm (jaccard, resnik, phenodigm) and a single slot semantic_similarity_score for it (and another, semantic_similarity_value, for the value)? That way, people implementing the data model in their own systems would not need to extend it every time a new method is added.
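A minimal sketch of what that generic shape might look like, in plain Python dataclasses. The class and field names here (SemanticSimilarityScore, semantic_similarity_measure) are hypothetical illustrations, not names from the current schema:

```python
from dataclasses import dataclass
from enum import Enum

# Hypothetical enum of algorithms; a new method only needs a new value here,
# not a new slot in the schema.
class SimilarityAlgorithmEnum(str, Enum):
    JACCARD = "jaccard"
    RESNIK = "resnik"
    PHENODIGM = "phenodigm"

# Hypothetical generic score record: one slot for the algorithm,
# one slot for the value.
@dataclass
class SemanticSimilarityScore:
    semantic_similarity_measure: SimilarityAlgorithmEnum
    semantic_similarity_value: float

score = SemanticSimilarityScore(SimilarityAlgorithmEnum.RESNIK, 4.2)
```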

matentzn commented 2 years ago

Also, we don't need to change the data model every time we implement a new algorithm!

cmungall commented 2 years ago

I think there is a well-known set of core metrics, and adding new ones is easy: just add a line to the YAML. Modeling each metric as a property/field/slot is natural and simple, and allows for simple code like:

if sim.jaccard_score > thresh:

rather than using reification and

vals = [m.value for m in sim.metrics if m.type == ScoreEnum.JACCARD]
if vals and vals[0] > thresh:
    ...
matentzn commented 2 years ago

I don't know; jaccard all by itself means very little. Compared to what? Parents? Ancestors? The graph neighbourhood? I would want to be able to document the exact implementation, similar to mapping_tool and mapping_tool_version. Plus, once we get into the realm of embedding similarity, there are so many different implementations. I would still prefer a generic field. There are dozens of algorithms listed just on Wikipedia, and if we want to promote this as an exchange format for semantic similarity information rather than SSSOM, it should be more easily extensible than requiring downstream users to change the schema when they just want to document a different algorithm (using, say, the semantic similarity ontology you started to build).

cmungall commented 2 years ago

> I don't know; jaccard all by itself means very little. Compared to what? Parents? Ancestors? The graph neighbourhood?

This sounds like a documentation deficit. It's the ancestors, parameterized by a predicates list, and this metadata can go in the header.

But anyway, it sounds like this is the classic wide-table vs. EAV modeling distinction. I think we can do both. Stay tuned...
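To make the distinction concrete, here is a minimal sketch of both shapes as plain Python dataclasses. All class and field names are hypothetical illustrations, not the slots in similarity.yaml:

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Wide table: one slot per metric, direct attribute access.
@dataclass
class TermPairwiseSimilarityWide:
    subject_id: str
    object_id: str
    jaccard_similarity: Optional[float] = None
    phenodigm_score: Optional[float] = None

# EAV: one generic (type, value) row per metric.
@dataclass
class Metric:
    type: str  # e.g. "jaccard"
    value: float

@dataclass
class TermPairwiseSimilarityEAV:
    subject_id: str
    object_id: str
    metrics: List[Metric] = field(default_factory=list)

wide = TermPairwiseSimilarityWide("X:1", "Y:2", jaccard_similarity=0.8)
eav = TermPairwiseSimilarityEAV("X:1", "Y:2", [Metric("jaccard", 0.8)])

# Wide: simple, but a new metric means a new slot.
is_similar_wide = wide.jaccard_similarity > 0.5
# EAV: filter-then-compare, but new metrics need no schema change.
vals = [m.value for m in eav.metrics if m.type == "jaccard"]
is_similar_eav = bool(vals) and vals[0] > 0.5
```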

matentzn commented 1 year ago

Another argument for EAV is that we want to build a data science pipeline to compare semantic similarity profiles. I want to define the notion of a semantic similarity profile as a table <s,o,v,a> (subject, object, similarity score, algorithm), akin to a mapping, and then define generic operations to compare two profiles (for example, one based on embedding cosine similarity and one on jaccard), like a "diff". We want to be able to characterise reasonably precisely the differences between the two (average difference, lost and gained links over some threshold). If we have the approaches as columns (slots), the comparison code cannot easily be generic (excluding the possibility of some preprocessing step with SchemaView, which immediately creates a LinkML dependency for no great reason).
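With the <s,o,v,a> representation, such a "diff" can stay generic across algorithms. A rough sketch, where the function name and the choice of diff statistics are illustrative only:

```python
from typing import Dict, Tuple

# A profile maps (subject, object) pairs to a similarity score for one
# algorithm; the "a" column is implicit in which profile you pass in.
Profile = Dict[Tuple[str, str], float]

def diff_profiles(a: Profile, b: Profile, threshold: float = 0.5) -> dict:
    """Compare two similarity profiles, regardless of algorithm."""
    shared = a.keys() & b.keys()
    avg_diff = (
        sum(abs(a[k] - b[k]) for k in shared) / len(shared) if shared else 0.0
    )
    # Links over the threshold in one profile but not the other.
    lost = {k for k, v in a.items()
            if v >= threshold and b.get(k, 0.0) < threshold}
    gained = {k for k, v in b.items()
              if v >= threshold and a.get(k, 0.0) < threshold}
    return {"average_difference": avg_diff, "lost": lost, "gained": gained}

# Example: a jaccard-based profile vs. a cosine-based one.
jaccard_profile = {("X:1", "Y:2"): 0.9, ("X:1", "Y:3"): 0.6}
cosine_profile = {("X:1", "Y:2"): 0.8, ("X:1", "Y:3"): 0.2}
result = diff_profiles(jaccard_profile, cosine_profile)
```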