biolink / biolink-model

Schema and generated objects for biolink data model and upper ontology
https://biolink.github.io/biolink-model/

Semantics of domain vs class slots #198

cmungall closed this issue 4 years ago

cmungall commented 5 years ago

I think there is currently an assumption that these were equivalent:

This is my fault for not being more explicit in the metamodel.

The actual intent was that instantiations of classes must be allowed by the stated slots.

E.g. if the instance graph has c r d, then it MUST be the case that:

From a pragmatic view this is nice for developers as they just have to look at the class and its superclasses to see what slots are allowed, i.e. normal OO.
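
A minimal sketch of that lookup, assuming a hypothetical in-memory schema in which each class lists its own slots and an optional `is_a` parent. The class and slot names below are illustrative and not taken from the actual model:

```python
# Hypothetical schema fragment: each class declares its own slots and an optional parent.
SCHEMA = {
    "named_thing": {"is_a": None, "slots": ["id", "name"]},
    "gene": {"is_a": "named_thing", "slots": ["in_taxon"]},
}


def allowed_slots(class_name, schema=SCHEMA):
    """Collect the slots declared on a class and on all of its superclasses."""
    slots = []
    current = class_name
    while current is not None:
        entry = schema[current]
        for slot in entry["slots"]:
            if slot not in slots:
                slots.append(slot)
        current = entry["is_a"]
    return slots


def disallowed_slots(class_name, used_slots, schema=SCHEMA):
    """Return the slots used by an instance that its class does not allow."""
    permitted = set(allowed_slots(class_name, schema))
    return sorted(set(used_slots) - permitted)
```

Under this reading, an instance of `gene` that uses `interacts_with` directly would be flagged, because no class in its ancestry lists that slot.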

The domain constraint adds an additional layer

Now, it gets interesting when we consider reification. I'll invent some definitions here:

This means that the slot constraint needs to be relaxed, since an XtoYAssoc can 'inject' additional predicates into any X.

I will also make a ticket for myself to better specify the metamodel semantics

cmungall commented 5 years ago

Note that one of the implicit assumptions here is that the OO model is association-oriented and "heavyweight". You generally don't get to say:

g1 = Gene("g1")
g2 = Gene("g2")
g1.interacts_with(g2)

instead you have to make an association object. g1 and g2 don't directly "know" about each other. This also has the nice side effect of not worrying about reciprocity (e.g. in the above, should the client also instantiate the reciprocal link for g2?)
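
A rough sketch of what the association-oriented style could look like in Python. The `Gene` and `GeneToGeneAssociation` dataclasses and the `interactors` helper are hypothetical, written only to illustrate the point, and are not the generated biolink classes:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    id: str


@dataclass(frozen=True)
class GeneToGeneAssociation:
    # The association carries the link; the genes themselves hold no reference to each other.
    subject: Gene
    relation: str
    object: Gene


g1 = Gene("g1")
g2 = Gene("g2")
assoc = GeneToGeneAssociation(subject=g1, relation="interacts_with", object=g2)


def interactors(gene, associations):
    """Find interaction partners by scanning associations in either direction."""
    partners = []
    for a in associations:
        if a.relation != "interacts_with":
            continue
        if a.subject == gene:
            partners.append(a.object)
        elif a.object == gene:
            partners.append(a.subject)
    return partners
```

Because the lookup checks both ends of the association, a single object covers the symmetric case and the client never has to create a reciprocal link.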

It does incur some extra cost for the client. It could also be argued that the choice of which things are associations vs direct slots is arbitrary. Why is "name" not an association?

We may later want to allow people flexibility in generating their UML/OM, e.g. they may opt to dereify some associations and have direct slots.

hsolbrig commented 5 years ago

I think this is a reason why, despite a lot of similarity, we need to separate the modeling language from the model. In the modeling language, I have yet to encounter a use case where:

slots:
   r:
     domain: C

is not exactly equivalent to:

classes:
    C:
        slots:
            - r
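
To make the claimed equivalence concrete, here is a small sketch that normalizes both declaration styles to the same class-to-slots mapping. The `class_slots` function and the dictionary layout are assumptions made for illustration; this is not how the actual generator works:

```python
def class_slots(schema):
    """Map each class to its slots, treating `domain: C` on a slot like listing it under C."""
    result = {
        name: set(cls.get("slots", []))
        for name, cls in schema.get("classes", {}).items()
    }
    for slot_name, slot in schema.get("slots", {}).items():
        domain = slot.get("domain")
        if domain is not None:
            result.setdefault(domain, set()).add(slot_name)
    return result


# The two styles from above, written as plain dictionaries.
by_domain = {"slots": {"r": {"domain": "C"}}, "classes": {"C": {}}}
by_class = {"slots": {"r": {}}, "classes": {"C": {"slots": ["r"]}}}
```

Under this interpretation both schemas yield the same result, which is the sense in which the two forms are equivalent in the modeling language.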

As an example, if I am defining Association:

slots:
    subject:
         domain: association
         range: slot_class_description
         required: true
         ...

classes:
    association:
        is_a: named_thing
        slots:
            - association_type
            - subject
            - relation
            - object
            - parent
            - edge_label
            - negated
            - has_confidence_level
            - has_evidence
            - provided_by
            - supporting_publications
            - association_mixins
            - apply_to_association

The problem arises, however, with the following set of statements:

slots:
  related to:
    description: A grouping for any relationship type that holds between any two things
    domain: named thing
    range: named thing

  interacts with:
    description: holds between any two entities that directly or indirectly interact with each other
    is_a: related to
    symmetric: true

This asserts that for every instance of a named thing there can be at most one related to property, the range of which must be a named thing. As interacts with is a kind of related to, one option would be to use interacts with in place of related to. This is NOT what we want. The problem is, in the modeling language, we have a set of needs that DO correspond with the above:

slots:
    name:
         domain: named_thing
         range: string
    slot_name:
         is_a: name
         domain: node

Same pattern, but this time the intended semantics are what we want. Every instance of named_thing may have at most one 'name' and, if the class is an instance of node, the name will be called slot_name. (Contrived but you get the idea...)
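
The inheritance pattern shared by both examples can be sketched as a lookup that walks up `is_a` until some slot declares the requested property. The `resolve` helper is hypothetical, and the dictionary is just the YAML above transcribed by hand:

```python
# The slot definitions from the two examples above, transcribed as a dictionary.
SLOTS = {
    "related to": {"domain": "named thing", "range": "named thing"},
    "interacts with": {"is_a": "related to", "symmetric": True},
    "name": {"domain": "named_thing", "range": "string"},
    "slot_name": {"is_a": "name", "domain": "node"},
}


def resolve(slot_name, key, slots=SLOTS):
    """Walk up is_a until a slot declares `key`; return None if none does."""
    current = slot_name
    while current is not None:
        slot = slots[current]
        if key in slot:
            return slot[key]
        current = slot.get("is_a")
    return None
```

The mechanism is identical in both cases: `interacts with` picks up the domain and range of `related to`, and `slot_name` overrides the domain of `name` while inheriting its range. Only the intended reading of the result differs.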

So what I concluded is that, as the semantics of the modeling language is significantly different from the semantics of the target model, we need to separate them.

In particular, until we get into the model element called Association, the discussion above doesn't actually make a lot of sense. There should be Python classes named "Association", "Node", "Relation", ... instances of which are GeneToGene, Gene and interacts_with respectively.

While this is, I believe, the correct approach, it is definitely a bit of a challenge to define and implement the KG model, which consists of instances of the following entities:

Note 1: There is an interesting question in the model with respect to domain and range. Currently, these appear in the association definitions:

associations:
  gene to gene association:
    aliases: ['molecular or genetic interaction']
    abstract: true                    # <----------
    description: >-
      abstract parent class for different kinds of gene-gene or gene product to gene product relationships.
      Includes homology and interaction.
    subject:
      range: gene or gene product
      description: >-
        the subject gene in the association. If the relation is symmetric, subject vs object is arbitrary.
        We allow a gene product to stand as proxy for the gene or vice versa
      definitional: true
    object:
      range: gene or gene product
      description: >-
        the object gene in the association. If the relation is symmetric, subject vs object is arbitrary.
        We allow a gene product to stand as proxy for the gene or vice versa
      definitional: true

  gene to gene homology association:
    is_a: gene to gene association
    description: >-
      A homology association between two genes. May be orthology (in which case the species of subject and object
      should differ) or paralogy (in which case the species may be the same)
    subject:
      definitional: true
    relation:
      range: homologous to                          # <-------
      description: homology relationship type
      definitional: true
    object:
      definitional: true

relations:
  homologous to:                                         # <-------
    aliases: ['in homology relationship with']
    symmetric: true
    description: >-
      holds between two biological entities that have common evolutionary origin
    comments:
      - typically used to describe homology relationships between genes or gene products
    in_subset:
      - translator_minimal
    mappings:
      - RO:HOM0000001
      - SIO:010302

You will note that the domain and range are defined in the association. This allows inference. Given:

<http://data2services/model/association/carrier/08b0f41254f99fe99092848ca0acd921> a ns1:ChemicalToGeneAssociation ;
    ns1:affects "Human" ;
    ns1:object <http://identifiers.org/drugbank/BE0000438> ;
    ns1:publications <http://identifiers.org/pubmed/10652246> ;
    ns1:relation ns1:affects_transport_of ;
    ns1:subject <http://identifiers.org/drugbank/DB13751> .

We can infer that <http://identifiers.org/drugbank/DB13751> is a subclass of chemical and that <http://identifiers.org/drugbank/BE0000438> is a subclass of gene. Interestingly, the model says nothing at the moment about the relation -- one could put ANY relation in there. We need to talk further about this, as the above example uses a biolink:affects_transport_of relationship and I'm not sure we want to be in the ontology business.
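
A minimal sketch of that inference step, assuming a hypothetical lookup table of subject and object ranges per association type. The table entry for ChemicalToGeneAssociation is inferred from the sentence above, not copied from the model file:

```python
# Assumed subject/object ranges per association type (illustrative only).
ASSOCIATION_RANGES = {
    "ChemicalToGeneAssociation": {"subject": "chemical", "object": "gene"},
}


def infer_types(association):
    """Infer a type for the subject and object node from the association's ranges."""
    ranges = ASSOCIATION_RANGES[association["type"]]
    return {association[role]: cls for role, cls in ranges.items()}


# The Turtle example above, reduced to the fields that matter here.
edge = {
    "type": "ChemicalToGeneAssociation",
    "subject": "http://identifiers.org/drugbank/DB13751",
    "relation": "affects_transport_of",
    "object": "http://identifiers.org/drugbank/BE0000438",
}
```

Note that `relation` plays no part in the lookup, which mirrors the gap described above: nothing here constrains which relation an association may carry.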

Were the above to be accompanied by more assertions:

<http://identifiers.org/drugbank/BE0000530> a ns1:GeneOrGeneProduct ;
    ns1:category <http://identifiers.org/pfam/PF00273> ;
    ns1:id "BE0000530" ;
    ns1:in_taxon <http://identifiers.org/taxonomy/9606> ;
    ns1:name "Serum albumin" ;
    ns1:same_as <http://identifiers.org/uniprot/P02768> .

We could either check for errors (we've inferred that BE0000530 is a Gene and asserted that it is a GeneOrGeneProduct; are these compatible?) or do further inference (Gene subClassOf GeneOrGeneProduct). Where should we go?
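
The error-checking option could look roughly like the sketch below. The `SUBCLASS_OF` table is an assumption made for illustration (including the Gene subClassOf GeneOrGeneProduct link mentioned above), and `compatible` is a hypothetical helper:

```python
# Assumed class hierarchy (child -> parent); illustrative only.
SUBCLASS_OF = {
    "Gene": "GeneOrGeneProduct",
    "GeneOrGeneProduct": "NamedThing",
    "Chemical": "NamedThing",
}


def ancestors(cls):
    """Return the class followed by all of its superclasses, nearest first."""
    chain = [cls]
    while chain[-1] in SUBCLASS_OF:
        chain.append(SUBCLASS_OF[chain[-1]])
    return chain


def compatible(inferred, asserted):
    """Treat two types as compatible if one is the other or one of its ancestors."""
    return asserted in ancestors(inferred) or inferred in ancestors(asserted)
```

With this hierarchy, an inferred Gene and an asserted GeneOrGeneProduct are compatible, while an inferred Gene and an asserted Chemical would be reported as a conflict even though both share NamedThing as an ancestor.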

Note that the 'in_taxon' is an example of an extension slot. It is also an interesting case that you mentioned above -- should in_taxon be an association class, a property of GeneOrGeneProduct or (shudder) both?

BTW - the biolink_model had symmetric and inverse tags on the Association class -- I had moved them to the relationship class but this may not be the correct decision. Perhaps they belong in both places?

cmungall commented 5 years ago

@hsolbrig is this finally resolved? I think we are agreed that declaring slots should not entail declaring a domain?