Document n-ary relationship design patterns

cmungall commented 3 years ago

We frequently need what can be viewed as ternary or n-ary relationships among more than two entities. Examples:

gene correlated with gene (in a specific tissue) see #324
chemical derived from chemical (mediated by a gene product or activity)
gene interacts with gene (in the context of a specific pathway)

I'm treating provenance and evidence entities as distinct, although there are cases where they are hard to separate.

We should document the design patterns here. This would draw from https://www.w3.org/TR/swbp-n-aryRelations/ but may differ because we are working in a property graph formalism.

Broadly speaking, there are usually different ways to model these things. Often there is no correct way. Often there is a more granular representation, and a 'shortcut' representation.

Take the first example above (see https://github.com/biolink/biolink-model/issues/324#issuecomment-611193591 for deeper discussion):

1. Granular representation: have an expression event node e in which g1 and g2 participate. e can have any number of edges emanating from it, e.g. site of expression, time, experimental conditions.
2. Canonical shortcut: have an edge between the two genes, with the site of expression etc. as edge properties.
3. Multi-shortcut: g1-g2, g1-site, and g2-site edges. This is similar to the Wikidata reification model.
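
As a rough sketch of the three representations listed above, assuming a toy property graph held in plain Python dicts. The node ids (`g1`, `g2`, `e`, `tissue1`) and the predicate names are hypothetical placeholders, not Biolink terms:

```python
# Toy encodings of "g1 correlated with g2 in a specific tissue".
# All ids and predicate names below are illustrative assumptions.

# Granular: an explicit event node "e"; context hangs off the event.
granular = [
    {"subject": "g1", "predicate": "participates_in", "object": "e"},
    {"subject": "g2", "predicate": "participates_in", "object": "e"},
    {"subject": "e", "predicate": "site", "object": "tissue1"},
]

# Canonical shortcut: one g1-g2 edge, context stored as edge properties.
canonical_shortcut = [
    {"subject": "g1", "predicate": "correlated_with", "object": "g2",
     "properties": {"site": "tissue1"}},
]

# Multi-shortcut: three pairwise edges with nothing tying them together.
multi_shortcut = [
    {"subject": "g1", "predicate": "correlated_with", "object": "g2"},
    {"subject": "g1", "predicate": "expressed_in", "object": "tissue1"},
    {"subject": "g2", "predicate": "expressed_in", "object": "tissue1"},
]


def entities(edges):
    """Return every entity an encoding mentions, including property values."""
    found = set()
    for edge in edges:
        found.update((edge["subject"], edge["object"]))
        found.update(edge.get("properties", {}).values())
    return found
```

All three encodings mention the same genes and tissue; they differ in whether the grouping is carried by an extra node, by edge properties, or not at all.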

For the pathway example, there may be even more granular representations, e.g. GO-CAMs

There are use cases for doing things one way, and other use cases for another way.

We can approach this by either:

1. defining a canonical way (usually the more granular) and allowing applications to have their own local ways of doing shortcuts
2. defining both canonical and shortcut ways in the model, and providing rules for interconversion
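
To make "rules for interconversion" concrete, here is a minimal sketch, assuming the same kind of toy dict-based graph. The function names, the `participates_in` predicate, and the default `correlated_with` predicate are all hypothetical, and the sketch only handles events with exactly two participants:

```python
def granular_to_shortcut(edges, event, predicate="correlated_with"):
    """Collapse one event node into a single edge with edge properties."""
    participants = sorted(
        e["subject"] for e in edges
        if e["object"] == event and e["predicate"] == "participates_in"
    )
    if len(participants) != 2:
        raise ValueError("this sketch only handles binary events")
    # Every edge leaving the event node becomes an edge property.
    properties = {
        e["predicate"]: e["object"] for e in edges if e["subject"] == event
    }
    subject, obj = participants
    return {"subject": subject, "predicate": predicate,
            "object": obj, "properties": properties}


def shortcut_to_granular(edge, event):
    """Expand a shortcut edge back into an event node and its edges."""
    expanded = [
        {"subject": edge["subject"], "predicate": "participates_in",
         "object": event},
        {"subject": edge["object"], "predicate": "participates_in",
         "object": event},
    ]
    for key, value in edge.get("properties", {}).items():
        expanded.append({"subject": event, "predicate": key, "object": value})
    return expanded


granular = [
    {"subject": "g1", "predicate": "participates_in", "object": "e1"},
    {"subject": "g2", "predicate": "participates_in", "object": "e1"},
    {"subject": "e1", "predicate": "site", "object": "tissue1"},
]
shortcut = granular_to_shortcut(granular, "e1")
```

The conversion is lossy in general: if the event node has two outgoing edges with the same predicate, only one survives as an edge property, and participant roles are flattened into subject/object.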

I am leaning towards the 2nd, but we should discuss on a call

In either case, there are broad design patterns that should be documented

cbizon commented 3 years ago

define both canonical and shortcut ways in the model, and provide rules for interconversion

I think that this is the right thing to do.

I'm not sure I understand the Multi-Shortcut way listed above. Is there some edge property or something that ties together that set of edges so that you know that they go together?

cmungall commented 3 years ago

The multi-shortcut was your idea :-) but I omitted from the ticket the details you gave earlier:

having 3 pairwise edges (geneA-geneB, geneA-site, geneB-site) and putting a common hyperedge id on them to indicate that they go together. We've tried this solution before, but it produces complicated Cypher queries, and it makes sort of a mess when the same entity (especially a tissue) appears in many hyperedges.

I'll update my text above
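
The hyperedge-id variant quoted above can be sketched as follows, assuming each pairwise edge carries a hypothetical `hyperedge_id` property (the ids `h1`, `h2` and the node names are placeholders):

```python
from collections import defaultdict

# Two hyperedges that share geneA and the same site.
edges = [
    {"subject": "geneA", "object": "geneB", "hyperedge_id": "h1"},
    {"subject": "geneA", "object": "site1", "hyperedge_id": "h1"},
    {"subject": "geneB", "object": "site1", "hyperedge_id": "h1"},
    {"subject": "geneA", "object": "geneC", "hyperedge_id": "h2"},
    {"subject": "geneA", "object": "site1", "hyperedge_id": "h2"},
    {"subject": "geneC", "object": "site1", "hyperedge_id": "h2"},
]


def group_hyperedges(edges):
    """Reassemble hyperedges by collecting the members of each id."""
    members = defaultdict(set)
    for edge in edges:
        members[edge["hyperedge_id"]].update((edge["subject"], edge["object"]))
    return dict(members)


hyperedges = group_hyperedges(edges)
```

Note that the geneA-site1 edge has to be stored once per hyperedge, which hints at the clutter described above when a tissue takes part in many hyperedges.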

cbizon commented 3 years ago

Ha! OK, I thought that might be the case, but wasn't sure :)

colleenXu commented 3 years ago

I don't understand some of the text above. In the "We can approach this..." section, is "canonical" referring to 1 or 2? Is "shortcut" referring to 2 or 3?

In my work developing a model that presents associations as edges with edge properties, I've been doing situation 2. An optional edge property called context_relevance holds a "dictionary" of one or more key-value pairs. The keys are the types of context/relevance that MUST be used in interpreting this association (taxon-specific, experimental-setup-specific, disease-specific, cohort-specific), and the values are lists of ontology terms/structured vocabulary terms that describe that context/relevance. For example, to say that this gene - GO-biological-process association was asserted using experiments in a specific human glioma cell line (in YAML):

## a specific edge has this property
context_relevance:
  experimental-setup-specific:
  - CLO:0001367
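
A minimal sketch of how a consumer might read that property, assuming the edge is loaded as a Python dict. The subject, predicate, and object values are placeholders, `applies_in` is a hypothetical helper, and treating a missing key as "no restriction" is an assumption rather than part of the model described above:

```python
# Edge mirroring the YAML above; subject/predicate/object are placeholders.
edge = {
    "subject": "GENE:placeholder",
    "predicate": "involved_in",
    "object": "GO:placeholder",
    "context_relevance": {
        "experimental-setup-specific": ["CLO:0001367"],
    },
}


def applies_in(edge, context_type, term):
    """True if the edge lists `term` for this context type.

    Assumption: an absent context type means the edge is not
    restricted along that dimension.
    """
    allowed = edge.get("context_relevance", {}).get(context_type)
    return allowed is None or term in allowed
```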

suihuang-ISB commented 3 years ago

I am glad and surprised to see this discussion here ... (sorry, I am not a frequent participant in the DATA MODELING meetings, so I may restate earlier discussions). May I raise an issue here?
Given my long research in gene regulatory networks, I am wondering whether Chris M is essentially saying that we should try to emulate hypergraphs, using additional nodes and hyperedges and their membership in sets, to capture conditional interactions.

But then the simple example (geneA-geneB, geneA-site, geneB-site), which may for instance mean that geneA upregulates geneB in site X, is only the beginning. In reality, geneA upregulates geneB only if {gene X1 is on, gene X2 is off, gene X3 is on, ..., gene X_N is on}. Thus this would be not only an N-hyperedge, but it would also have to be specified by a boolean function as a property of that hyperedge. The condition {...} of course represents a gene expression profile, and thus essentially a tissue in which "geneA upregulates geneB" is valid. But this is not simply a question of granularity. The question is: when do we stop if we enter this rabbit hole of conditional predicates specified by hyperedges? One condition? Two conditions? N conditions? And how do we capture their logical relationship? And how do we store/access the table that ties the edges together into hyperedges? Is this information in the edge property? ... I fear a combinatorial explosion.
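
To illustrate the concern, here is a small sketch assuming the condition is a plain conjunction of required on/off states stored on the edge. The `condition` property and the `holds_in` helper are hypothetical; a general boolean function would need a richer encoding than this:

```python
# "geneA upregulates geneB" only under a required expression profile.
interaction = {
    "subject": "geneA",
    "predicate": "upregulates",
    "object": "geneB",
    # Conjunctive condition: required on/off state of each context gene.
    "condition": {"X1": True, "X2": False, "X3": True},
}


def holds_in(interaction, profile):
    """True if the expression profile satisfies every required state.

    Genes missing from the profile are treated as not satisfying the
    condition.
    """
    return all(
        profile.get(gene) == state
        for gene, state in interaction["condition"].items()
    )


# Number of distinct on/off profiles over the context genes alone.
n_profiles = 2 ** len(interaction["condition"])
```

With N context genes there are 2**N possible on/off profiles, which is one way to see where the combinatorial explosion would come from.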

A practical example that we have encountered when accessing LINCS and cell line info is the following: "DrugX activates GeneA in tissueY and in disease Z".

I guess this is not so much a question of HOW we capture things, but WHAT we are willing to capture in the Translator in the first place. Where do we draw the line? My impression has been that so far we have been avoiding hypergraphs in KGs... Is that correct? Or do we allow for 3-hyperedges (as indicated in the original title of this issue) but not higher (n) degrees? Or can we avoid hyperedges altogether by introducing corresponding nodes (e.g. as most prosaically represented by "reaction nodes" in biochemical reactions)? Has there been an agreement on this?

sierra-moxon commented 1 year ago

I believe our recent work on Biolink 3 has helped us explore these ideas in the gene->chemical and gene->gene association space. In the recent release of Biolink (3.1.0) we've also expanded our documentation to provide predicate->qualifier-based-association mappings (examples/biolink3_migration/predicate_mapping.yaml) and more example transformations in the qualifier model. (see: https://biolink.github.io/biolink-model/guidelines/association-examples-with-qualifiers.html). I am going to close this as there do not seem to be additional action items to take and continue documenting these mappings as we refine the predicate hierarchy. (please of course reopen if this is insufficient).
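
As a hedged illustration of what a predicate-to-qualifier mapping might look like in code: the slot names below follow the Biolink 3 qualifier model as best understood here, and the single mapping row is illustrative only. The predicate_mapping.yaml file and the linked guideline remain the authoritative source, and `to_qualified` is a hypothetical helper:

```python
# One illustrative row; the real table lives in predicate_mapping.yaml.
PREDICATE_MAPPING = {
    "biolink:increases_expression_of": {
        "predicate": "biolink:affects",
        "qualified_predicate": "biolink:causes",
        "object_aspect_qualifier": "expression",
        "object_direction_qualifier": "increased",
    },
}


def to_qualified(edge, mapping=PREDICATE_MAPPING):
    """Rewrite a pre-3.0 style edge into a predicate plus qualifiers."""
    rule = mapping.get(edge["predicate"])
    if rule is None:
        return dict(edge)  # no rule: return an unchanged copy
    return {**edge, **rule}


old_edge = {
    "subject": "CHEM:placeholder",
    "predicate": "biolink:increases_expression_of",
    "object": "GENE:placeholder",
}
new_edge = to_qualified(old_edge)
```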

colleenXu commented 1 year ago

@sierra-moxon On this webpage, https://biolink.github.io/biolink-model/guidelines/association-examples-with-qualifiers.html, the Association objects have typos in the "category" field.

For example, I see "category": "biolink:ChemicalAffectsGeneAsociation",. And it looks like "Asociation" is used there.

sierra-moxon commented 1 year ago

Thanks @colleenXu - the Asociation typo in the example file is fixed here: https://github.com/biolink/biolink-model/pull/1149

Document n-ary relationship design patterns #566