Open RichardBruskiewich opened 2 years ago
relevant:
@putmantime @RichardBruskiewich - is this a blocker for Monarch?
I've just started working on the Gene Orthology ingest into Monarch (using Panther data), so it is kind of a blocker.
Let's also discuss panther.node IDs here as well. These would be related transitively via evolutionary descent, and horizontally via homology relations. This is in line with the GO/Panther interpretation.
As for the relationship between genes, proteins, nodes, and families: member-of is a good name for this relation but RO has quite strict semantics here.
When choosing any term (class or relation) it's always a good idea to check the hierarchy:
http://purl.obolibrary.org/obo/RO_0002350
RO has a strict mereological view of membership; there isn't really a physical structure existing in space that is the collection of all present and ancestral SHH genes, for example.
There is an argument to be made for using subclass_of - this would be consistent with treating PRO family level terms as equivalent to panther families, and also works for relating to subfamilies as well.
Gene orthology knowledge curation (i.e. from the Panther database) relates gene instances to gene families. It's relatively easy to guess which concept nodes need to be captured, i.e. `biolink:Gene` and `biolink:GeneFamily` (to start).

However, the `biolink:GeneFamily` concept currently seems a bit disconnected from any `biolink:Association` class.

Do we need to define a new `biolink:GeneToGeneFamilyAssociation` or, alternatively, would we simply connect `biolink:Gene` to their `biolink:GeneFamily` using the `biolink:has_attribute` slot? I guess it depends on the use cases and on the extent to which `biolink:GeneFamily` instances are annotated with links and related data.

As first class nodes, perhaps such annotation would be easily available in the knowledge graph. On the other hand, it may simply suffice to tag the `biolink:Gene` with the gene family identifier (e.g. PANTHER.FAMILY curie) then expect end users to access the related knowledge from outside of the graph (e.g. via a link in the UI?).

That said, reasoning over gene families likely involves transferring (GO term?) molecular function, biological process and cellular component inferences across species boundaries (e.g. from genes in model species to human genes). Graph reasoning engines (e.g. TRAPI wrapped?) might find this task easier if the `biolink:Gene` to `biolink:GeneFamily` relationship is modelled with first class knowledge graph nodes and edges (i.e. a `biolink:Association`). Also, `biolink:GeneFamily` instances may have hierarchical subclassing (i.e. subfamily) relationships to one another; thus, `biolink:GeneFamily` to `biolink:GeneFamily` instances of `biolink:Association` may also be posited.

One counterargument is simply that such edges may needlessly(?) clutter up the graph with as many additional edges as there are genes. That said, some knowledge graphs may have use cases supported by such knowledge representations.
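To illustrate why first class gene-to-family edges might help a reasoner, here is a minimal sketch that proposes annotations for a gene from the other members of its family. Everything in it is an assumption made for illustration: the `EX:` identifiers are made up, the edge dictionaries only loosely imitate a knowledge graph edge record, and `MEMBERSHIP_PREDICATE` is a placeholder because the choice of predicate is still open.

```python
from collections import defaultdict

# Placeholder predicate: which predicate to use is an open question in this issue.
MEMBERSHIP_PREDICATE = "biolink:related_to"

# Hypothetical gene -> gene family edges (all identifiers are invented).
edges = [
    {"subject": "EX:human_gene_1", "predicate": MEMBERSHIP_PREDICATE, "object": "EX:family_A"},
    {"subject": "EX:mouse_gene_1", "predicate": MEMBERSHIP_PREDICATE, "object": "EX:family_A"},
    {"subject": "EX:fly_gene_9", "predicate": MEMBERSHIP_PREDICATE, "object": "EX:family_B"},
]

# Hypothetical functional annotations, known for some genes only.
annotations = {
    "EX:mouse_gene_1": {"EX:go_term_x"},
    "EX:fly_gene_9": {"EX:go_term_y"},
}


def candidate_annotations(edges, annotations):
    """Propose, for each gene, the annotations held by other members of its families."""
    # Group genes by the family they are attached to.
    members = defaultdict(set)
    for edge in edges:
        if edge["predicate"] == MEMBERSHIP_PREDICATE:
            members[edge["object"]].add(edge["subject"])

    proposed = defaultdict(set)
    for genes in members.values():
        for gene in genes:
            inherited = set()
            for other in genes - {gene}:
                inherited |= annotations.get(other, set())
            if inherited:
                proposed[gene] |= inherited
    return dict(proposed)
```

With the toy data above, only `EX:human_gene_1` receives a candidate annotation, inherited from `EX:mouse_gene_1` through `EX:family_A`. If the family were only a tag on the gene node, the same lookup would need an index built outside the graph.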
Assuming the latter situation, `biolink:GeneFamily` instances will be documented as first class concept nodes, and perhaps we would add a new `biolink:GeneToGeneFamilyAssociation` class to assert set membership of genes into such families.

The next question to arise is which `biolink:predicate` should be used in such associations?

The `biolink:related_to` predicate seems a bit too general.

A `biolink:GeneFamily` could be construed as a kind of conceptual grouping of genes. This suggests that these are associations anchored on `biolink:related_to_at_concept_level`, or perhaps one of the child predicates, `biolink:narrow_match` or `biolink:subclass_of`, could be applied, but these predicates don't seem totally appropriate.

Within the `biolink:related_to_at_instance_level` predicate space, some terms could apply if their English language definition is loosely assumed, but the strict Biolink Model scoping of the definitions of most (all?) such terms seems to exclude them from consideration. For example, the definition of `biolink:part_of` says "...holds between parts and wholes (material entities or processes)...", but a gene family is not really a "whole" of a material entity or process.

Perhaps another more fruitful perspective is to imagine that `biolink:related_to_at_concept_level` is still an appropriate space within which a suitable predicate should be found, and that one major aspect of the "concept level" space is set theoretic in nature. For example, `biolink:narrow_match` or `biolink:subclass_of` define subsets of a conceptual space based on specific attributes.

However, in set theory, one also has the parallel concept of set membership. Perhaps what is needed in the Biolink Model are simple predicates for set membership. In fact, RO already has mappings to such terms. They are:

- `member_of` (RO:0002350)
- `has_member` (RO:0002351)

These do have the unusual characteristic of spanning the conceptual and instance spaces, in that a set is conceptual but can obviously aggregate instances. That said, adding them as child predicates under `biolink:related_to_at_concept_level` could be helpful to the current use case of predicates appropriate to model `biolink:Gene` to `biolink:GeneFamily` and `biolink:GeneFamily` to `biolink:GeneFamily` relationships (although the latter relationship could still perhaps be modelled as a `biolink:subclass_of` relationship?)

What working group (or team) did this request originate from?
The need for this change originates from the Monarch Initiative but is also likely needed for future iterations of the SRI Reference Graph (which is still essentially a derivative of the Monarch knowledge graph).
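For concreteness, the set-membership proposal above could be sketched as follows: genes are attached to families with a membership predicate, subfamilies are attached to families with `biolink:subclass_of`, and a query collects every gene in a family including its subfamilies. This is only a sketch under stated assumptions: `biolink:member_of` is the hypothetical predicate name proposed in this issue (mirroring RO:0002350), and all `EX:` identifiers are invented.

```python
# Assumed predicate names; "biolink:member_of" is the predicate proposed here.
MEMBER_OF = "biolink:member_of"
SUBCLASS_OF = "biolink:subclass_of"

# Hypothetical (subject, predicate, object) triples.
edges = [
    ("EX:gene_1", MEMBER_OF, "EX:subfamily_A1"),
    ("EX:gene_2", MEMBER_OF, "EX:subfamily_A2"),
    ("EX:gene_3", MEMBER_OF, "EX:family_A"),
    ("EX:subfamily_A1", SUBCLASS_OF, "EX:family_A"),
    ("EX:subfamily_A2", SUBCLASS_OF, "EX:family_A"),
    ("EX:gene_4", MEMBER_OF, "EX:family_B"),
]


def members_of(family, edges):
    """Return genes that are members of `family` or of any transitive subfamily."""
    # Walk subclass_of edges downwards to find every subfamily.
    families = {family}
    frontier = [family]
    while frontier:
        current = frontier.pop()
        for subject, predicate, obj in edges:
            if predicate == SUBCLASS_OF and obj == current and subject not in families:
                families.add(subject)
                frontier.append(subject)
    # Collect the genes attached to any of those families.
    return {s for s, p, o in edges if p == MEMBER_OF and o in families}
```

Here `members_of("EX:family_A", edges)` gathers `EX:gene_1`, `EX:gene_2` and `EX:gene_3`, while `EX:gene_4` stays with `EX:family_B`. The inverse `has_member` (RO:0002351) would answer the same question by following edges from the family side.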