Entity sets - Githubissues

patrickkwang commented 3 years ago

We want to represent sets of entities (e.g. gene sets) in the Biolink Model. This is motivated by https://github.com/NCATS-Tangerine/NCATS-ReasonerStdAPI/issues/57. This may be related to #348.

We have two ideas for how this could be done.

- An attribute is_set that can be applied to any biological entity
- An extra X set class for each biological entity X (or some subset of them)

@edeutsch @vdancik @cmungall

kshefchek commented 3 years ago

If the sequence set has an identifier, we could model it as a class with the mapping: sequence feature set

GENO has allele sets but not gene sets, but I assume this could be added.

It's possibly a misuse of the class, since sequence feature set is typically in relation to some genome(s), whereas I'm guessing this is in relation to some analysis that generates a set of genes.

Gene set would also be useful for gene families (e.g. from Panther). EDIT: never mind, gene family is already in the model.

cbizon commented 3 years ago

@patrickkwang how is the ReasonerAPI planning to use this? As far as I can imagine, this would be used only for query_graphs. Is that correct? Or is the idea that a single node in a knowledge_graph or answer would represent a set of e.g. genes?

patrickkwang commented 3 years ago

With is_set I think this will only be used in the query graph. With the X set type, I'm not sure. Maybe @vdancik has a better feel for how the answers would be represented in that case?

edeutsch commented 3 years ago

One concern is if BioLink entities must have identifiers? Arbitrary sets of proteins won't have identifiers. Is that a problem? Sure, you could consider a pathway as a set of proteins and that would have an identifier. But the way we're considering it, we are talking about sets of things (e.g. proteins or drugs or diseases) that wouldn't have an inherent identifier. It's just a set. Is that an obstacle to making them a BioLink class?

vdancik commented 3 years ago

I can see entity sets used in both query and response graphs. Actually, if query graph specifies it then result graph must have it as well, right?

edeutsch commented 3 years ago

In our usage, is_set is only used in the QueryGraph QNodes. In the KnowledgeGraph, the Nodes are all individual (because one Node may belong to multiple sets). The sets are denoted by which Nodes are grouped together in the NodeBindings within each Result. Within one Result, multiple Nodes are bound via NodeBindings to a single QNode that has is_set=true. A different Result may bind to some of the same Nodes as other Results. Therefore is_set=true is only needed on the QNode, as a hint to the reasoner that multiple Nodes should be bound to it in each Result.
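
As a rough illustration of the binding pattern described above, the sketch below uses plain Python dictionaries. The field names (`nodes`, `node_bindings`, `is_set`), the helper `set_members`, and all identifiers are assumptions made for illustration; they only loosely follow the Reasoner API message layout and are not taken from any particular version of the spec.

```python
# Hypothetical, simplified message layout; all identifiers are placeholders.
query_graph = {
    "nodes": {
        "n0": {"id": "DISEASE:1"},
        "n1": {"type": "protein", "is_set": True},  # hint: bind many Nodes here
        "n2": {"type": "chemical_substance"},
    }
}

# KnowledgeGraph nodes stay individual; none of them is a "set" node.
knowledge_graph_nodes = ["DISEASE:1", "PROT:A", "PROT:B", "PROT:C", "CHEM:X", "CHEM:Y"]

# Each Result binds several knowledge-graph Nodes to the is_set QNode "n1".
results = [
    {"node_bindings": {"n0": ["DISEASE:1"], "n1": ["PROT:A", "PROT:B"], "n2": ["CHEM:X"]}},
    {"node_bindings": {"n0": ["DISEASE:1"], "n1": ["PROT:B", "PROT:C"], "n2": ["CHEM:Y"]}},
]


def set_members(result, qnode_id):
    """Return the knowledge-graph Nodes bound to one QNode in one Result."""
    return set(result["node_bindings"][qnode_id])


# The same Node can take part in the sets of different Results.
shared = set_members(results[0], "n1") & set_members(results[1], "n1")
```

Here the set membership lives entirely in the bindings of each Result, so the same protein can appear in two different sets without any set node existing in the knowledge graph.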

patrickkwang commented 3 years ago

@vdancik, how would the set nodes be used in the results? Would there have to be a CURIE bound to the set node? How would we denote which genes were part of the set? Would those be bound to something in the query graph?

cbizon commented 3 years ago

@patrickkwang maybe you could put up an example of how we used sets in robokop?

edeutsch commented 3 years ago

FWIW, here's an example of how ARAX is using it (https://arax.rtx.ai/beta/?m=2648): a query that is basically (DOID:14330)---(protein,is_set=true)---(chemical_substance). This means that we're looking for chemical_substances that share sets of proteins with Parkinson Disease.

cbizon commented 3 years ago

I clicked the link but maybe I'm doing something weird b/c I don't see any query or answers?

In robokop I think we're using it similarly though. The important point in my mind is that the set is only part of the question, as an annotation. It doesn't show up in either the knowledge_graph or results components of the message. I think that means that it doesn't require much from the biolink model.

ehinderer commented 3 years ago

@cbizon You have to click on "messages" on the left side to see the query details.

@edeutsch Does this query imply that every chemical substance returned must target >1 protein associated with Parkinson's disease?

edeutsch commented 3 years ago

not necessarily, you can have a set of one protein. But sets with more members are ranked higher in our algorithm.

ehinderer commented 3 years ago

This is the form I would expect if I made a generic query for (disease)-->(protein)-->(chemical); that is, I would expect the ARA to find the set of proteins associated with the disease, and then for each protein, find the set of chemicals that target that protein and then return the graph interlinking everything.

It's nice to have additional constraints and values for ranking, but that is the basic expectation I would have for the system. I would even go so far as to say that is_set = True should be the default setting for queries without specific curies. I can't think of a reason why a user would only want one result returned from a query if more were available. But perhaps I'm misunderstanding something.

patrickkwang commented 3 years ago

The way that Robokop does this, you get all of the data regardless of is_set. That is, is_set does not impact the knowledge_graph; it only impacts how everything is bound in results. If you were to set is_set=True everywhere, you would get just one result binding everything, whereas where is_set=False, a query node may have only one binding per result. Either way, you still get a list of results encompassing everything in the knowledge graph.
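
A minimal sketch of that binding behaviour, assuming each answer path is a mapping from query node to knowledge-graph node. The function `bind_results` and the sample identifiers are hypothetical and are not Robokop's actual implementation; the point is only that is_set changes how paths are grouped into results, not which data is returned.

```python
from collections import defaultdict


def bind_results(paths, set_qnodes):
    """Group knowledge-graph paths into results.

    paths: list of dicts mapping query-node id -> knowledge-graph node id.
    set_qnodes: query-node ids flagged with is_set=True.
    Paths that agree on every non-set query node are merged into one result.
    """
    grouped = defaultdict(lambda: defaultdict(set))
    for path in paths:
        # The grouping key ignores query nodes that are allowed to hold sets.
        key = tuple(sorted((q, n) for q, n in path.items() if q not in set_qnodes))
        for qnode, node in path.items():
            grouped[key][qnode].add(node)
    return [
        {qnode: sorted(nodes) for qnode, nodes in bindings.items()}
        for bindings in grouped.values()
    ]


paths = [
    {"disease": "D1", "protein": "P1", "chemical": "C1"},
    {"disease": "D1", "protein": "P2", "chemical": "C1"},
    {"disease": "D1", "protein": "P2", "chemical": "C2"},
]

per_path = bind_results(paths, set_qnodes=set())           # one result per path
protein_sets = bind_results(paths, set_qnodes={"protein"})  # one result per chemical
everything = bind_results(paths, set_qnodes={"disease", "protein", "chemical"})
```

With no set nodes, every path becomes its own result; with only the protein node flagged, each chemical gets one result carrying its set of proteins; with every node flagged, a single result binds everything.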

edeutsch commented 3 years ago

and stated another way, if protein has is_set=false, then the results are all possible paths through the knowledge_graph, rather than the non-redundant set of chemical_substances with the sets of proteins that link them.

It is true that the rationale for wanting is_set=false in this context seems weak, so the default assumption is probably that the user would want protein is_set=true. Can the system just always assume it and we don't need to specify it? maybe..

nlharris commented 2 years ago

What is the status of this?

cmungall commented 2 years ago

I think this can be closed? Either what you have is a biological entity like a pathway or phenotype which is linked to a set of genes, in which case you are covered, or you literally want to represent a list of Xs, which is a feature of the representation language, e.g. JSON.

On Thu, Aug 26, 2021, 15:15 Nomi Harris @.***> wrote:

What is the status of this?

vdancik commented 2 years ago

@nlharris, @cmungall, I don't think this question was addressed. In general, we have no means to assign a Biolink class to a node representing a collection of entities; a set of genes represented by a pathway is the only exception.

One possibility would be to borrow from java and use [] after Biolink class e.g. "biolink.MolecularEntity[]" to indicate that a node represents a set.

cmungall commented 2 years ago

> One possibility would be to borrow from java and use [] after Biolink class e.g. "biolink.MolecularEntity[]" to indicate that a node represents a set.

Not clear exactly what you are suggesting and where this string would be manifested. In trapi? in kgx?

But I think this is kind of what I was getting at with:

> you literally want to represent a list of Xs, which is a feature of the representation language eg json

the job of biolink is to represent the main biological types, not abstractions like sets, ordered lists, dicts, hashtables, etc.

this doesn't constrain you from talking about lists of molecular entities, it's just not the concern of the model. You can exchange a list of molecular entities in json with a json list.

however, if you have a use case where the list is a first-class entity in itself, needs an identifier, and is passed around in trapi messages, then we could explore adding abstractions like collections into the model

collection
- set
- bag/multiset
- list
  - simple list
  - list of lists
  - list of magnitude-entity tuples (e.g. for representing input to GSEA)
- dict
- ...

The collection class would have a slot member_type that would point to a category, allowing lists at varying levels of granularity. E.g. collections that are mixtures of molecular entities, diseases, etc; more restricted collections that only consist of genes, etc
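
To make the member_type idea concrete, here is a small sketch in Python. Everything in it is hypothetical: the `Collection` dataclass, the `is_a` helper, the `EX:` identifiers, and the toy `PARENTS` map, which stands in for the category hierarchy and does not reproduce the real Biolink Model.

```python
from dataclasses import dataclass, field

# Toy parent map standing in for a category hierarchy (illustrative only).
PARENTS = {
    "gene": "molecular entity",
    "molecular entity": "named thing",
    "disease": "named thing",
}


def is_a(category, ancestor):
    """True if `category` equals `ancestor` or descends from it in PARENTS."""
    while category is not None:
        if category == ancestor:
            return True
        category = PARENTS.get(category)
    return False


@dataclass
class Collection:
    id: str
    member_type: str  # category that every member must belong to
    members: list = field(default_factory=list)  # (identifier, category) pairs

    def add(self, identifier, category):
        """Append a member, rejecting it if its category is too broad."""
        if not is_a(category, self.member_type):
            raise ValueError(f"{identifier} is a {category}, not a {self.member_type}")
        self.members.append((identifier, category))


# A restricted collection that only accepts genes.
gene_list = Collection(id="EX:genes1", member_type="gene")
gene_list.add("EX:geneA", "gene")

# A coarser collection that may mix genes and diseases.
mixed = Collection(id="EX:mixed1", member_type="named thing")
mixed.add("EX:geneA", "gene")
mixed.add("EX:diseaseB", "disease")
```

Choosing a broad member_type gives a mixed collection, while a narrow one gives a gene-only list; adding a disease to `gene_list` raises a `ValueError` in this sketch.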

I am not sure this is a good idea though. For Translator, would ARAs know how to operate on these? Would we expect the generic operations to do the right thing?

Another approach would be to create a mixin hierarchy shadowing the main hierarchy

collection of entities
- collection of molecular entities
  - ...

Existing classes could mix these in. E.g. a pathway has the trait of being a collection of genes; and also a collection of molecular entities (essentially all the things that can be the object of a predicate where the subject is a pathway); a gene also would have the trait of being a collection of genes (e.g. the set of genes that interact)

If you truly have a node that has edges to genes but can't say what kind of thing that node is then you could just directly type with 'collection of genes', and have a generic related_to linking to the genes

vdancik commented 2 years ago

I agree that the goal is to represent main biological types and there is no need to include a collection hierarchy in the model. What I am proposing would be a convention when using the Biolink model rather than an addition to or change of the model. I envision using that in both TRAPI and KGX.

saramsey commented 2 years ago

I think it is important to maintain alignment between TRAPI JSON and other KG representations, such as KGX TSV. I think it would be unfortunate if there were type semantics introduced into TRAPI that could not be represented in KGX TSV (not claiming that this is the intention here).

As I understand the KGX TSV format spec, currently, the category of a "node" in KGX TSV is a pipe-delimited (|) list of CURIEs, each of which must be a Biolink descendant of NamedThing, right?

As I understand the TRAPI spec, currently, the categories field of the Node class in TRAPI has a type annotation of [BiolinkEntity], where BiolinkEntity is a regex that doesn't allow square brackets and is supposed to contain NamedThing descendant types.

So, I think it is not enough to just extend the type of the categories field of the Node class in TRAPI, either using the existing TRAPI types or an extension of the current TRAPI types. We would also need to extend the spec for KGX TSV somehow, because currently, it says that category is a list of NamedThings, right?
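
The mismatch can be sketched in a few lines of Python. The regular expression below is an assumed stand-in for the BiolinkEntity pattern (a `biolink:` prefix followed by a CamelCase class name), and the helper names are hypothetical; the example also writes the bracketed form as a `biolink:` CURIE rather than the dotted form quoted earlier in the thread.

```python
import re

# Assumed pattern for illustration: "biolink:" followed by a CamelCase class name.
BIOLINK_ENTITY = re.compile(r"^biolink:[A-Z][a-zA-Z]*$")


def parse_kgx_category(cell):
    """Split a pipe-delimited KGX TSV category cell into individual CURIEs."""
    return [c for c in cell.split("|") if c]


def invalid_categories(categories):
    """Return the categories that do not match the assumed pattern."""
    return [c for c in categories if not BIOLINK_ENTITY.match(c)]


plain = parse_kgx_category("biolink:Gene|biolink:MolecularEntity")
bracketed = parse_kgx_category("biolink:MolecularEntity[]")

plain_errors = invalid_categories(plain)          # nothing rejected
bracketed_errors = invalid_categories(bracketed)  # the "[]" suffix is rejected
```

Under this assumed pattern, a bracketed category would fail validation on the TRAPI side unless the pattern were relaxed, and the KGX TSV side would need a matching rule for what a bracketed entry in the category list means.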

Hope that makes sense.

nlharris commented 1 year ago

Is this still in progress?

biolink / biolink-model

Entity sets #385