Open vdancik opened 2 years ago
Following up on today's discussion "use case 3" of collections and enrichment, I maintain that this problem was solved long ago and ARAX implements exactly this with existing TRAPI 1.3 and no change is needed. Here's my example query:
{
"edges": {
"e0": {
"subject": "n0",
"object": "n1"
}
},
"nodes": {
"n0": {
"ids": [
"UniProtKB:Q9BXW9",
"UniProtKB:Q9NW38",
"UniProtKB:Q9NPD8",
"UniProtKB:Q9NVI1",
"UniProtKB:Q9UI95",
"UniProtKB:O15360"
],
"is_set": true,
"categories": [
"biolink:Protein"
]
},
"n1": {
"is_set": false,
"categories": [
"biolink:Disease"
]
}
}
}
Notably, is_set = true indicates that the list of ids should be treated as a group.
And here's the ARAX result for this query: https://arax.ncats.io/?r=64606
Each result is a disease that is highly connected to that list of proteins (not necessarily all). A higher fraction of that set causes results to bubble to the top, and more edges also cause higher ranking.
The set/collection for the query is defined by the QNode.ids list and QNode.is_set=true. The set/collection for the results is defined by the bindings in each Result between KG Nodes and the relevant QNode.
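To make the ranking idea concrete, here is a minimal sketch, not ARAX's actual implementation. It assumes edges are available as simple (subject, object) pairs and ranks each candidate first by the fraction of the query set it connects to, then by its number of edges. The function name and the EXAMPLE: disease identifiers are hypothetical placeholders.

```python
from collections import defaultdict


def rank_by_set_coverage(query_ids, edges):
    """Rank objects by the fraction of query_ids connected to them, then by edge count.

    edges: iterable of (subject_id, object_id) pairs (a simplified stand-in for KG edges).
    """
    query_ids = set(query_ids)
    members = defaultdict(set)      # object -> query members connected to it
    edge_counts = defaultdict(int)  # object -> number of edges from the query set
    for subject, obj in edges:
        if subject in query_ids:
            members[obj].add(subject)
            edge_counts[obj] += 1
    ranked = sorted(
        members,
        key=lambda d: (len(members[d]) / len(query_ids), edge_counts[d]),
        reverse=True,
    )
    return [(d, sorted(members[d]), edge_counts[d]) for d in ranked]


proteins = ["UniProtKB:Q9BXW9", "UniProtKB:Q9NW38", "UniProtKB:O15360"]
# Hypothetical edges; the disease identifiers are placeholders, not real results.
edges = [
    ("UniProtKB:Q9BXW9", "EXAMPLE:disease_A"),
    ("UniProtKB:Q9NW38", "EXAMPLE:disease_A"),
    ("UniProtKB:O15360", "EXAMPLE:disease_A"),
    ("UniProtKB:Q9BXW9", "EXAMPLE:disease_B"),
    ("UniProtKB:Q9BXW9", "EXAMPLE:disease_B"),  # a second edge to the same disease
    ("UniProtKB:Q9NW38", "EXAMPLE:disease_C"),
]
for disease, bound, n_edges in rank_by_set_coverage(proteins, edges):
    print(disease, f"{len(bound)}/{len(proteins)} proteins", f"{n_edges} edges")
```

In this toy input, disease_A covers the whole set and ranks first; disease_B and disease_C each cover one protein, and the extra edge puts disease_B ahead.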
I think this is simple and logical and does everything we need.
After further thought, I think I agree with @edeutsch here. Originally I was thinking there were two use cases that should be handled separately -- for results merging and for enrichment-based associations. But the query behavior for both is the same, and the enrichment score can be reflected in the results scoring. So I'm on board with is_set already handling the use cases as I see them...
is_set might be the answer, I agree. But I'm a little unsure how it works. I understand the example that @edeutsch posted above, but I don't really understand the behavior for something like this (@andrewsu is this what you meant by result merging?):
{
"edges": {
"e0": {
"subject": "n0",
"object": "n1"
},
"e1": { ...}
},
"nodes": {
"n0": {
"is_set": false,
"categories": [
"biolink:Disease"
],
"ids": ["MONDO:1234"]
},
"n1": {
"is_set": true,
"categories": [
"biolink:Protein"
]
},
"n2": {
"is_set": false,
"categories": [
"biolink:Disease"
]
}
}
}
This is also valid, but a different use case than we were discussing. In this case, each Result is MONDO:1234 at one end and a disease at the other end, and then a set/collection of proteins that they share in common between them in the middle. Ranking should be something like the results with the most shared proteins appear highest, although there is plenty of room for improvements on the ranking that could take things like the quality of the edges, NGD between the two diseases, etc. into account as well.
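A rough sketch of that grouping, under the assumption that disease-protein edges are available as plain pairs: for the pinned disease, collect the proteins it shares with every other disease and order the results by the size of the shared set. The function name and the EXAMPLE: identifiers are hypothetical, and a real ranker could also weigh edge quality as noted above.

```python
def shared_proteins(anchor, disease_protein_edges):
    """Return (other_disease, shared_proteins) pairs, most shared proteins first.

    disease_protein_edges: iterable of (disease_id, protein_id) pairs.
    """
    proteins_of = {}
    for disease, protein in disease_protein_edges:
        proteins_of.setdefault(disease, set()).add(protein)
    anchor_set = proteins_of.get(anchor, set())
    results = []
    for disease, proteins in proteins_of.items():
        if disease == anchor:
            continue
        shared = anchor_set & proteins  # proteins attached to both ends
        if shared:
            results.append((disease, sorted(shared)))
    results.sort(key=lambda r: len(r[1]), reverse=True)
    return results


# Hypothetical edges for illustration only.
example_edges = [
    ("MONDO:1234", "EXAMPLE:P1"),
    ("MONDO:1234", "EXAMPLE:P2"),
    ("MONDO:1234", "EXAMPLE:P3"),
    ("EXAMPLE:disease_X", "EXAMPLE:P1"),
    ("EXAMPLE:disease_X", "EXAMPLE:P2"),
    ("EXAMPLE:disease_Y", "EXAMPLE:P3"),
    ("EXAMPLE:disease_Y", "EXAMPLE:P4"),
    ("EXAMPLE:disease_Z", "EXAMPLE:P5"),
]
for disease, proteins in shared_proteins("MONDO:1234", example_edges):
    print(disease, proteins)
```

Here disease_Z shares nothing with MONDO:1234 and so produces no Result, while P4 is left out of the disease_Y result because it is not attached to both ends.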
So in the case that I put above every element of n1 in an answer must be attached to both n0 and n2?
In the ARAX implementation currently, yes. I suppose there might be an opportunity for different implementations to include only partially connected nodes, although I wouldn't recommend it. Seems related to the whole "can you return partial paths" discussion, which I'm not certain we ever really resolved.
So it seems like there is different behavior for the same construct? If it's a bound node then I do enrichment, but if it's an unbound node then I don't?
I don't think the behavior needs to be any different whether it is bound or unbound. I suppose it might be, as a refinement decided by the implementer, but I think it would normally be the same.
Sorry, I might be missing something, but is it up to the server to decide how to implement is_set? It might mean the fully connected, or it might mean partially connected, and that partially connected might mean enrichment or max connectedness, or other versions?
Until we decide that everyone has to do things the same way, I suppose we're all free to do things a bit differently. Aragorn is doing a whole lot of things differently than ARAX. Our current definition for is_set is this:
is_set:
type: boolean
description: >-
Boolean that if set to true, indicates that this QNode MAY have
multiple KnowledgeGraph Nodes bound to it within each Result.
The nodes in a set should be considered as a set of independent
nodes, rather than a set of dependent nodes, i.e., the answer
would still be valid if the nodes in the set were instead returned
individually. Multiple QNodes may have is_set=True. If a QNode
(n1) with is_set=True is connected to a QNode (n2) with
is_set=False, each n1 must be connected to n2. If a QNode (n1)
with is_set=True is connected to a QNode (n2) with is_set=True,
each n1 must be connected to at least one n2.
So a strict reading means to me that partial connectedness is not permitted (contrary to what I supposed above).
It stipulates nothing about how ranking should be done, and I'm sure there is diversity in ideas on how ranking is best done in cases like this, from enrichment to max connectedness. So until we stipulate how it must be done, there can be diversity.
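For what it's worth, the strict reading of the connectivity rule can be written as a small check. This is only a sketch of my interpretation of the description text, not code from any existing implementation: it assumes the edges bound in one Result are given as (subject, object) pairs and treats them as undirected. The function name and identifiers are hypothetical.

```python
def satisfies_is_set_rule(n1_ids, n2_ids, n2_is_set, result_edges):
    """Check one Result against the connectivity rule in the is_set description.

    n1_ids: KG nodes bound to a QNode with is_set=True.
    n2_ids: KG nodes bound to the neighboring QNode.
    n2_is_set: the neighbor's is_set flag.
    result_edges: iterable of (subject, object) pairs, treated as undirected.
    """
    linked = {frozenset(edge) for edge in result_edges}

    def connected(a, b):
        return frozenset((a, b)) in linked

    if n2_is_set:
        # Each n1 must be connected to at least one n2.
        return all(any(connected(a, b) for b in n2_ids) for a in n1_ids)
    # Each n1 must be connected to n2 (every bound n2 node).
    return all(connected(a, b) for a in n1_ids for b in n2_ids)


# Hypothetical bindings for illustration.
rule_edges = [("EXAMPLE:P1", "EXAMPLE:D1"), ("EXAMPLE:P2", "EXAMPLE:D1")]
print(satisfies_is_set_rule(["EXAMPLE:P1", "EXAMPLE:P2"], ["EXAMPLE:D1"], False, rule_edges))
print(satisfies_is_set_rule(["EXAMPLE:P1", "EXAMPLE:P3"], ["EXAMPLE:D1"], False, rule_edges))
```

Under this check a Result containing a partially connected member (P3 in the second call) is rejected, which matches the strict reading above.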
Example of a KG with a gene set:
{
"knowledge_graph": {
"nodes": {
"NCBIGene:10000": {
"categories": [
"biolink:Gene"
],
"name": "AKT3"
},
"NCBIGene:10097": {
"categories": [
"biolink:Gene"
],
"name": "ACTR2"
},
"NCBIGene:10111": {
"categories": [
"biolink:Gene"
],
"name": "RAD50"
},
"UUID:c5d67629-ce16-41e9-8b35-e4acee04ed1f": {
"categories": [
"biolink:Gene"
],
"name": "AKT3,ACTR2,RAD50",
"is_set": true
},
"MSigDB:HALLMARK_GLYCOLYSIS": {
"categories": [
"biolink:Pathway"
],
"name": "HALLMARK_GLYCOLYSIS"
}
},
"edges": {
"e0-fBufztAzDx": {
"subject": "NCBIGene:10000",
"predicate": "biolink:member_of",
"object": "UUID:c5d67629-ce16-41e9-8b35-e4acee04ed1f",
"sources": [
{
"resource_id": "infores:gelinea",
"resource_role": "primary_knowledge_source"
},
{
"resource_id": "infores:molepro",
"resource_role": "aggregator_knowledge_source",
"upstream_resource_ids": [
"infores:gelinea"
]
}
]
},
"e1-fBufztAzDx": {
"subject": "NCBIGene:10097",
"predicate": "biolink:member_of",
"object": "UUID:c5d67629-ce16-41e9-8b35-e4acee04ed1f",
"sources": [
{
"resource_id": "infores:gelinea",
"resource_role": "primary_knowledge_source"
},
{
"resource_id": "infores:molepro",
"resource_role": "aggregator_knowledge_source",
"upstream_resource_ids": [
"infores:gelinea"
]
}
]
},
"e2-fBufztAzDx": {
"subject": "NCBIGene:10111",
"predicate": "biolink:member_of",
"object": "UUID:c5d67629-ce16-41e9-8b35-e4acee04ed1f",
"sources": [
{
"resource_id": "infores:gelinea",
"resource_role": "primary_knowledge_source"
},
{
"resource_id": "infores:molepro",
"resource_role": "aggregator_knowledge_source",
"upstream_resource_ids": [
"infores:gelinea"
]
}
]
},
"e3-fBufztAzDx": {
"subject": "UUID:c5d67629-ce16-41e9-8b35-e4acee04ed1f",
"predicate": "biolink:enriched_in",
"object": "MSigDB:HALLMARK_GLYCOLYSIS",
"sources": [
{
"resource_id": "infores:gelinea",
"resource_role": "primary_knowledge_source"
},
{
"resource_id": "infores:molepro",
"resource_role": "aggregator_knowledge_source",
"upstream_resource_ids": [
"infores:gelinea"
]
}
]
}
}
}
}
Here's my graphical representation of what I think the proposal is. Is this right @vdancik ?
Example query with an is_set flag:
{
"message": {
"query_graph": {
"nodes": {
"pathway": {
"categories": [
"biolink:Pathway"
]
},
"gene": {
"ids": [
"NCBIGene:10000",
"NCBIGene:10097",
"NCBIGene:10111"
],
"is_set": true
}
},
"edges": {
"t_edge": {
"object": "gene",
"subject": "pathway",
"predicates": [
"biolink:related_to"
],
"knowledge_type":"inferred"
}
}
}
}
}
would produce the following result:
{
"results": [
{
"analyses": [
{
"edge_bindings": {
"t_edge": [
{
"id": "e3-fBufztAzDx"
}
]
},
"resource_id": "infores:gelinea",
"support_graphs": [
"gene_set_aux_graph"
]
}
],
"node_bindings": {
"gene": [
{
"id": "UUID:c5d67629-ce16-41e9-8b35-e4acee04ed1f",
"is_set": true
},
{
"id": "NCBIGene:10000",
"query_id": "NCBIGene:10000"
},
{
"id": "NCBIGene:10097",
"query_id": "NCBIGene:10097"
},
{
"id": "NCBIGene:10111",
"query_id": "NCBIGene:10111"
}
],
"pathway": [
{
"id": "MSigDB:HALLMARK_GLYCOLYSIS"
}
]
}
}
]
}
where the auxiliary graph is:
{
"auxiliary_graphs": {
"gene_set_aux_graph": {
"edges": [
"e0-fBufztAzDx",
"e1-fBufztAzDx",
"e2-fBufztAzDx"
]
}
}
}
and the KG is in my previous comment.
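As a rough illustration of how a client might consume this structure, the sketch below resolves the members of the set node by following the biolink:member_of edges listed in the auxiliary graph. The edge and graph keys are copied from the examples above with the sources omitted for brevity; the helper name set_members is hypothetical.

```python
def set_members(set_node_id, aux_graph_id, auxiliary_graphs, kg_edges):
    """Return the subjects of member_of edges in the aux graph that point at the set node."""
    members = []
    for edge_id in auxiliary_graphs[aux_graph_id]["edges"]:
        edge = kg_edges[edge_id]
        if edge["predicate"] == "biolink:member_of" and edge["object"] == set_node_id:
            members.append(edge["subject"])
    return members


SET_ID = "UUID:c5d67629-ce16-41e9-8b35-e4acee04ed1f"
# KG edges abbreviated from the example KG (sources omitted).
kg_edges = {
    "e0-fBufztAzDx": {"subject": "NCBIGene:10000", "predicate": "biolink:member_of", "object": SET_ID},
    "e1-fBufztAzDx": {"subject": "NCBIGene:10097", "predicate": "biolink:member_of", "object": SET_ID},
    "e2-fBufztAzDx": {"subject": "NCBIGene:10111", "predicate": "biolink:member_of", "object": SET_ID},
    "e3-fBufztAzDx": {"subject": SET_ID, "predicate": "biolink:enriched_in", "object": "MSigDB:HALLMARK_GLYCOLYSIS"},
}
auxiliary_graphs = {
    "gene_set_aux_graph": {
        "edges": ["e0-fBufztAzDx", "e1-fBufztAzDx", "e2-fBufztAzDx"]
    }
}
print(set_members(SET_ID, "gene_set_aux_graph", auxiliary_graphs, kg_edges))
```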
So here is a slight update to the picture based on today's discussion. The query predicate is updated. And I depicted Result #1 as one that contains all 5 input genes, but Result #2 is the next best match where 3 of the 5 match.
There was some discussion of whether this means AND, OR, or a "soft AND", i.e. "as many as possible". I am thinking that the is_set=true construction is interpreted to mean "as many of the set as possible". More members would mean a higher rank, but results that don't contain all members are not automatically discarded. But maybe this is not the desired outcome.
Additional note: In this scenario, the Query must have knowledge_type: inferred (i.e. "creative mode")
How is this different from the sort of thing that COHD already does?
We should probably document why this isn't good enough:
Can we capture all the enrichment statistical metrics in each Result.Analysis.attributes[]?
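One way this could look, purely as a sketch: compute an enrichment p-value and pack it alongside the supporting counts as attribute objects. The choice of a one-sided hypergeometric test is my assumption (the thread does not say which statistics would be reported), the EXAMPLE: attribute_type_id values are hypothetical placeholders rather than agreed Biolink terms, and the numbers are made up.

```python
from math import comb


def hypergeom_pvalue(overlap, query_size, pathway_size, background_size):
    """P(X >= overlap) when query_size genes are drawn from background_size genes,
    of which pathway_size belong to the pathway."""
    upper = min(query_size, pathway_size)
    total = comb(background_size, query_size)
    tail = sum(
        comb(pathway_size, k) * comb(background_size - pathway_size, query_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


def enrichment_attributes(overlap, query_size, pathway_size, background_size):
    """Build a list of attribute dicts; the type ids are placeholders, not real terms."""
    p_value = hypergeom_pvalue(overlap, query_size, pathway_size, background_size)
    return [
        {"attribute_type_id": "EXAMPLE:p_value", "value": p_value},
        {"attribute_type_id": "EXAMPLE:overlap_count", "value": overlap},
        {"attribute_type_id": "EXAMPLE:query_set_size", "value": query_size},
        {"attribute_type_id": "EXAMPLE:pathway_size", "value": pathway_size},
    ]


# Made-up numbers for illustration only.
for attribute in enrichment_attributes(overlap=3, query_size=3, pathway_size=3, background_size=10):
    print(attribute)
```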
The query predicate "related_to" is tripping us up here. Better to consider a query predicate like "enriched_in" (*does not actually exist yet). Or "participates_in"?
I have added an “alternate Support Graph” item #5 with the link to the agenda for today’s meeting. Not sure if this was what you wanted, so please let me know if we need other actions.
We should add support for collections in TRAPI by adding a Boolean property is_set to KnowledgeGraph.Node and QueryGraph.QNode to indicate that a node represents a collection of entities rather than a single entity. Since there already is an is_set in QueryGraph.QNode with a somewhat confusing meaning, we should also add collate to QueryGraph.QNode to indicate that nodes in results should be grouped.