NCATSTranslator / ReasonerAPI

NCATS Biomedical Translator Reasoners Standard API

add support of collections #373

Open vdancik opened 2 years ago

vdancik commented 2 years ago

We should add support for collections in TRAPI by adding a Boolean property is_set to KnowledgeGraph.Node and QueryGraph.QNode to indicate that a node represents a collection of entities rather than a single entity.

Since QueryGraph.QNode already has an is_set property with a somewhat confusing meaning, we should also add collate to QueryGraph.QNode to indicate that nodes in results should be grouped.

edeutsch commented 2 years ago

Following up on today's discussion "use case 3" of collections and enrichment, I maintain that this problem was solved long ago and ARAX implements exactly this with existing TRAPI 1.3 and no change is needed. Here's my example query:

{
  "edges": {
    "e0": {
      "subject": "n0",
      "object": "n1"
    }
  },
  "nodes": {
    "n0": {
      "ids": [
        "UniProtKB:Q9BXW9",
        "UniProtKB:Q9NW38",
        "UniProtKB:Q9NPD8",
        "UniProtKB:Q9NVI1",
        "UniProtKB:Q9UI95",
        "UniProtKB:O15360"
      ],
      "is_set": true,
      "categories": [
        "biolink:Protein"
      ]
    },
    "n1": {
      "is_set": false,
      "categories": [
        "biolink:Disease"
      ]
    }
  }
}

Notably, is_set = true indicates that the list of ids should be treated as a group.

And here's the ARAX result for this query: https://arax.ncats.io/?r=64606

Each result is a disease that is highly connected to that list of proteins (not necessarily all of them). Results that connect to a higher fraction of the set bubble to the top, and more edges also lead to a higher ranking.

The set/collection for the query is defined by the QNode.ids list and QNode.is_set=true. The set/collection for the results is defined by the bindings in each Result between KG Nodes and the relevant QNode.

I think this is simple and logical and does everything we need.
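
The ranking behavior described above (a higher fraction of the set first, then more edges) can be sketched as follows. This is only an illustration under stated assumptions, not ARAX's actual ranking code: the function name rank_by_set_coverage, the input shape, and the EX: disease identifiers are all hypothetical.

```python
from typing import Dict, List, Set, Tuple


def rank_by_set_coverage(
    query_ids: Set[str],
    candidates: Dict[str, List[Tuple[str, str]]],
) -> List[Tuple[str, float, int]]:
    """Rank candidates by the fraction of the query set they reach, then by edge count.

    `candidates` maps a candidate node ID to (query_member_id, edge_id) pairs.
    This input shape is an assumption made for the sketch.
    """
    ranked = []
    for candidate_id, links in candidates.items():
        # Only links that touch a member of the query set count.
        relevant = [(member, edge) for member, edge in links if member in query_ids]
        members = {member for member, _ in relevant}
        coverage = len(members) / len(query_ids) if query_ids else 0.0
        ranked.append((candidate_id, coverage, len(relevant)))
    # Highest coverage first, then most edges, then ID for a stable order.
    ranked.sort(key=lambda row: (-row[1], -row[2], row[0]))
    return ranked


proteins = {"UniProtKB:Q9BXW9", "UniProtKB:Q9NW38", "UniProtKB:O15360"}
candidates = {
    # Hypothetical diseases and edge IDs, for illustration only.
    "EX:disease_a": [("UniProtKB:Q9BXW9", "e1"), ("UniProtKB:Q9NW38", "e2")],
    "EX:disease_b": [
        ("UniProtKB:Q9BXW9", "e3"),
        ("UniProtKB:Q9NW38", "e4"),
        ("UniProtKB:O15360", "e5"),
    ],
    "EX:disease_c": [
        ("UniProtKB:Q9BXW9", "e6"),
        ("UniProtKB:Q9BXW9", "e7"),
        ("UniProtKB:Q9NW38", "e8"),
    ],
}
ranking = rank_by_set_coverage(proteins, candidates)
```

Here EX:disease_b ranks first because it reaches all three proteins; EX:disease_c and EX:disease_a both reach two of three, and the extra edge breaks the tie in favor of EX:disease_c.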

andrewsu commented 2 years ago

After further thought, I think I agree with @edeutsch here. Originally I was thinking there were two use cases that should be handled separately -- for results merging and for enrichment-based associations. But the query behavior for both is the same, and the enrichment score can be reflected in the results scoring. So I'm on board with is_set already handling the use cases as I see them...

cbizon commented 2 years ago

is_set might be the answer, I agree. But I'm a little unsure how it works. I understand the example that @edeutsch posted above, but I don't really understand the behavior for something like this (@andrewsu is this what you meant by result merging?):

{
  "edges": {
    "e0": {
      "subject": "n0",
      "object": "n1"
    },
    "e1": { ...}
  },
  "nodes": {
    "n0": {
      "is_set": false,
      "categories": [
        "biolink:Disease"
      ],
      "ids": ["MONDO:1234"]
    },
    "n1": {
      "is_set": true,
      "categories": [
        "biolink:Protein"
      ]
    },
    "n2": {
      "is_set": false,
      "categories": [
        "biolink:Disease"
      ]
    }
  }
}

edeutsch commented 2 years ago

This is also valid, but a different use case than we were discussing. In this case, each Result is MONDO:1234 at one end and a disease at the other end, and then a set/collection of proteins that they share in common between them in the middle. Ranking should be something like the results with the most shared proteins appear highest, although there is plenty of room for improvements on the ranking that could take things like the quality of the edges, NGD between the two diseases, etc. into account as well.
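
A minimal sketch of that "shared proteins in the middle" behavior, assuming the neighbors of each disease have already been looked up. The function name shared_intermediates and the EX: identifiers are hypothetical, and the ranking uses only the count of shared proteins; edge quality or NGD are not modeled.

```python
from typing import Dict, List, Set, Tuple


def shared_intermediates(
    pinned_neighbors: Set[str],
    neighbors_by_candidate: Dict[str, Set[str]],
) -> List[Tuple[str, List[str]]]:
    """For each candidate end node, collect the intermediates it shares with the pinned node."""
    results = []
    for candidate, neighbors in neighbors_by_candidate.items():
        shared = pinned_neighbors & neighbors
        if shared:  # a candidate with no shared intermediate yields no Result
            results.append((candidate, sorted(shared)))
    # Most shared intermediates first; ties broken by ID.
    results.sort(key=lambda row: (-len(row[1]), row[0]))
    return results


# Proteins linked to the pinned disease (MONDO:1234 in the query above).
pinned = {"EX:p1", "EX:p2", "EX:p3"}
others = {
    "EX:disease_x": {"EX:p1", "EX:p9"},
    "EX:disease_y": {"EX:p1", "EX:p2", "EX:p3"},
    "EX:disease_z": {"EX:p8"},
}
shared = shared_intermediates(pinned, others)
```

Every protein kept in a Result is connected to both end nodes by construction, which matches the fully connected reading of is_set.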

cbizon commented 2 years ago

So in the case that I put above, every element of n1 in an answer must be attached to both n0 and n2?

edeutsch commented 2 years ago

In the ARAX implementation currently, yes. I suppose there might be an opportunity for different implementations to include only partially connected nodes, although I wouldn't recommend it. Seems related to the whole "can you return partial paths" discussion, which I'm not certain we ever really resolved.

cbizon commented 2 years ago

So it seems like there is different behavior for the same construct? If it's a bound node then I do enrichment, but if it's an unbound node then I don't?

edeutsch commented 2 years ago

I don't think the behavior needs to be any different whether it is bound or unbound. I suppose it might be, as a refinement decided by the implementer, but I think it would normally be the same.

cbizon commented 2 years ago

Sorry, I might be missing something, but is it up to the server to decide how to implement is_set? It might mean fully connected, or it might mean partially connected, and partially connected might mean enrichment, or max connectedness, or some other variant?

edeutsch commented 2 years ago

Until we decide that everyone has to do things the same way, I suppose we're all free to do things a bit differently. Aragorn is doing a whole lot of things differently than ARAX. Our current definition for is_set is this:

        is_set:
          type: boolean
          description: >-
            Boolean that if set to true, indicates that this QNode MAY have
            multiple KnowledgeGraph Nodes bound to it within each Result.
            The nodes in a set should be considered as a set of independent
            nodes, rather than a set of dependent nodes, i.e., the answer
            would still be valid if the nodes in the set were instead returned
            individually. Multiple QNodes may have is_set=True. If a QNode
            (n1) with is_set=True is connected to a QNode (n2) with
            is_set=False, each n1 must be connected to n2. If a QNode (n1)
            with is_set=True is connected to a QNode (n2) with is_set=True,
            each n1 must be connected to at least one n2.

So a strict reading means to me that partial connectedness is not permitted (contrary to what I supposed above).

It stipulates nothing about how ranking should be done, and I'm sure there is diversity in ideas on how ranking is best done in cases like this, from enrichment to max connectedness. So until we stipulate how it must be done, there can be diversity.
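
The connectivity rule in the is_set description can be checked mechanically. The sketch below is one possible reading, not part of TRAPI or any reasoner: check_set_connectivity is a hypothetical helper, edges are treated as undirected pairs of node IDs, and the EX: identifiers are made up.

```python
from typing import Dict, Iterable, List, Set, Tuple


def check_set_connectivity(
    n1_nodes: Set[str],
    n2_nodes: Set[str],
    edges: Iterable[Tuple[str, str]],
    n2_is_set: bool,
) -> List[str]:
    """Return the n1 nodes that violate the rule quoted above.

    n2_is_set=False: each n1 node must be connected to n2 (every bound n2 node).
    n2_is_set=True:  each n1 node must be connected to at least one n2 node.
    """
    connected: Dict[str, Set[str]] = {}
    for subject, obj in edges:
        # Direction is ignored here; that is an assumption of this sketch.
        connected.setdefault(subject, set()).add(obj)
        connected.setdefault(obj, set()).add(subject)

    violations = []
    for node in n1_nodes:
        linked = connected.get(node, set()) & n2_nodes
        ok = len(linked) >= 1 if n2_is_set else linked == n2_nodes
        if not ok:
            violations.append(node)
    return sorted(violations)


# n1 is a set of proteins, n2 is a single pinned disease (is_set=False).
strict = check_set_connectivity(
    {"EX:p1", "EX:p2", "EX:p3"},
    {"MONDO:1234"},
    [("MONDO:1234", "EX:p1"), ("MONDO:1234", "EX:p2")],
    n2_is_set=False,
)

# Both QNodes are sets: each n1 node only needs one n2 partner.
loose = check_set_connectivity(
    {"EX:p1", "EX:p2"},
    {"EX:d1", "EX:d2"},
    [("EX:p1", "EX:d1"), ("EX:p2", "EX:d2")],
    n2_is_set=True,
)
```

Under this reading EX:p3 is reported in the first case because it has no edge to the pinned disease, which is the partial connectedness that a strict reading rules out.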

vdancik commented 9 months ago

Example of a KG with a gene set:

{
    "knowledge_graph": {
        "nodes": {
            "NCBIGene:10000": {
                "categories": [
                    "biolink:Gene"
                ],
                "name": "AKT3"
            },
            "NCBIGene:10097": {
                "categories": [
                    "biolink:Gene"
                ],
                "name": "ACTR2"
            },
            "NCBIGene:10111": {
                "categories": [
                    "biolink:Gene"
                ],
                "name": "RAD50"
            },
            "UUID:c5d67629-ce16-41e9-8b35-e4acee04ed1f": {
                "categories": [
                    "biolink:Gene"
                ],
                "name": "AKT3,ACTR2,RAD50",
                "is_set": true
            },
            "MSigDB:HALLMARK_GLYCOLYSIS": {
                "categories": [
                    "biolink:Pathway"
                ],
                "name": "HALLMARK_GLYCOLYSIS"
            }
        },
        "edges": {
            "e0-fBufztAzDx": {
                "subject": "NCBIGene:10000",
                "predicate": "biolink:member_of",
                "object": "UUID:c5d67629-ce16-41e9-8b35-e4acee04ed1f",
                "sources": [
                    {
                        "resource_id": "infores:gelinea",
                        "resource_role": "primary_knowledge_source"
                    },
                    {
                        "resource_id": "infores:molepro",
                        "resource_role": "aggregator_knowledge_source",
                        "upstream_resource_ids": [
                            "infores:gelinea"
                        ]
                    }
                ]
            },
            "e1-fBufztAzDx": {
                "subject": "NCBIGene:10097",
                "predicate": "biolink:member_of",
                "object": "UUID:c5d67629-ce16-41e9-8b35-e4acee04ed1f",
                "sources": [
                    {
                        "resource_id": "infores:gelinea",
                        "resource_role": "primary_knowledge_source"
                    },
                    {
                        "resource_id": "infores:molepro",
                        "resource_role": "aggregator_knowledge_source",
                        "upstream_resource_ids": [
                            "infores:gelinea"
                        ]
                    }
                ]
            },
            "e2-fBufztAzDx": {
                "subject": "NCBIGene:10111",
                "predicate": "biolink:member_of",
                "object": "UUID:c5d67629-ce16-41e9-8b35-e4acee04ed1f",
                "sources": [
                    {
                        "resource_id": "infores:gelinea",
                        "resource_role": "primary_knowledge_source"
                    },
                    {
                        "resource_id": "infores:molepro",
                        "resource_role": "aggregator_knowledge_source",
                        "upstream_resource_ids": [
                            "infores:gelinea"
                        ]
                    }
                ]
            },
            "e3-fBufztAzDx": {
                "subject": "UUID:c5d67629-ce16-41e9-8b35-e4acee04ed1f",
                "predicate": "biolink:enriched_in",
                "object": "MSigDB:HALLMARK_GLYCOLYSIS",
                "sources": [
                    {
                        "resource_id": "infores:gelinea",
                        "resource_role": "primary_knowledge_source"
                    },
                    {
                        "resource_id": "infores:molepro",
                        "resource_role": "aggregator_knowledge_source",
                        "upstream_resource_ids": [
                            "infores:gelinea"
                        ]
                    }
                ]
            }
        }
    }
}
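
A KG of this shape could be assembled programmatically along the following lines. This is a rough sketch, not MolePro's or GeLiNEA's code: build_gene_set_kg and its parameters are hypothetical, edge keys are simplified to e0, e1, and so on, and the "sources" blocks shown above are omitted for brevity.

```python
import uuid
from typing import Dict, Tuple


def build_gene_set_kg(
    genes: Dict[str, str],
    target_id: str,
    target_name: str,
    target_category: str,
) -> Tuple[str, dict]:
    """Build a KG with one is_set node, member_of edges, and one enriched_in edge.

    `genes` maps gene CURIE to gene symbol. Returns (set_node_id, knowledge_graph).
    """
    set_id = f"UUID:{uuid.uuid4()}"  # fresh identifier for the collection node
    nodes = {
        gene_id: {"categories": ["biolink:Gene"], "name": symbol}
        for gene_id, symbol in genes.items()
    }
    nodes[set_id] = {
        "categories": ["biolink:Gene"],
        "name": ",".join(genes.values()),
        "is_set": True,
    }
    nodes[target_id] = {"categories": [target_category], "name": target_name}

    edges = {}
    for index, gene_id in enumerate(genes):
        edges[f"e{index}"] = {
            "subject": gene_id,
            "predicate": "biolink:member_of",
            "object": set_id,
        }
    # The enrichment statement attaches to the collection, not to individual genes.
    edges[f"e{len(genes)}"] = {
        "subject": set_id,
        "predicate": "biolink:enriched_in",
        "object": target_id,
    }
    return set_id, {"nodes": nodes, "edges": edges}


set_id, kg = build_gene_set_kg(
    {"NCBIGene:10000": "AKT3", "NCBIGene:10097": "ACTR2", "NCBIGene:10111": "RAD50"},
    "MSigDB:HALLMARK_GLYCOLYSIS",
    "HALLMARK_GLYCOLYSIS",
    "biolink:Pathway",
)
```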

edeutsch commented 9 months ago

Here's my graphical representation of what I think the proposal is. Is this right @vdancik ?

image

vdancik commented 9 months ago

Example query with an is_set flag:

{
    "message": {
        "query_graph": {
            "nodes": {
                "pathway": {
                    "categories": [
                        "biolink:Pathway"
                    ]
                },
                "gene": {
                    "ids": [
                        "NCBIGene:10000",
                        "NCBIGene:10097",
                        "NCBIGene:10111"
                    ],
                    "is_set": true
                }
            },
            "edges": {
                "t_edge": {
                    "object": "gene",
                    "subject": "pathway",
                    "predicates": [
                        "biolink:related_to"
                    ],
                    "knowledge_type":"inferred"
                }
            }
        }
    }
}

would produce the following result:

{
    "results": [
        {
            "analyses": [
                {
                    "edge_bindings": {
                        "gene": [
                            {
                                "id": "e3-fBufztAzDx"
                            }
                        ]
                    },
                    "resource_id": "infores:gelinea",
                    "support_graphs": [
                        "gene_set_aux_graph"
                    ]
                }
            ],
            "node_bindings": {
                "gene": [
                    {
                        "id": "UUID:c5d67629-ce16-41e9-8b35-e4acee04ed1f",
                        "is_set": true
                    },
                    {
                        "id": "NCBIGene:10000",
                        "query_id": "NCBIGene:10000"
                    },
                    {
                        "id": "NCBIGene:10097",
                        "query_id": "NCBIGene:10097"
                    },
                    {
                        "id": "NCBIGene:10111",
                        "query_id": "NCBIGene:10111"
                    }
                ],
                "pathway": [
                    {
                        "id": "MSigDB:HALLMARK_GLYCOLYSIS"
                    }
                ]
            }
        }
    ]
}

where the auxiliary graph is

{
    "auxiliary_graphs": {
        "gene_set_aux_graph": {
            "edges": [
                "e0-fBufztAzDx",
                "e1-fBufztAzDx",
                "e2-fBufztAzDx"
            ]
        }
    }
}

and the KG is the one in my previous comment.
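
A consumer of such a response could recover the members of each bound set by following the support graph. The sketch below assumes the proposed is_set flag on node bindings and the member_of edges from the KG example; set_members is a hypothetical helper, and the data is a trimmed copy of the messages above.

```python
from typing import Dict, List


def set_members(
    result: dict,
    auxiliary_graphs: dict,
    kg_edges: dict,
    qnode_key: str,
) -> Dict[str, List[str]]:
    """Map each set node bound to `qnode_key` to its members, via member_of support edges."""
    set_ids = {
        binding["id"]
        for binding in result["node_bindings"][qnode_key]
        if binding.get("is_set")
    }
    members: Dict[str, List[str]] = {}
    for analysis in result["analyses"]:
        for graph_id in analysis.get("support_graphs", []):
            for edge_id in auxiliary_graphs[graph_id]["edges"]:
                edge = kg_edges[edge_id]
                if edge["predicate"] == "biolink:member_of" and edge["object"] in set_ids:
                    members.setdefault(edge["object"], []).append(edge["subject"])
    return members


SET_ID = "UUID:c5d67629-ce16-41e9-8b35-e4acee04ed1f"
kg_edges = {
    "e0-fBufztAzDx": {"subject": "NCBIGene:10000", "predicate": "biolink:member_of", "object": SET_ID},
    "e1-fBufztAzDx": {"subject": "NCBIGene:10097", "predicate": "biolink:member_of", "object": SET_ID},
    "e2-fBufztAzDx": {"subject": "NCBIGene:10111", "predicate": "biolink:member_of", "object": SET_ID},
}
auxiliary_graphs = {
    "gene_set_aux_graph": {"edges": ["e0-fBufztAzDx", "e1-fBufztAzDx", "e2-fBufztAzDx"]}
}
result = {
    "analyses": [{"support_graphs": ["gene_set_aux_graph"]}],
    "node_bindings": {
        "gene": [
            {"id": SET_ID, "is_set": True},
            {"id": "NCBIGene:10000", "query_id": "NCBIGene:10000"},
        ]
    },
}
members = set_members(result, auxiliary_graphs, kg_edges, "gene")
```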

edeutsch commented 9 months ago

So here is a slight update to the picture based on today's discussion. The query predicate is updated. And I depicted Result #1 as one that contains all 5 input genes, but Result #2 is the next best match where 3 of the 5 match.

image

There was some discussion of whether this means AND, OR, or a "soft AND", i.e. "as many as possible". I am thinking that the is_set=true construction is interpreted to mean "as many of the set as possible". More members would mean a higher rank, but sets that don't contain all members are not automatically discarded. Maybe this is not the desired outcome, though.

Additional note: In this scenario, the Query must have knowledge_type: inferred (i.e. "creative mode")
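
The three interpretations can be contrasted in a few lines. This is purely illustrative: score_match, the mode names, and the EX: gene identifiers are hypothetical, and the score is just the matched fraction.

```python
from typing import Optional, Set


def score_match(query_set: Set[str], matched: Set[str], mode: str) -> Optional[float]:
    """Return a score for a candidate, or None if the candidate is discarded.

    "and":      keep only if every query member matched.
    "or":       keep if any member matched; all kept candidates score the same.
    "soft_and": keep if any member matched; score grows with the matched fraction.
    """
    hits = len(query_set & matched)
    if hits == 0:
        return None
    if mode == "and":
        return 1.0 if hits == len(query_set) else None
    if mode == "or":
        return 1.0
    if mode == "soft_and":
        return hits / len(query_set)
    raise ValueError(f"unknown mode: {mode}")


five_genes = {"EX:g1", "EX:g2", "EX:g3", "EX:g4", "EX:g5"}
three_matched = {"EX:g1", "EX:g2", "EX:g3"}

strict_and = score_match(five_genes, three_matched, "and")
plain_or = score_match(five_genes, three_matched, "or")
soft_and = score_match(five_genes, three_matched, "soft_and")
```

With 3 of 5 genes matched, the candidate is dropped under AND, kept without distinction under OR, and kept with a lower score under soft AND, which corresponds to the second-ranked result in the picture.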

How is this different from the sort of thing that COHD already does?

edeutsch commented 8 months ago

We should probably document why this isn't good enough:

image

Can we capture all the enrichment statistical metrics in each Result.Analysis.attributes[]?
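
As one way to picture this, an enrichment p-value and the overlap size could be computed and attached as Analysis attributes. The sketch below uses a one-sided hypergeometric test as a stand-in; the thread does not say which statistics a given tool reports. The helper names, the EX:overlap_count type, and the use of biolink:p_value as the attribute_type_id are assumptions, not an agreed convention.

```python
from math import comb
from typing import List


def hypergeom_p_value(overlap: int, population: int, annotated: int, query_size: int) -> float:
    """P(X >= overlap) when drawing `query_size` genes from `population`,
    of which `annotated` belong to the pathway."""
    upper = min(query_size, annotated)
    total = comb(population, query_size)
    favorable = sum(
        comb(annotated, i) * comb(population - annotated, query_size - i)
        for i in range(overlap, upper + 1)
    )
    return favorable / total


def enrichment_attributes(overlap: int, population: int, annotated: int, query_size: int) -> List[dict]:
    """Package the metrics as a list of TRAPI-style attribute dicts (illustrative type IDs)."""
    return [
        {
            "attribute_type_id": "biolink:p_value",  # assumed type ID
            "value": hypergeom_p_value(overlap, population, annotated, query_size),
        },
        {"attribute_type_id": "EX:overlap_count", "value": overlap},  # hypothetical type ID
    ]


# Toy numbers: 10 genes in total, 3 in the pathway, 3 in the query, all 3 overlap.
attributes = enrichment_attributes(overlap=3, population=10, annotated=3, query_size=3)
```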

The query predicate "related_to" is tripping us up here. Better to consider a query predicate like "enriched_in" (which does not actually exist yet). Or "participates_in"?

TereseCamp commented 8 months ago

I have added an “alternate Support Graph” item #5 with the link to the agenda for today’s meeting. Not sure if this was what you wanted, so please let me know if we need other actions.

Terese Camp, PMP, Research Project Manager, Renaissance Computing Institute (RENCI), University of North Carolina at Chapel Hill
