biothings / biothings_explorer

TRAPI service for BioThings Explorer
https://explorer.biothings.io
Apache License 2.0

removing duplicate IDs from sub-queries #332

Closed colleenXu closed 2 years ago

colleenXu commented 2 years ago

When BTE receives a query with multiple IDs on the same query-node AND those IDs actually resolve to the same entity, BTE seems to act as if it has "multiple identical starting nodes" when it generates its sub-queries. This leads to repetitive, less-efficient behavior.

Desired behavior: If entities for a query-node have identical ID-resolution objects, they should be merged / duplicates removed.
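
A minimal sketch of the merge described above, not BTE's actual code. It assumes the ID-resolution step can be reduced to a mapping from each input ID to one primary ID per entity. The function name `merge_resolved_ids` and the hand-written `resolved` mapping are hypothetical stand-ins; the IDs are the three Barrett's esophagus IDs from the example query in this issue.

```python
def merge_resolved_ids(input_ids, resolved):
    """Group input IDs by the primary ID they resolve to (one entry per entity)."""
    merged = {}
    for curie in input_ids:
        # Assumption: an ID the resolver does not know stands for itself.
        primary = resolved.get(curie, curie)
        merged.setdefault(primary, []).append(curie)
    return merged


# Hypothetical resolver output: all three IDs map to the same primary ID.
resolved = {
    "MONDO:0013662": "MONDO:0013662",
    "DOID:9206": "MONDO:0013662",
    "UMLS:C0004763": "MONDO:0013662",
}

entities = merge_resolved_ids(
    ["MONDO:0013662", "DOID:9206", "UMLS:C0004763"], resolved
)
```

With this grouping, the query-node would carry one starting entity instead of three, while still remembering which input IDs were merged into it.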

EDIT: this came up because some Translator groups are sending long lists of IDs on their q-nodes that all resolve to the same thing (see the query here). They are likely doing some kind of ID resolution first and then sending their queries to both KPs (that don't do ID resolution) and ARAs (that do).


For example, this query uses 3 IDs that currently resolve to the same entity (Barrett's esophagus):

query here

```
{
    "message": {
        "query_graph": {
            "nodes": {
                "n0": {
                    "ids": [
                        "MONDO:0013662",
                        "DOID:9206",
                        "UMLS:C0004763"
                    ],
                    "categories": [
                        "biolink:Disease"
                    ]
                },
                "n1": {
                    "categories": [
                        "biolink:Gene"
                    ]
                }
            },
            "edges": {
                "e01": {
                    "subject": "n0",
                    "object": "n1",
                    "predicates": [
                        "biolink:condition_associated_with_gene"
                    ]
                }
            }
        }
    }
}
```

BTE is then sending sub-queries that have the same ID repeated (since all 3 starting IDs resolve to the exact same thing):

Example where BTE sends an identical GET query 3 times

```
{
    "timestamp": "2021-10-21T20:18:57.603Z",
    "level": "DEBUG",
    "message": "call-apis: Succesfully made the following query: {\"url\":\"https://api.monarchinitiative.org/api/bioentity/disease/MONDO:0013662/genes\",\"params\":{\"direct\":true,\"rows\":200,\"unselect_evidence\":true},\"method\":\"get\",\"timeout\":50000}",
    "code": null
},
{
    "timestamp": "2021-10-21T20:18:57.618Z",
    "level": "DEBUG",
    "message": "call-apis: After transformation, BTE is able to retrieve 25 hits!",
    "code": null
},
{
    "timestamp": "2021-10-21T20:18:57.653Z",
    "level": "DEBUG",
    "message": "call-apis: Succesfully made the following query: {\"url\":\"https://api.monarchinitiative.org/api/bioentity/disease/MONDO:0013662/genes\",\"params\":{\"direct\":true,\"rows\":200,\"unselect_evidence\":true},\"method\":\"get\",\"timeout\":50000}",
    "code": null
},
{
    "timestamp": "2021-10-21T20:18:57.668Z",
    "level": "DEBUG",
    "message": "call-apis: After transformation, BTE is able to retrieve 25 hits!",
    "code": null
},
{
    "timestamp": "2021-10-21T20:18:57.682Z",
    "level": "DEBUG",
    "message": "call-apis: Succesfully made the following query: {\"url\":\"https://api.monarchinitiative.org/api/bioentity/disease/MONDO:0013662/genes\",\"params\":{\"direct\":true,\"rows\":200,\"unselect_evidence\":true},\"method\":\"get\",\"timeout\":50000}",
    "code": null
},
{
    "timestamp": "2021-10-21T20:18:57.712Z",
    "level": "DEBUG",
    "message": "call-apis: After transformation, BTE is able to retrieve 25 hits!",
    "code": null
},
```

Example with a Biothings API (DOID:9206 repeated)

```
{
    "timestamp": "2021-10-21T20:18:57.521Z",
    "level": "DEBUG",
    "message": "call-apis: Succesfully made the following query: {\"url\":\"https://biothings.ncats.io/DISEASES/query\",\"params\":{\"fields\":\"DISEASES.associatedWith\",\"size\":\"1000\"},\"data\":\"q=DOID:9206,DOID:9206,DOID:9206&scopes=DISEASES.doid\",\"method\":\"post\",\"timeout\":50000}",
    "code": null
},
```

Example with a TRAPI API

```
{
    "timestamp": "2021-10-21T20:18:58.133Z",
    "level": "DEBUG",
    "message": "call-apis: Succesfully made the following query: {\"url\":\"https://automat.renci.org/hetio/1.2/query\",\"data\":{\"message\":{\"query_graph\":{\"nodes\":{\"n0\":{\"ids\":[\"MONDO:0013662\",\"MONDO:0013662\",\"MONDO:0013662\"],\"categories\":[\"biolink:Disease\"]},\"n1\":{\"categories\":[\"biolink:Gene\"]}},\"edges\":{\"e01\":{\"subject\":\"n0\",\"object\":\"n1\",\"predicates\":[\"biolink:condition_associated_with_gene\"]}}}},\"submitter\":\"infores:bte\"},\"method\":\"post\",\"timeout\":10000,\"headers\":{\"Content-Type\":\"application/json\"}}",
    "code": null
},
```
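
To illustrate what removing the repeats from these sub-queries could look like, here is a rough sketch. It is not how BTE builds its requests: the helper names (`dedupe_preserving_order`, `build_batch_query`, `dedupe_requests`) are hypothetical, and the request dicts only mimic the shape of the logged queries above.

```python
import json


def dedupe_preserving_order(ids):
    """Drop repeated IDs while keeping first-seen order."""
    return list(dict.fromkeys(ids))


def build_batch_query(ids, scopes):
    """Build a batch POST body in the style of the logged Biothings query."""
    return f"q={','.join(dedupe_preserving_order(ids))}&scopes={scopes}"


def dedupe_requests(requests):
    """Keep one copy of each request that has the same URL, params, and method."""
    seen = set()
    unique = []
    for request in requests:
        key = json.dumps(request, sort_keys=True)  # canonical form for comparison
        if key not in seen:
            seen.add(key)
            unique.append(request)
    return unique


# Batch case: three copies of the same ID collapse to one.
data = build_batch_query(["DOID:9206", "DOID:9206", "DOID:9206"], "DISEASES.doid")

# GET case: three identical requests collapse to one.
monarch_request = {
    "url": "https://api.monarchinitiative.org/api/bioentity/disease/MONDO:0013662/genes",
    "params": {"direct": True, "rows": 200, "unselect_evidence": True},
    "method": "get",
}
unique_requests = dedupe_requests([dict(monarch_request) for _ in range(3)])
```

Deduplicating upstream, at the point where entities are merged on the query-node, would make both helpers unnecessary; they only show the effect on the outgoing sub-queries.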
colleenXu commented 2 years ago

Note that this doesn't happen for BTE when the exact same ID is repeated on a query-node:

Example query:

```
{
    "message": {
        "query_graph": {
            "nodes": {
                "n0": {
                    "ids": [
                        "MONDO:0013662",
                        "MONDO:0013662",
                        "MONDO:0013662"
                    ],
                    "categories": [
                        "biolink:Disease"
                    ]
                },
                "n1": {
                    "categories": [
                        "biolink:Gene"
                    ]
                }
            },
            "edges": {
                "e01": {
                    "subject": "n0",
                    "object": "n1",
                    "predicates": [
                        "biolink:condition_associated_with_gene"
                    ]
                }
            }
        }
    }
}
```

Example of the logs without the repeating IDs issue:

        {
            "timestamp": "2021-10-21T20:30:21.343Z",
            "level": "DEBUG",
            "message": "call-apis: Succesfully made the following query: {\"url\":\"https://biothings.ncats.io/DISEASES/query\",\"params\":{\"fields\":\"DISEASES.associatedWith\",\"size\":\"1000\"},\"data\":\"q=DOID:9206&scopes=DISEASES.doid\",\"method\":\"post\",\"timeout\":50000}",
            "code": null
        },
        {
            "timestamp": "2021-10-21T20:30:21.355Z",
            "level": "DEBUG",
            "message": "call-apis: After transformation, BTE is able to retrieve 18 hits!",
            "code": null
        },
        {
            "timestamp": "2021-10-21T20:30:21.398Z",
            "level": "DEBUG",
            "message": "call-apis: Succesfully made the following query: {\"url\":\"https://api.monarchinitiative.org/api/bioentity/disease/MONDO:0013662/genes\",\"params\":{\"direct\":true,\"rows\":200,\"unselect_evidence\":true},\"method\":\"get\",\"timeout\":50000}",
            "code": null
        },
        {
            "timestamp": "2021-10-21T20:30:21.414Z",
            "level": "DEBUG",
            "message": "call-apis: After transformation, BTE is able to retrieve 25 hits!",
            "code": null
        },
        {
            "timestamp": "2021-10-21T20:30:21.825Z",
            "level": "DEBUG",
            "message": "call-apis: Succesfully made the following query: {\"url\":\"https://automat.renci.org/hetio/1.2/query\",\"data\":{\"message\":{\"query_graph\":{\"nodes\":{\"n0\":{\"ids\":[\"MONDO:0013662\"],\"categories\":[\"biolink:Disease\"]},\"n1\":{\"categories\":[\"biolink:Gene\"]}},\"edges\":{\"e01\":{\"subject\":\"n0\",\"object\":\"n1\",\"predicates\":[\"biolink:condition_associated_with_gene\"]}}}},\"submitter\":\"infores:bte\"},\"method\":\"post\",\"timeout\":10000,\"headers\":{\"Content-Type\":\"application/json\"}}",
            "code": null
        },
        {
            "timestamp": "2021-10-21T20:30:21.827Z",
            "level": "DEBUG",
            "message": "call-apis: After transformation, BTE is able to retrieve 24 hits!",
            "code": null
        },
        {
            "timestamp": "2021-10-21T20:30:21.827Z",
            "level": "DEBUG",
            "message": "call-apis: Total number of results returned for this query is 67",
            "code": null
        },
colleenXu commented 2 years ago

Another example query with IDs that actually resolve to the same entity:

```
{
    "message": {
        "query_graph": {
            "nodes": {
                "n0": {
                    "ids": [
                        "PUBCHEM.COMPOUND:5743",
                        "CHEMBL.COMPOUND:CHEMBL384467",
                        "DRUGBANK:DB01234"
                    ],
                    "categories": [
                        "biolink:SmallMolecule"
                    ]
                },
                "n1": {
                    "categories": [
                        "biolink:Gene"
                    ]
                }
            },
            "edges": {
                "e01": {
                    "subject": "n0",
                    "object": "n1",
                    "predicates": ["biolink:physically_interacts_with"]
                }
            }
        }
    }
}
```

tokebe commented 2 years ago

@colleenXu The above query has its duplicate IDs removed as expected after https://github.com/biothings/bte_trapi_query_graph_handler/pull/56 was merged. I think this issue is ready to close?

colleenXu commented 2 years ago

should be deployed to prod, so closing