biothings / biothings_explorer

TRAPI service for BioThings Explorer
https://explorer.biothings.io
Apache License 2.0

Some queries return undefined results, causing caching to fail #325

Closed. tokebe closed this issue 3 years ago

tokebe commented 3 years ago

This issue doesn't cause queries to BTE to fail, but it does cause caching to fail for the affected query.

Some sub-queries at the call-apis level return undefined results, which the caching step is not able to handle.

Example query:

```
{
  "message": {
    "query_graph": {
      "nodes": {
        "n1": {
          "categories": ["biolink:DiseaseOrPhenotypicFeature"],
          "is_set": false,
          "ids": [
            "MONDO:0005071", "DOID:863", "SNOMEDCT:118940003", "SNOMEDCT:155262005", "NCIT:C26835",
            "UMLS:C0027765", "MESH:D009422", "MEDDRA:10013242", "MEDDRA:10029202", "MEDDRA:10029203",
            "MEDDRA:10029205", "MEDDRA:10029286", "MEDDRA:10029298", "MESH:C519298", "MESH:C071458",
            "MONDO:0002184", "DOID:2044", "SNOMEDCT:235889003", "UMLS:C0524912", "UMLS:C2717837",
            "UMLS:C4505492", "UMLS:C4505493", "MESH:D056487", "MESH:C040391", "UMLS:C0262505",
            "SNOMEDCT:235871004", "MEDDRA:10019737", "MEDDRA:10019741", "MONDO:0013282", "DOID:13372",
            "OMIM:613490", "ORPHANET:60", "SNOMEDCT:30188007", "NCIT:C84397", "UMLS:C0221757",
            "MESH:C531610", "MESH:D019896", "MEDDRA:10001806", "MEDDRA:10001811", "MONDO:0005366",
            "SNOMEDCT:61977001", "UMLS:C0524909", "MESH:D019694", "MEDDRA:10008910", "UMLS:C1856453",
            "MONDO:0000775", "DOID:0060500", "SNOMEDCT:416093006", "SNOMEDCT:416098002", "UMLS:C0013182",
            "UMLS:C5139486", "MESH:D004342", "HP:0410323", "MEDDRA:10013661", "MEDDRA:10013700",
            "MEDDRA:10082135", "MESH:C518324", "UMLS:C3276783", "UMLS:C4231138", "SNOMEDCT:237601000",
            "UMLS:C0342271", "MONDO:0018229", "DOID:0050426", "OMIM:142830", "OMIM:608579",
            "ORPHANET:36426", "SNOMEDCT:403609001", "SNOMEDCT:73442001", "SNOMEDCT:768946000", "NCIT:C79484",
            "EFO:0004276", "UMLS:C0038325", "UMLS:C1274933", "UMLS:C1837818", "UMLS:C1840547",
            "UMLS:C1840548", "UMLS:C1969756", "UMLS:C2608081", "UMLS:C2750833", "UMLS:C3277286",
            "UMLS:C3658301", "UMLS:C3658302", "UMLS:C4016206", "MESH:D013262", "MEDDRA:10006561",
            "MEDDRA:10015209", "MEDDRA:10015211", "MEDDRA:10015219", "MEDDRA:10015220", "MEDDRA:10042029",
            "MEDDRA:10042030", "MEDDRA:10042033", "MEDDRA:10042849", "MONDO:0005790", "DOID:12549",
            "SNOMEDCT:40468003", "NCIT:C3096", "UMLS:C0019159", "MESH:D006506", "MEDDRA:10019719",
            "MEDDRA:10019780", "MEDDRA:10019782", "MEDDRA:10021913", "MEDDRA:10047447", "UMLS:C3278891",
            "SNOMEDCT:235877000", "UMLS:C0473117", "MEDDRA:10023025", "MEDDRA:10023040", "MESH:C064613",
            "MONDO:0005267", "DOID:114", "SNOMEDCT:194707003", "SNOMEDCT:56265001", "NCIT:C3079",
            "EFO:0003777", "UMLS:C0018799", "UMLS:CN236661", "UMLS:CN239852", "MESH:D006331",
            "MEDDRA:10007540", "MEDDRA:10007541", "MEDDRA:10013199", "MEDDRA:10019276", "MEDDRA:10019277",
            "MEDDRA:10061024", "CHEBI:59683", "UMLS:C4049267", "MEDDRA:10076955", "MONDO:0043693",
            "SNOMEDCT:41309000", "NCIT:C34783", "UMLS:C0023896", "UMLS:C1442981", "MESH:D008108",
            "MEDDRA:10001626", "MEDDRA:10001627", "MEDDRA:10001628", "MEDDRA:10019844", "MESH:C110500",
            "MONDO:0013433", "DOID:0060643", "OMIM:613806", "ORPHANET:171", "SNOMEDCT:197441003",
            "UMLS:C0566602", "MESH:C536419", "MEDDRA:10036732", "UMLS:C0455417", "SNOMEDCT:266902008",
            "MONDO:0005354", "UMLS:C0524910", "MESH:D019698", "MEDDRA:10008912", "MONDO:0005359",
            "SNOMEDCT:197352008", "SNOMEDCT:235876009", "SNOMEDCT:427399008", "NCIT:C84427", "UMLS:C0019193",
            "UMLS:C0860207", "UMLS:C1262760", "UMLS:C3658290", "UMLS:C4277682", "UMLS:C4279912",
            "MESH:D056486", "MEDDRA:10013705", "MEDDRA:10013762", "MEDDRA:10019766", "MEDDRA:10019795",
            "MEDDRA:10072268", "MEDDRA:10072734", "MEDDRA:10072937", "UMLS:C1857414", "UMLS:C0019699",
            "SNOMEDCT:165816005", "MEDDRA:10020180", "MEDDRA:10020183", "MEDDRA:10020188", "MEDDRA:10020425",
            "MEDDRA:10036219", "NCIT:C15175", "MESH:D006679", "MONDO:0004335", "DOID:77",
            "SNOMEDCT:119292006", "SNOMEDCT:53619000", "NCIT:C2990", "UMLS:C0012242", "UMLS:C0017178",
            "UMLS:C0559031", "UMLS:C1565321", "UMLS:C4023588", "MESH:D004066", "MESH:D005767",
            "HP:0011024", "MEDDRA:10013225", "MEDDRA:10017876", "MEDDRA:10017922", "MEDDRA:10017944",
            "MEDDRA:10017945", "MEDDRA:10017947", "MEDDRA:10071275", "MONDO:0013209", "DOID:0080208",
            "SNOMEDCT:197315008", "NCIT:C84444", "UMLS:C0400966", "MESH:D065626", "MEDDRA:10029530",
            "MEDDRA:10082249", "UMLS:C0455540", "SNOMEDCT:161523006", "MESH:C093154", "UMLS:C4554323",
            "NCIT:C143255", "UMLS:C0149709", "SNOMEDCT:165806002", "MEDDRA:10019739", "MEDDRA:10019740",
            "MEDDRA:10019742", "MESH:C069356", "MESH:C115528", "MONDO:0007745", "DOID:2739",
            "OMIM:143500", "SNOMEDCT:27503000", "NCIT:C84729", "UMLS:C0017551", "MESH:D005878",
            "MEDDRA:10018267", "MONDO:0001475", "DOID:1227", "SNOMEDCT:191336001", "SNOMEDCT:303011007",
            "UMLS:C0027947", "MESH:D009503", "MEDDRA:10029354", "MEDDRA:10029355", "UMLS:C0948251",
            "MEDDRA:10052022", "UMLS:C2674487", "UMLS:C0022346", "UMLS:C0242183", "SNOMEDCT:18165001",
            "SNOMEDCT:60217008", "HP:0000952", "MEDDRA:10021207", "MEDDRA:10023126", "MEDDRA:10023132",
            "MEDDRA:10023135", "MEDDRA:10023139", "NCIT:C3143", "NCIT:C35299", "MESH:D007565",
            "MESH:D000081226"
          ]
        },
        "n2": {
          "categories": ["biolink:Gene"],
          "is_set": false
        }
      },
      "edges": {
        "e02": {
          "subject": "n2",
          "object": "n1",
          "predicates": ["biolink:gene_associated_with_condition"]
        }
      }
    }
  }
}
```

tokebe commented 3 years ago

Results which become undefined after transformation appear to be in this format:

{
  node_bindings: {
    n0: [
      {
        id: "MONDO:0013662",
        qnode_id: "MONDO:0004335",
      },
    ],
    n1: [
      {
        id: "NCBIGene:9536",
      },
    ],
  },
  edge_bindings: {
    e01: [
      {
        id: "f8568a60293294505d45862886b12c90",
        attributes: null,
      },
    ],
  },
  score: null,
}

In particular, this is the first of 5226 results in the response to a query to Automat Hetio (TRAPI v1.2.0).

colleenXu commented 3 years ago

@tokebe I'm wondering if something like this is the underlying query to Automat Hetio. (POST to https://automat.renci.org/hetio/1.2/query)

{
    "message": {
        "query_graph": {
            "nodes": {
                "n0": {
                    "ids":["MONDO:0013662"],
                    "categories":["biolink:Disease"]
                },
                "n1": {
                    "categories":["biolink:Gene"]
                }
            },
            "edges": {
                "e0": {
                    "subject": "n0",
                    "object": "n1",
                    "predicates": ["biolink:condition_associated_with_gene"]
                }
            }
        }
    }
}

In the response to this query, I see a few places where BTE could be tripped up but I'm not sure what's going on...

The edge mentioned above may be this one (this API is returning subject/object/predicate in weird order)...

Do you see anything that would trip BTE up with parsing / caching?

                "f8568a60293294505d45862886b12c90": {
                    "subject": "NCBIGene:9536",
                    "object": "MONDO:0013662",
                    "predicate": "biolink:gene_associated_with_condition",
                    "attributes": [
                        {
                            "attribute_type_id": "biolink:primary_knowledge_source",
                            "value": "infores:hetio",
                            "value_type_id": "biolink:InformationResource",
                            "original_attribute_name": "biolink:primary_knowledge_source",
                            "value_url": null,
                            "attribute_source": null,
                            "description": null,
                            "attributes": null
                        },
                        {
                            "attribute_type_id": "biolink:Attribute",
                            "value": [
                                "DisGeNET"
                            ],
                            "value_type_id": "EDAM:data_0006",
                            "original_attribute_name": "hetio_source",
                            "value_url": null,
                            "attribute_source": null,
                            "description": null,
                            "attributes": null
                        },
                        {
                            "attribute_type_id": "biolink:relation",
                            "value": "hetio:ASSOCIATES_DaG",
                            "value_type_id": "EDAM:data_0006",
                            "original_attribute_name": "relation",
                            "value_url": null,
                            "attribute_source": null,
                            "description": null,
                            "attributes": null
                        },
                        {
                            "attribute_type_id": "biolink:aggregator_knowledge_source",
                            "value": "infores:automat-hetio",
                            "value_type_id": "biolink:InformationResource",
                            "original_attribute_name": "biolink:aggregator_knowledge_source",
                            "value_url": null,
                            "attribute_source": null,
                            "description": null,
                            "attributes": null
                        }
                    ]
                },

I don't think we're taking anything from the knowledge_graph.nodes or results sections, but I can paste parts of those too if it helps.

tokebe commented 3 years ago

I've since found the exact spot where undefined is being added. In the TRAPI transformer, each edge is transformed individually. _updateInput is called on the edge and expects to find the input (edgeBinding.subject) among the keys of edge.original_input. It doesn't find it, so the edge's input.obj is left undefined, which in turn causes _transformIndividualEdge to return undefined.
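
A minimal sketch of that failure mode, assuming original_input is a plain object keyed by the IDs that were sent in the sub-query. The names updateInputSketch and transformEdgeSketch are hypothetical stand-ins for illustration, not BTE's actual _updateInput and _transformIndividualEdge:

```javascript
// Hypothetical stand-in for _updateInput: look the edge's subject up in
// the IDs that were originally sent to the API.
function updateInputSketch(edge, subjectId) {
  // Assumption: original_input maps each queried ID to its input object.
  return edge.original_input[subjectId]; // undefined when the ID was never queried
}

// Hypothetical stand-in for _transformIndividualEdge.
function transformEdgeSketch(edge, edgeBinding) {
  const inputObj = updateInputSketch(edge, edgeBinding.subject);
  if (inputObj === undefined) {
    return undefined; // this is the value that later trips up caching
  }
  return { input: inputObj, subject: edgeBinding.subject, object: edgeBinding.object };
}

const edge = { original_input: { "MONDO:0004335": { curie: "MONDO:0004335" } } };

// The API answered with an ID that differs from the one that was queried.
const mismatched = transformEdgeSketch(edge, {
  subject: "MONDO:0013662",
  object: "NCBIGene:9536",
});

// The API answered with the queried ID, so the lookup succeeds.
const matched = transformEdgeSketch(edge, {
  subject: "MONDO:0004335",
  object: "NCBIGene:9536",
});
```

Here mismatched ends up undefined because "MONDO:0013662" is not a key of original_input, while matched carries its input object.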

tokebe commented 3 years ago

In other words, the response from the API contains node bindings for n0 that use IDs not present in the query graph's n0 ID array, causing a mismatch between the edge binding's subject and the original input.

colleenXu commented 3 years ago

Basically, it's a matter of where caching is done in the process.

The issue is that Automat Hetio is doing subentity expansion and specifying the qnode_id (which is different from the returned node ID) in the results section. We normally ingest only the knowledge_graph.edges section of a TRAPI response, which doesn't give us this kind of information.
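
One way that information could be recovered, sketched under the assumption that each node binding may carry an optional qnode_id, as in the result shown earlier in this thread. The helper name buildQnodeIdLookup is hypothetical and not part of BTE:

```javascript
// Hypothetical helper: map each returned node ID to the query-graph ID it
// was expanded from, using the results section of a TRAPI response.
function buildQnodeIdLookup(results) {
  const lookup = new Map();
  for (const result of results) {
    for (const bindings of Object.values(result.node_bindings ?? {})) {
      for (const binding of bindings) {
        if (binding.qnode_id !== undefined) {
          // The API expanded the queried ID into a different returned ID.
          lookup.set(binding.id, binding.qnode_id);
        } else if (!lookup.has(binding.id)) {
          // No qnode_id: assume the node was returned under the queried ID.
          lookup.set(binding.id, binding.id);
        }
      }
    }
  }
  return lookup;
}

const results = [
  {
    node_bindings: {
      n0: [{ id: "MONDO:0013662", qnode_id: "MONDO:0004335" }],
      n1: [{ id: "NCBIGene:9536" }],
    },
    edge_bindings: {
      e01: [{ id: "f8568a60293294505d45862886b12c90", attributes: null }],
    },
    score: null,
  },
];

const qnodeIdLookup = buildQnodeIdLookup(results);
```

With a lookup like this, an edge whose subject or object is "MONDO:0013662" could be traced back to the queried "MONDO:0004335" before matching against the original input.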

BTE currently "cannot handle" these records and marks them as undefined (since the subject/object IDs don't match what was executed for the sub-query edge), and then removes those records...
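
A defensive sketch of that removal step, assuming the transformed records arrive as an array that may contain undefined entries and that the cache stores JSON strings. cacheRecords and the Map used as a cache are hypothetical placeholders, not BTE's actual caching code:

```javascript
// Hypothetical caching step: drop records the transformer could not build,
// then serialize the rest.
function cacheRecords(cache, key, records) {
  const usable = records.filter((record) => record !== undefined && record !== null);
  cache.set(key, JSON.stringify(usable));
  return { cached: usable.length, dropped: records.length - usable.length };
}

const cache = new Map(); // stand-in for the real cache backend
const summary = cacheRecords(cache, "example-edge-hash", [
  { subject: "MONDO:0004335", object: "NCBIGene:9536" },
  undefined,
]);
```

Returning the dropped count makes it easy to log how many records were discarded for a given sub-query.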

The key was this comment, which refers to this merged PR and specifically the changes made here.

colleenXu commented 3 years ago

Can this issue now be closed? @tokebe @newgene

Also, is this related to, or a duplicate of, https://github.com/biothings/BioThings_Explorer_TRAPI/issues/280?