biothings / biothings_explorer

TRAPI service for BioThings Explorer
https://explorer.biothings.io
Apache License 2.0
10 stars 11 forks source link

Improper merging of PFOCR edge attributes from multiple Records #463

Closed tokebe closed 1 year ago

tokebe commented 2 years ago

When a single edge is derived from multiple Records, the edge attributes appear to be merged improperly, at least in the case of the attribute figure_url, figure_title, pmc_reference from PFOCR. What appears to be happening is that only one figure_url is kept, instead of each being merged into an array.

To replicate

In the workspace:

Query:

URL: http://localhost:3000/v1/smartapi/edeb26858bd27d0322af93e7a9e08761/query

Body:

{
    "message": {
        "query_graph": {
            "nodes": {
                "n0": {
                    "categories": ["biolink:SmallMolecule"],
                    "ids": ["MESH:C000541"]
                },
                "n1": {
                    "categories": ["biolink:Gene"],
                    "ids": ["NCBIGene:208"]
                }
            },
            "edges": {
                "e01": {
                    "subject": "n0",
                    "object": "n1"
                }
            }
        }
    }
}

This query should be returning 9 figures instead of the 1 that appears (attributes of the only edge in message.knowledge_graph.edges)

What has been confirmed

TODO

Tagging @colleenXu @ariutta for additional details.

colleenXu commented 2 years ago

What appears to be happening is that only one figure_url is kept, instead of each being merged into an array.

and maybe only one figure_title, pmc_reference is being kept as well...

Basically, I think the "last" Record processed is what is kept (all prior ones are overwritten?)...and instead we want to make the values arrays and add elements to them...

Something like this is already done for other apis ingested through x-bte annotation, like semmeddb (think pubmed IDs) and MyVariant

ariutta commented 2 years ago

Maybe semmeddb would have an equivalent test case to see whether the problem happens there too?

ariutta commented 2 years ago

Without going step-by-step through the code, I'm not aware off the top of my head what would be causing this.

rjawesome commented 2 years ago

I think I figured out where the issue is being caused in the code, I can probably make a PR soon if I am correct?

rjawesome commented 2 years ago

This is the expected behavior, correct? (figure_url and pmc_reference are also arrays). The file I am currently editing for my fix is query_graph_hander/graph/kg_edge.js. Also, would we like to make these fields an array if there is a single value or just keep it as that single value? image

colleenXu commented 2 years ago

the screenshot looks good, ask @tokebe whether you're working with the correct file.

For single values, I think it's fine to leave them as single values (not 1-element arrays). But if you see something different in the code (like elsewhere single values are converted into 1-element arrays), let us know...

tokebe commented 2 years ago

Screenshot looks good, make a PR as soon as you're ready.

colleenXu commented 2 years ago

on a related note @rjawesome @tokebe I think "sets" (aka unique values only) is more useful than "lists". I've noticed that problem with stuff from MyVariant; look at the civic stuff on edges in the following query.

However, maybe this is something separate enough to be a different issue?

POST to MyVariant specifically: http://localhost:3000/v1/smartapi/09c8782d9f4027712e65b95424adba79/query

{
    "message": {
        "query_graph": {
            "nodes": {
                "n0": {
                    "ids":["DBSNP:rs121913521"],
                    "categories":["biolink:SequenceVariant"]
                },
                "n1": {
                    "categories":["biolink:Disease"]
               }
            },
            "edges": {
                "e1": {
                    "subject": "n0",
                    "object": "n1"
                }
            }
        }
    }
}

Example:

                        {
                            "attribute_type_id": "civic_clinical_significance",
                            "value": [
                                "Sensitivity/Response",
                                "Sensitivity/Response",
                                "Sensitivity/Response",
                                "Sensitivity/Response",
                                "Resistance",
                                "Resistance",
                                "Resistance",
                                "Sensitivity/Response",
                                "Sensitivity/Response",
                                "Sensitivity/Response"
                            ]
                        },
rjawesome commented 2 years ago

I can make each attribute into a Set, that should be no problem.

rjawesome commented 2 years ago

@colleenXu Hmm, your query seems to create another issue, apparently, sometimes an array is being put as the attribute in the record, for example for your query, these two arrays are in the two records that combine to form this edge

[
  [
      "Sensitivity/Response",
      "Sensitivity/Response",
      "Sensitivity/Response",
      "Resistance",
      "Sensitivity/Response",
      "Resistance",
      "Sensitivity/Response"
  ],
  [
      "Sensitivity/Response",
      "Sensitivity/Response",
      "Sensitivity/Response",
      "Sensitivity/Response",
      "Resistance",
      "Resistance",
      "Resistance",
      "Sensitivity/Response",
      "Sensitivity/Response",
      "Sensitivity/Response"
  ]
 ]                         

So, right now my code is making a set or array of these two. Would we like to flatten these arrays? (we can only do this if we are sure no values themselves are of the array type). The results you are showing are because the attributes are only taken from the last record

colleenXu commented 2 years ago

@rjawesome

I think what's happening is that this DBSNP ID actually corresponds to two hits in MyVariant, and the nested nature of the data is what's leading to this...

I think it'd be nice to flatten these arrays and then run a set operation / unique-values-only. I was hoping for something in the end like ["Sensitivity/Response", "Resistance"] for this particular MyVariant thing.

I currently can't think of cases where we wouldn't want to do this for an edge-attribute....but we'd have to test your stuff carefully once it's ready for testing.

rjawesome commented 2 years ago

@colleenXu Should be ready for testing, flattening of arrays/usage of set has been implemented.

colleenXu commented 2 years ago

@rjawesome @tokebe If I try to run the example query from the first post, I get a status 500 response.

console logs ``` bte:biothings-explorer-trapi:QueryResult initialQEdgeID: e01, initialQNodeIDToMatch: n0 +0ms bte:biothings-explorer-trapi:QueryResult result ID: n0-PUBCHEM.COMPOUND:442972_&_n1-NCBIGene:208 has 9 +1ms bte:biothings-explorer-trapi:QueryResult Successfully scored 0 results, couldn't score 1 results. +0ms bte:biothings-explorer-trapi:Graph pruning BTEGraph nodes/edges... +30ms bte:biothings-explorer-trapi:Graph pruned 0 nodes and 0 edges from BTEGraph. +0ms bte:biothings-explorer-trapi:error_handler ReferenceError: Arrays is not defined bte:biothings-explorer-trapi:error_handler at KnowledgeGraph._createAttributes (/Users/colleenxu/Desktop/bte-trapi-workspace/packages/@biothings-explorer/query_graph_handler/built/graph/knowledge_graph.js:162:28) bte:biothings-explorer-trapi:error_handler at KnowledgeGraph._createEdge (/Users/colleenxu/Desktop/bte-trapi-workspace/packages/@biothings-explorer/query_graph_handler/built/graph/knowledge_graph.js:174:30) bte:biothings-explorer-trapi:error_handler at /Users/colleenxu/Desktop/bte-trapi-workspace/packages/@biothings-explorer/query_graph_handler/built/graph/knowledge_graph.js:182:37 bte:biothings-explorer-trapi:error_handler at Array.map () bte:biothings-explorer-trapi:error_handler at KnowledgeGraph.update (/Users/colleenxu/Desktop/bte-trapi-workspace/packages/@biothings-explorer/query_graph_handler/built/graph/knowledge_graph.js:181:37) bte:biothings-explorer-trapi:error_handler at /Users/colleenxu/Desktop/bte-trapi-workspace/packages/@biothings-explorer/query_graph_handler/built/graph/graph.js:110:24 bte:biothings-explorer-trapi:error_handler at Array.map () bte:biothings-explorer-trapi:error_handler at BTEGraph.notify (/Users/colleenxu/Desktop/bte-trapi-workspace/packages/@biothings-explorer/query_graph_handler/built/graph/graph.js:109:26) bte:biothings-explorer-trapi:error_handler at TRAPIQueryHandler.query (/Users/colleenxu/Desktop/bte-trapi-workspace/packages/@biothings-explorer/query_graph_handler/built/index.js:625:23) bte:biothings-explorer-trapi:error_handler at processTicksAndRejections (node:internal/process/task_queues:96:5) +0ms ```
rjawesome commented 2 years ago

Ah I must have made a typo when copying the code over to the other PR... Fix pushed.

tokebe commented 1 year ago

Deployed to prod 🚀