NCATSTranslator / minihackathons


Spurious query C.2 results for systemic scleroderma #305

Closed rtroper closed 2 years ago

rtroper commented 2 years ago

Query C.2 was run using systemic scleroderma (MONDO:0005100) as the disease node and tocilizumab (CHEMBL.COMPOUND:CHEMBL1237022) as the specified drug node. Results are here: https://arax.ncats.io/?r=27582.

Several result graphs legitimately contain systemic scleroderma, e.g., see Results 3-5 (methotrexate, prednisone, cyclophosphamide). However, several other results have other diseases or concepts in place of systemic scleroderma. Below are some examples.

Upon closer inspection, it appears that for each of the unique drug results (node n3 in the query), a result exists for systemic scleroderma as well as for each of the other conditions/concepts (lymphocyte count, hypothyroidism, Crohn's disease, leukocyte count). So, it may be that the set of unique drug results is legitimate, but that the replicate results with these other diseases/concepts are the consequence of a disease synonymization/normalization bug somewhere.

I'm not sure where this issue is arising, so I'm not sure who to assign to. For now, assigning to the ARAX group since the linked results, above, come from ARAX.

Here is the full query for reference:

{
    "workflow": [
        {
            "id": "fill"
        },
        {
            "id": "bind"
        },
        {
            "id": "overlay_compute_ngd",
            "parameters": {
                "virtual_relation_label": "N1",
                "qnode_keys": [
                    "n3",
                    "n1"
                ]
            }
        },
        {
            "id": "overlay_compute_ngd",
            "parameters": {
                "virtual_relation_label": "N2",
                "qnode_keys": [
                    "n2",
                    "n3"
                ]
            }
        },
        {
            "id": "complete_results"
        },
        {
            "id": "filter_results_top_n",
            "parameters": {
                "max_results": 500
            }
        }
    ],
    "message": {
        "query_graph": {
            "nodes": {
                "n0": {
                    "categories": [
                        "biolink:Gene"
                    ],
                    "is_set": true,
                    "constraints": []
                },
                "n1": {
                    "ids": [
                        "CHEMBL.COMPOUND:CHEMBL1237022"
                    ],
                    "categories": [
                        "biolink:SmallMolecule"
                    ],
                    "is_set": false,
                    "constraints": []
                },
                "n2": {
                    "ids": [
                        "MONDO:0005100"
                    ],
                    "categories": [
                        "biolink:Disease"
                    ],
                    "is_set": false,
                    "constraints": []
                },
                "n3": {
                    "categories": [
                        "biolink:SmallMolecule"
                    ],
                    "is_set": false,
                    "constraints": []
                }
            },
            "edges": {
                "e01": {
                    "predicates": [
                        "biolink:interacts_with"
                    ],
                    "subject": "n0",
                    "object": "n1",
                    "constraints": [],
                    "exclude": false
                },
                "e02": {
                    "predicates": [
                        "biolink:genetic_association"
                    ],
                    "subject": "n0",
                    "object": "n2",
                    "constraints": [],
                    "exclude": false
                },
                "e03": {
                    "predicates": [
                        "biolink:interacts_with"
                    ],
                    "subject": "n0",
                    "object": "n3",
                    "constraints": [],
                    "exclude": false
                }
            }
        }
    }
}

dkoslicki commented 2 years ago

@amykglen Is this a Genetics Provider issue? In spot checking, it appears that all the spurious edges are from them, and not from RTX-KG2. @NCATSTranslator/genetics-provider Are you aware of this error?

@amykglen, perhaps a quick fix is to have Expand check whether the nodes returned for a pinned query node are synonymous with the node given in the input query.
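A rough sketch of what such a check could look like, in Python. This is not ARAX's actual Expand code: the `SYNONYMS` table, the `keep_synonymous` helper, and the `EXAMPLE:` CURIEs are all made up for illustration, and a real implementation would presumably get its synonym sets from a node synonymizer/normalizer rather than a hard-coded dict.

```python
from typing import Dict, Iterable, List, Set

# Hypothetical synonym table (assumption): maps a query CURIE to the set of
# CURIEs considered equivalent to it. The EXAMPLE: identifiers are placeholders.
SYNONYMS: Dict[str, Set[str]] = {
    "MONDO:0005100": {"EXAMPLE:ssc_alias"},
}


def keep_synonymous(
    query_curie: str,
    returned_curies: Iterable[str],
    synonyms: Dict[str, Set[str]],
) -> List[str]:
    """Keep only returned CURIEs that equal, or are synonyms of, the query CURIE."""
    # The query CURIE is always allowed, even if the table has no entry for it.
    allowed = set(synonyms.get(query_curie, ())) | {query_curie}
    return [curie for curie in returned_curies if curie in allowed]


returned = ["MONDO:0005100", "EXAMPLE:ssc_alias", "EXAMPLE:unrelated_trait"]
kept = keep_synonymous("MONDO:0005100", returned, SYNONYMS)
print(kept)
```

Anything that is neither the pinned CURIE nor a known synonym is dropped, so the filter is only as good as the synonym sets behind it.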

amykglen commented 2 years ago

yes, I concur with @dkoslicki that this seems to be a Genetics KP issue.

one problem with @dkoslicki's proposed patch is that they could be rightfully returning some diseases that are subclasses of systemic scleroderma, and our synonymization wouldn't know that those are OK. though I suppose we could accept throwing those out as part of the temporary patch.

does anyone at Genetics KP have an estimate as to when this could be addressed? (@marcdubybroad) depending on that we can decide if it's worth putting a patch in place on our end.

marcdubybroad commented 2 years ago

I'll look into this after the 3pm meeting.

marcdubybroad commented 2 years ago

We do have an issue, which we are in the process of fixing, where we return the originally submitted CURIE together with the name of a descendant disease. Could this be causing this issue?

dkoslicki commented 2 years ago

@marcdubybroad I don't think descendants are the issue here: these diseases are not descendants of scleroderma.

marcdubybroad commented 2 years ago

If I submit the following one hop query to the genetics kp, I get no results returned. I assume that other curies are being provided to the genetics kp. Does anyone have these?

{
    "message": {
        "query_graph": {
            "edges": {
                "e00": {
                    "subject": "n00",
                    "object": "n01"
                }
            },
            "nodes": {
                "n00": {
                    "categories": [
                        "biolink:Gene"
                    ]
                },
                "n01": {
                    "ids": [
                        "MONDO:0005100"
                    ]
                }
            }
        }
    }
}

amykglen commented 2 years ago

yes, here's the problem query (for qedge e02): https://arax.ncats.io/api/arax/v1.2/status?id=21952

marcdubybroad commented 2 years ago

Fixed and deployed. Will create a unit test for this issue for future integration testing.
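For what it's worth, a regression check along these lines might look like the sketch below. It is only an illustration, not the Genetics KP's actual test: `find_unpinned_bindings` is a hypothetical helper, the sample result is fabricated with a placeholder `EXAMPLE:` CURIE, and it assumes the usual TRAPI layout where each result has `node_bindings` keyed by query node. Note that it is deliberately strict, so it would also flag legitimate subclass answers.

```python
from typing import Any, Dict, List, Tuple


def find_unpinned_bindings(
    query_graph: Dict[str, Any],
    results: List[Dict[str, Any]],
) -> List[Tuple[int, str, str]]:
    """Return (result index, qnode key, CURIE) for every node binding whose
    CURIE is not among the ids pinned on that query node."""
    offenders = []
    for qnode_key, qnode in query_graph["nodes"].items():
        pinned = qnode.get("ids")
        if not pinned:
            continue  # unpinned query nodes may bind to anything
        for index, result in enumerate(results):
            for binding in result.get("node_bindings", {}).get(qnode_key, []):
                if binding["id"] not in pinned:
                    offenders.append((index, qnode_key, binding["id"]))
    return offenders


# The one-hop query from earlier in this thread.
query_graph = {
    "edges": {"e00": {"subject": "n00", "object": "n01"}},
    "nodes": {
        "n00": {"categories": ["biolink:Gene"]},
        "n01": {"ids": ["MONDO:0005100"]},
    },
}

# Fabricated results for illustration: the second one binds the pinned
# disease node to a placeholder CURIE that was never asked for.
sample_results = [
    {"node_bindings": {"n00": [{"id": "EXAMPLE:gene_a"}], "n01": [{"id": "MONDO:0005100"}]}},
    {"node_bindings": {"n00": [{"id": "EXAMPLE:gene_b"}], "n01": [{"id": "EXAMPLE:other_trait"}]}},
]

offenders = find_unpinned_bindings(query_graph, sample_results)
print(offenders)
```

An integration test could run the problem query and assert that this list comes back empty.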

amykglen commented 2 years ago

awesome, thanks @marcdubybroad! confirmed the problem appears resolved in the larger ARAX query: https://arax.ncats.io/?r=27989 (looks like Genetics KP doesn't find any answers for e02)

think this issue should be good to close, @rtroper?

rtroper commented 2 years ago

Excellent, thank you, everyone! That was fast. The results look great now. I'll go ahead and close it.