RTXteam / RTX-KG2

Build system for the RTX-KG2 biomedical knowledge graph, part of the ARAX reasoning system (https://github.com/RTXTeam/RTX)
MIT License

Inverted UMLS (NCIT) Edge #356

Open acevedol opened 6 months ago

acevedol commented 6 months ago

From @sundareswarpullela: Bug found in ARAX Pytest https://github.com/RTXteam/RTX/blob/4c776b4a27d96e1462173abe7963792608d8b879/code/ARAX/test/test_ARAX_expand.py#L247-L257C1

Example: {'nodes': {'n00': {'ids': ['MONDO:0001280', 'MONDO:0008542', 'MONDO:0005036'], 'categories': ['biolink:DiseaseOrPhenotypicFeature', 'biolink:PhenotypicFeature', 'biolink:Disease'], 'is_set': True, 'constraints': [], 'option_group_id': None}, 'n01': {'ids': None, 'categories': ['biolink:DiseaseOrPhenotypicFeature', 'biolink:PhenotypicFeature', 'biolink:Disease'], 'is_set': True, 'constraints': [], 'option_group_id': None}}, 'edges': {'e00': {'knowledge_type': None, 'predicates': ['biolink:has_phenotype'], 'subject': 'n00', 'object': 'n01', 'attribute_constraints': [], 'qualifier_constraints': [], 'exclude': None, 'option_group_id': None}}}

The UMLS edge between gastric adenocarcinoma and dyspepsia was the first one we discovered to be inverted.

We then ran a general query over all UMLS edges, limited to 20 results, and noticed that all of the returned edges were inverted.

The edge between adeno gastroenteritis and dyspepsia was a biolink:has_phenotype edge

acevedol commented 6 months ago

I compared the edges with subject UMLS:C0267167 between KG2.8.6 and KG2.8.4 and found that most of the edges match, except that 2.8.4's 'UMLS:C0267167---SEMMEDDB:augments---biolink:causes---activity_or_abundance---None---UMLS:C0013299---SEMMEDDB:' does not match either of 2.8.6's 'UMLS:C0013299---SEMMEDDB:coexists_with---None---None---None---UMLS:C0267167---SEMMEDDB:' and 'UMLS:C0013299---SEMMEDDB:causes---None---None---None---UMLS:C0267167---SEMMEDDB:'.
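
A comparison of this kind could be sketched in Python roughly as follows. This is only an illustration, not part of the KG2 build code: it assumes that edge ids use the `---`-delimited layout seen above, with the subject in the first field and the object in the sixth, and the helper names `endpoints` and `find_flipped` are hypothetical. A swapped-endpoint pair is only a candidate for inspection; it does not by itself show which direction is correct.

```python
def endpoints(edge_id):
    """Return (subject, object) from a '---'-delimited edge id (assumed layout)."""
    parts = edge_id.split("---")
    return parts[0], parts[5]


def find_flipped(old_ids, new_ids):
    """Pair ids found in only one version whose endpoints are swapped in the other."""
    only_old = set(old_ids) - set(new_ids)
    only_new = set(new_ids) - set(old_ids)
    new_by_endpoints = {}
    for edge_id in only_new:
        new_by_endpoints.setdefault(endpoints(edge_id), []).append(edge_id)
    flipped = []
    for edge_id in sorted(only_old):
        subj, obj = endpoints(edge_id)
        # Look for an edge in the new version that runs object -> subject.
        for candidate in sorted(new_by_endpoints.get((obj, subj), [])):
            flipped.append((edge_id, candidate))
    return flipped


# The three edge ids quoted in this comment.
old = [
    "UMLS:C0267167---SEMMEDDB:augments---biolink:causes---activity_or_abundance---None---UMLS:C0013299---SEMMEDDB:",
]
new = [
    "UMLS:C0013299---SEMMEDDB:coexists_with---None---None---None---UMLS:C0267167---SEMMEDDB:",
    "UMLS:C0013299---SEMMEDDB:causes---None---None---None---UMLS:C0267167---SEMMEDDB:",
]
pairs = find_flipped(old, new)
for old_id, new_id in pairs:
    print(old_id, "<->", new_id)
```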

saramsey commented 6 months ago

In KG2.8.6c, the first node is MONDO:0005036 (gastric adenocarcinoma)

saramsey commented 6 months ago

In KG2.8.6c, the second node is MONDO:0002268 (dyspepsia), I think

saramsey commented 6 months ago

This is the edge that we thought was inverted in KG2.8.6pre:

{
  "predicate": "biolink:has_phenotype",
  "primary_knowledge_source": "infores:ncit",
  "domain_range_exclusion": "False",
  "publications_info": "{}",
  "kg2_ids": [
    "UMLS:C0013395---NCIT:disease_may_have_finding---None---None---None---UMLS:C0278701---umls_source:NCI"
  ],
  "subject": "MONDO:0002268",
  "id": "1467367",
  "object": "MONDO:0005036"
}
sundareswarpullela commented 6 months ago

The issue was first detected when the following pytest in test_ARAX_expand.py failed:

def test_curie_list_query():
    actions_list = [
        "add_qnode(ids=[DOID:6419, DOID:3717, DOID:11406], key=n00)",
        "add_qnode(categories=biolink:PhenotypicFeature, key=n01)",
        "add_qedge(subject=n00, object=n01, predicates=biolink:has_phenotype, key=e00)",
        "expand(kp=infores:rtx-kg2)",
        "return(message=true, store=false)"
    ]
    nodes_by_qg_id, edges_by_qg_id = _run_query_and_do_standard_testing(actions_list)
    assert len(nodes_by_qg_id["n00"]) >= 3

The result after running the above query in the KG2.8.4 ARAX UI is shown in the attached screenshot (Screenshot 2023-12-18 at 3 34 25 PM). Observe the direction of the edge; this is the correct direction.

saramsey commented 6 months ago

Here is the incorrect edge, in /home/ubuntu/kg2-build/kg2-umls-edges.jsonl on kg286build.rtx.ai:

ubuntu@ip-172-31-50-116:~/kg2-build$ grep 'C0278701' kg2-umls-edges.jsonl | grep C0013395
{"domain_range_exclusion": false, "id": "UMLS:C0013395---NCIT:disease_may_have_finding---None---None---None---UMLS:C0278701---umls_source:NCI", "negated": false, "object": "UMLS:C0278701", "predicate": null, "primary_knowledge_source": "umls_source:NCI", "publications": [], "publications_info": {}, "qualified_object_aspect": null, "qualified_object_direction": null, "qualified_predicate": null, "relation_label": "disease_may_have_finding", "source_predicate": "NCIT:disease_may_have_finding", "subject": "UMLS:C0013395", "update_date": "2023"}
{"domain_range_exclusion": false, "id": "UMLS:C0278701---NCIT:may_be_finding_of_disease---None---None---None---UMLS:C0013395---umls_source:NCI", "negated": false, "object": "UMLS:C0013395", "predicate": null, "primary_knowledge_source": "umls_source:NCI", "publications": [], "publications_info": {}, "qualified_object_aspect": null, "qualified_object_direction": null, "qualified_predicate": null, "relation_label": "may_be_finding_of_disease", "source_predicate": "NCIT:may_be_finding_of_disease", "subject": "UMLS:C0278701", "update_date": "2023"}

The second of the two edges gets deleted in the build process. The first of the two edges shown looks to be inverted. I do not know whether this is due to a bug in the KG2pre build system or to an issue in UMLS.
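
For anyone who wants to repeat this check without chaining two greps, here is a minimal Python sketch. It assumes only that each line of the JSONL file is a JSON object with `subject` and `object` fields, as in the output above; the function name `edges_between` is hypothetical, and the third sample line uses a made-up CUI purely to show that unrelated edges are skipped.

```python
import io
import json


def edges_between(jsonl_lines, curie_a, curie_b):
    """Yield edges that connect the two CURIEs, in either direction."""
    wanted = {curie_a, curie_b}
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue
        edge = json.loads(line)
        if {edge["subject"], edge["object"]} == wanted:
            yield edge


# Trimmed-down copies of the two edges above, plus one made-up unrelated edge.
sample = io.StringIO(
    '{"subject": "UMLS:C0013395", "object": "UMLS:C0278701", "source_predicate": "NCIT:disease_may_have_finding"}\n'
    '{"subject": "UMLS:C0278701", "object": "UMLS:C0013395", "source_predicate": "NCIT:may_be_finding_of_disease"}\n'
    '{"subject": "UMLS:C0000001", "object": "UMLS:C0013395", "source_predicate": "NCIT:isa"}\n'
)
found = list(edges_between(sample, "UMLS:C0013395", "UMLS:C0278701"))
for edge in found:
    print(edge["subject"], edge["source_predicate"], edge["object"])
```

With a real file, the same function can be given an open file handle, for example `open("kg2-umls-edges.jsonl")`, in place of the `io.StringIO` sample.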

saramsey commented 6 months ago

This is the edge in KG2.8.6pre:

{
  "predicate": "biolink:has_phenotype",
  "domain_range_exclusion": "False",
  "negated": "False",
  "primary_knowledge_source": "infores:ncit",
  "relation_label": "disease_may_have_finding",
  "publications_info": "{}",
  "subject": "UMLS:C0013395",
  "source_predicate": "NCIT:disease_may_have_finding",
  "predicate_label": "disease_may_have_finding",
  "id": "UMLS:C0013395---NCIT:disease_may_have_finding---None---None---None---UMLS:C0278701---umls_source:NCI",
  "update_date": "2023",
  "object": "UMLS:C0278701"
}
saramsey commented 6 months ago

Evidence from the MRREL table in the umls mysql database on kg286build.rtx.ai:

mysql> select * from MRREL where CUI1='C0013395' and CUI2='C0278701';
+----------+----------+--------+-----+----------+----------+--------+--------------------------+------------+------+-----+-----+------+------+----------+------+
| CUI1     | AUI1     | STYPE1 | REL | CUI2     | AUI2     | STYPE2 | RELA                     | RUI        | SRUI | SAB | SL  | RG   | DIR  | SUPPRESS | CVF  |
+----------+----------+--------+-----+----------+----------+--------+--------------------------+------------+------+-----+-----+------+------+----------+------+
| C0013395 | A7570180 | SCUI   | RO  | C0278701 | A7597390 | SCUI   | disease_may_have_finding | R168808278 | NULL | NCI | NCI | NULL | NULL | N        | NULL |
+----------+----------+--------+-----+----------+----------+--------+--------------------------+------------+------+-----+-----+------+------+----------+------+
1 row in set (0.00 sec)
sundareswarpullela commented 6 months ago

@saramsey can you post the example of the query where we queried all UMLS/NCIT edges, limited to 20 results, and observed that the returned edges showed multiple inverted relations between an Asian subgroup of people and their country? This was the final query that confirmed our suspicions.

saramsey commented 6 months ago

Checking subject and object for UMLS:

mysql> select * from MRREL where RELA is not NULL and RELA <> 'associated_with' and RELA <> 'clinically_associated_with' and RELA <> 'co-occurs_with' and RELA <> 'ddx' and RELA <> 'ssc' limit 10;
+----------+----------+--------+-----+----------+----------+--------+-------------+-----------+------+-----+-----+------+------+----------+------+
| CUI1     | AUI1     | STYPE1 | REL | CUI2     | AUI2     | STYPE2 | RELA        | RUI       | SRUI | SAB | SL  | RG   | DIR  | SUPPRESS | CVF  |
+----------+----------+--------+-----+----------+----------+--------+-------------+-----------+------+-----+-----+------+------+----------+------+
| C0005790 | A2773839 | AUI    | RO  | C0005778 | A2773838 | AUI    | measured_by | R00884377 | NULL | CPM | CPM | NULL | N    | N        | NULL |
| C1255279 | A2774262 | AUI    | RO  | C3537249 | A2774080 | AUI    | measured_by | R00884378 | NULL | CPM | CPM | NULL | N    | N        | NULL |
| C1255446 | A2774263 | AUI    | RO  | C0002520 | A2773809 | AUI    | measured_by | R00884379 | NULL | CPM | CPM | NULL | N    | N        | NULL |
| C1255552 | A2774264 | AUI    | RO  | C0596019 | A2773764 | AUI    | measured_by | R00884380 | NULL | CPM | CPM | NULL | N    | N        | NULL |
| C1254417 | A2774268 | AUI    | RO  | C0004611 | A2773759 | AUI    | measured_by | R00884381 | NULL | CPM | CPM | NULL | N    | N        | NULL |
| C1254418 | A2774269 | AUI    | RO  | C0004611 | A2773759 | AUI    | measured_by | R00884382 | NULL | CPM | CPM | NULL | N    | N        | NULL |
| C1254394 | A2774270 | AUI    | RO  | C0004611 | A2773759 | AUI    | measured_by | R00884383 | NULL | CPM | CPM | NULL | N    | N        | NULL |
| C1254416 | A2774271 | AUI    | RO  | C0004611 | A2773759 | AUI    | measured_by | R00884384 | NULL | CPM | CPM | NULL | N    | N        | NULL |
| C1254377 | A2774272 | AUI    | RO  | C0004611 | A2773759 | AUI    | measured_by | R00884385 | NULL | CPM | CPM | NULL | N    | N        | NULL |
| C1254387 | A2774273 | AUI    | RO  | C0004611 | A2773759 | AUI    | measured_by | R00884386 | NULL | CPM | CPM | NULL | N    | N        | NULL |
+----------+----------+--------+-----+----------+----------+--------+-------------+-----------+------+-----+-----+------+------+----------+------+
10 rows in set (0.14 sec)
saramsey commented 6 months ago

From empirically testing the first two rows, CUI2 is likely the subject and CUI1 is the object.

saramsey commented 6 months ago

This specific relationship appears to be correct in UMLS:

mysql> select * from MRREL where CUI1='C0013395' and CUI2='C0278701';
+----------+----------+--------+-----+----------+----------+--------+--------------------------+------------+------+-----+-----+------+------+----------+------+
| CUI1     | AUI1     | STYPE1 | REL | CUI2     | AUI2     | STYPE2 | RELA                     | RUI        | SRUI | SAB | SL  | RG   | DIR  | SUPPRESS | CVF  |
+----------+----------+--------+-----+----------+----------+--------+--------------------------+------------+------+-----+-----+------+------+----------+------+
| C0013395 | A7570180 | SCUI   | RO  | C0278701 | A7597390 | SCUI   | disease_may_have_finding | R168808278 | NULL | NCI | NCI | NULL | NULL | N        | NULL |
+----------+----------+--------+-----+----------+----------+--------+--------------------------+------------+------+-----+-----+------+------+----------+------+
sundareswarpullela commented 6 months ago

The edge with the id UMLS:C1519427---NCIT:isa---None---None---None---UMLS:C1553327---umls_source:NCI in KG2.8.6pre states that South Asian People is a subclass of Sri Lankan, which is factually false. This is the edge that confirmed to us that the edges got inverted.

saramsey commented 6 months ago

@saramsey can you post the example of the query where we queried all UMLS/NCIT edges, limited to 20 results, and observed that the returned edges showed multiple inverted relations between an Asian subgroup of people and their country? This was the final query that confirmed our suspicions.

Which query? Cypher query to the KG2pre Neo4j?

sundareswarpullela commented 6 months ago

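Yes. I believe it was something like this (written with a directed pattern, since an undirected pattern would return every edge in both orientations):

match (n)-[r {primary_knowledge_source: "infores:ncit"}]->(m) return n.name, r.predicate, m.name limit 100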
saramsey commented 6 months ago

Thank you @sundareswarpullela.

@acevedol can you please run the above query against the KG2.8.6pre Neo4j and paste the results here?

acevedol commented 6 months ago

@saramsey Do you mind reviewing my reasoning here? In umls_mysql_to_list_jsonl.py, the relations seem to be extracted using

relations_sql_statement = "SELECT DISTINCT CUI1, REL, RELA, DIR, CUI2, SAB FROM MRREL WHERE SAB IN " + sources_where

and then unpacked with

(cui_object, rel, rela, direction, cui_subject, source) = result

which leads me to believe that CUI1 is correctly labeled as cui_object and CUI2 as cui_subject, following the observations above. However, the relation is added with

relation_type_key = ','.join([str(rel), str(rela), str(direction)])
if source not in cui_source_info[key][relation_key]:
    cui_source_info[key][relation_key][source] = dict()
if relation_type_key not in cui_source_info[key][relation_key][source]:
    cui_source_info[key][relation_key][source][relation_type_key] = list()
cui_source_info[key][relation_key][source][relation_type_key].append(cui_subject)

Then the relations are read into a kg2-edges.jsonl file by umls_list_jsonl_to_kg_jsonl.py. My understanding is that the key should be derived from the subject rather than from cui_object. For comparison, in semmeddb_tuplelist_json_to_kg_jsonl.py the subject comes first: key = subject_curie + '-' + predicate + '-' + object_curie

saramsey commented 6 months ago

@acevedol confirms that the latest code pertinent to this issue is in master.

sundareswarpullela commented 6 months ago

I have attached the complete result (export.csv). In the screenshot, we see that RaceAsian is a subclass of Bhutanese. I'm not sure whether this is inherited from the raw UMLS database or is due to the inverted edges.

saramsey commented 5 months ago

(CUI1) C0013395: dyspepsia
(CUI2) C0278701: gastric adenocarcinoma

So CUI2 is the subject, and CUI1 is the object.
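
Under that reading, turning an MRREL row into a directed edge could be sketched as below. This is a hedged illustration and not the code in umls_util.py: the `Edge` type and the function `mrrel_row_to_edge` are hypothetical names, the `UMLS:` prefix simply mirrors the CURIEs seen in the edges above, and falling back to REL when RELA is NULL is an assumption made for the example.

```python
from typing import NamedTuple, Optional


class Edge(NamedTuple):
    subject: str
    predicate: str
    object: str


def mrrel_row_to_edge(cui1: str, rel: str, rela: Optional[str], cui2: str) -> Edge:
    """Map one MRREL row to an edge, reading the row as 'CUI2 RELA CUI1'."""
    # Assumption for this sketch: use REL when RELA is NULL.
    label = rela if rela is not None else rel
    return Edge(subject="UMLS:" + cui2, predicate=label, object="UMLS:" + cui1)


# The MRREL row discussed in this thread: CUI1 = dyspepsia, CUI2 = gastric adenocarcinoma.
edge = mrrel_row_to_edge("C0013395", "RO", "disease_may_have_finding", "C0278701")
print(edge.subject, edge.predicate, edge.object)
```

The point of the sketch is only the argument order: the value read from the CUI2 column ends up as the subject, and the value from CUI1 as the object.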

saramsey commented 5 months ago

I suspect the issue may be here: https://github.com/RTXteam/RTX-KG2/blob/11eeb4d825497619012b245af96ede3b00881926/umls_util.py#L129

i.e., that variable names subject_id on L129 and object_id on L140 should be swapped. Tagging @acevedol.

acevedol commented 4 months ago

On a test build of KG2.9.0, I did not find the problematic edge "UMLS:C0013395---NCIT:disease_may_have_finding---None---None---None---UMLS:C0278701---umls_source:NCI".

(Screenshot: KG2.9.0 UMLS edge)

The reversed edge is present:

{"domain_range_exclusion": false, "id": "UMLS:C0278701---NCIT:disease_may_have_finding---None---None---None---UMLS:C0013395---umls_source:NCI", "negated": false, "object": "UMLS:C0013395", "predicate": "biolink:has_phenotype", "predicate_label": "disease_may_have_finding", "primary_knowledge_source": "infores:ncit", "publications": [], "publications_info": {}, "qualified_object_aspect": null, "qualified_object_direction": null, "qualified_predicate": null, "relation_label": "disease_may_have_finding", "source_predicate": "NCIT:disease_may_have_finding", "subject": "UMLS:C0278701", "update_date": "2023"}
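
A quick consistency check on an edge like this one could look like the sketch below. It is an illustration under stated assumptions, not part of the KG2 test suite: it assumes the edge id has seven `---`-delimited fields with the subject first and the object sixth, as in the ids shown in this thread, and the function name `id_matches_endpoints` is hypothetical.

```python
import json


def id_matches_endpoints(edge):
    """Check that the subject and object embedded in the id match the edge fields."""
    parts = edge["id"].split("---")
    return len(parts) == 7 and parts[0] == edge["subject"] and parts[5] == edge["object"]


# A trimmed-down copy of the KG2.9.0 edge above.
line = (
    '{"id": "UMLS:C0278701---NCIT:disease_may_have_finding---None---None---None---UMLS:C0013395---umls_source:NCI", '
    '"subject": "UMLS:C0278701", "object": "UMLS:C0013395", "predicate": "biolink:has_phenotype"}'
)
edge = json.loads(line)
ok = id_matches_endpoints(edge)
# The has_phenotype edge should run from the disease (C0278701) to the finding (C0013395).
disease_is_subject = edge["subject"] == "UMLS:C0278701"
print(ok, disease_is_subject)
```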