Open acevedol opened 6 months ago
I checked the edges with subject UMLS:C0267167 between KG2.8.6 and KG2.8.4 and found that most of the edges match. The exception is 2.8.4's 'UMLS:C0267167---SEMMEDDB:augments---biolink:causes---activity_or_abundance---None---UMLS:C0013299---SEMMEDDB:',
which does not match 2.8.6's 'UMLS:C0013299---SEMMEDDB:coexists_with---None---None---None---UMLS:C0267167---SEMMEDDB:' and 'UMLS:C0013299---SEMMEDDB:causes---None---None---None---UMLS:C0267167---SEMMEDDB:'.
In KG2.8.6c, the first node is MONDO:0005036 (gastric adenocarcinoma).
In KG2.8.6c, the second node is MONDO:0002268 (dyspepsia), I think.
This is the edge that we thought was inverted in KG2.8.6pre:
{
"predicate": "biolink:has_phenotype",
"primary_knowledge_source": "infores:ncit",
"domain_range_exclusion": "False",
"publications_info": "{}",
"kg2_ids": [
"UMLS:C0013395---NCIT:disease_may_have_finding---None---None---None---UMLS:C0278701---umls_source:NCI"
],
"subject": "MONDO:0002268",
"id": "1467367",
"object": "MONDO:0005036"
}
The issue was first detected when the following pytest in test_ARAX_expand.py failed:
def test_curie_list_query():
    actions_list = [
        "add_qnode(ids=[DOID:6419, DOID:3717, DOID:11406], key=n00)",
        "add_qnode(categories=biolink:PhenotypicFeature, key=n01)",
        "add_qedge(subject=n00, object=n01, predicates=biolink:has_phenotype, key=e00)",
        "expand(kp=infores:rtx-kg2)",
        "return(message=true, store=false)"
    ]
    nodes_by_qg_id, edges_by_qg_id = _run_query_and_do_standard_testing(actions_list)
    assert len(nodes_by_qg_id["n00"]) >= 3
The result of running the above query in the KG2.8.4 ARAX UI shows the edge in the correct direction.
Here is the incorrect edge, in /home/ubuntu/kg2-build/kg2-umls-edges.jsonl on kg286build.rtx.ai:
ubuntu@ip-172-31-50-116:~/kg2-build$ grep 'C0278701' kg2-umls-edges.jsonl | grep C0013395
{"domain_range_exclusion": false, "id": "UMLS:C0013395---NCIT:disease_may_have_finding---None---None---None---UMLS:C0278701---umls_source:NCI", "negated": false, "object": "UMLS:C0278701", "predicate": null, "primary_knowledge_source": "umls_source:NCI", "publications": [], "publications_info": {}, "qualified_object_aspect": null, "qualified_object_direction": null, "qualified_predicate": null, "relation_label": "disease_may_have_finding", "source_predicate": "NCIT:disease_may_have_finding", "subject": "UMLS:C0013395", "update_date": "2023"}
{"domain_range_exclusion": false, "id": "UMLS:C0278701---NCIT:may_be_finding_of_disease---None---None---None---UMLS:C0013395---umls_source:NCI", "negated": false, "object": "UMLS:C0013395", "predicate": null, "primary_knowledge_source": "umls_source:NCI", "publications": [], "publications_info": {}, "qualified_object_aspect": null, "qualified_object_direction": null, "qualified_predicate": null, "relation_label": "may_be_finding_of_disease", "source_predicate": "NCIT:may_be_finding_of_disease", "subject": "UMLS:C0278701", "update_date": "2023"}
The second of the two edges gets deleted in the build process. The first of the two edges shown looks to be inverted. I do not know whether this is due to a bug in the KG2pre build system or to an issue in UMLS.
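For anyone who wants to repeat the grep check above without depending on the textual order of the CUIs in each record, here is a minimal sketch in Python. It assumes the file is JSON Lines with `subject`, `object`, and `source_predicate` fields, as in the two records shown above. The function names `edges_between` and `summarize` are made up for this example and are not part of the KG2 build code.

```python
import json
from typing import Iterable, Iterator


def edges_between(lines: Iterable[str], curie_a: str, curie_b: str) -> Iterator[dict]:
    """Yield edges connecting the two CURIEs, in either direction."""
    wanted = {curie_a, curie_b}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        edge = json.loads(line)
        if {edge.get("subject"), edge.get("object")} == wanted:
            yield edge


def summarize(edge: dict) -> str:
    """Render one edge as 'subject -[source_predicate]-> object'."""
    return f'{edge["subject"]} -[{edge["source_predicate"]}]-> {edge["object"]}'


if __name__ == "__main__":
    # Two abbreviated records modeled on the grep output above.
    sample = [
        '{"subject": "UMLS:C0013395", "object": "UMLS:C0278701", '
        '"source_predicate": "NCIT:disease_may_have_finding"}',
        '{"subject": "UMLS:C0278701", "object": "UMLS:C0013395", '
        '"source_predicate": "NCIT:may_be_finding_of_disease"}',
    ]
    for edge in edges_between(sample, "UMLS:C0013395", "UMLS:C0278701"):
        print(summarize(edge))
```

To run it against the real file, pass an open file handle (for example `open("kg2-umls-edges.jsonl")`) as `lines`. Printing the direction explicitly makes it easier to see which CUI ends up as the subject of `disease_may_have_finding`.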
This is the edge in KG2.8.6pre:
{
"predicate": "biolink:has_phenotype",
"domain_range_exclusion": "False",
"negated": "False",
"primary_knowledge_source": "infores:ncit",
"relation_label": "disease_may_have_finding",
"publications_info": "{}",
"subject": "UMLS:C0013395",
"source_predicate": "NCIT:disease_may_have_finding",
"predicate_label": "disease_may_have_finding",
"id": "UMLS:C0013395---NCIT:disease_may_have_finding---None---None---None---UMLS:C0278701---umls_source:NCI",
"update_date": "2023",
"object": "UMLS:C0278701"
}
Evidence from the MRREL table in the umls mysql database on kg286build.rtx.ai:
mysql> select * from MRREL where CUI1='C0013395' and CUI2='C0278701';
+----------+----------+--------+-----+----------+----------+--------+--------------------------+------------+------+-----+-----+------+------+----------+------+
| CUI1 | AUI1 | STYPE1 | REL | CUI2 | AUI2 | STYPE2 | RELA | RUI | SRUI | SAB | SL | RG | DIR | SUPPRESS | CVF |
+----------+----------+--------+-----+----------+----------+--------+--------------------------+------------+------+-----+-----+------+------+----------+------+
| C0013395 | A7570180 | SCUI | RO | C0278701 | A7597390 | SCUI | disease_may_have_finding | R168808278 | NULL | NCI | NCI | NULL | NULL | N | NULL |
+----------+----------+--------+-----+----------+----------+--------+--------------------------+------------+------+-----+-----+------+------+----------+------+
1 row in set (0.00 sec)
@saramsey can you post the example of the query where we queried all UMLS/NCIT edges, limited to 20, and observed that all the returned edges showed inverted relations between an Asian subgroup of people and their country? This was the final query that confirmed our suspicions.
Checking subject and object for UMLS:
mysql> select * from MRREL where RELA is not NULL and RELA <> 'associated_with' and RELA <> 'clinically_associated_with' and RELA <> 'co-occurs_with' and RELA <> 'ddx' and RELA <> 'ssc' limit 10;
+----------+----------+--------+-----+----------+----------+--------+-------------+-----------+------+-----+-----+------+------+----------+------+
| CUI1 | AUI1 | STYPE1 | REL | CUI2 | AUI2 | STYPE2 | RELA | RUI | SRUI | SAB | SL | RG | DIR | SUPPRESS | CVF |
+----------+----------+--------+-----+----------+----------+--------+-------------+-----------+------+-----+-----+------+------+----------+------+
| C0005790 | A2773839 | AUI | RO | C0005778 | A2773838 | AUI | measured_by | R00884377 | NULL | CPM | CPM | NULL | N | N | NULL |
| C1255279 | A2774262 | AUI | RO | C3537249 | A2774080 | AUI | measured_by | R00884378 | NULL | CPM | CPM | NULL | N | N | NULL |
| C1255446 | A2774263 | AUI | RO | C0002520 | A2773809 | AUI | measured_by | R00884379 | NULL | CPM | CPM | NULL | N | N | NULL |
| C1255552 | A2774264 | AUI | RO | C0596019 | A2773764 | AUI | measured_by | R00884380 | NULL | CPM | CPM | NULL | N | N | NULL |
| C1254417 | A2774268 | AUI | RO | C0004611 | A2773759 | AUI | measured_by | R00884381 | NULL | CPM | CPM | NULL | N | N | NULL |
| C1254418 | A2774269 | AUI | RO | C0004611 | A2773759 | AUI | measured_by | R00884382 | NULL | CPM | CPM | NULL | N | N | NULL |
| C1254394 | A2774270 | AUI | RO | C0004611 | A2773759 | AUI | measured_by | R00884383 | NULL | CPM | CPM | NULL | N | N | NULL |
| C1254416 | A2774271 | AUI | RO | C0004611 | A2773759 | AUI | measured_by | R00884384 | NULL | CPM | CPM | NULL | N | N | NULL |
| C1254377 | A2774272 | AUI | RO | C0004611 | A2773759 | AUI | measured_by | R00884385 | NULL | CPM | CPM | NULL | N | N | NULL |
| C1254387 | A2774273 | AUI | RO | C0004611 | A2773759 | AUI | measured_by | R00884386 | NULL | CPM | CPM | NULL | N | N | NULL |
+----------+----------+--------+-----+----------+----------+--------+-------------+-----------+------+-----+-----+------+------+----------+------+
10 rows in set (0.14 sec)
From empirically testing the first two rows, CUI2 is likely the subject and CUI1 is the object.
This specific relationship appears to be correct in UMLS:
mysql> select * from MRREL where CUI1='C0013395' and CUI2='C0278701';
+----------+----------+--------+-----+----------+----------+--------+--------------------------+------------+------+-----+-----+------+------+----------+------+
| CUI1 | AUI1 | STYPE1 | REL | CUI2 | AUI2 | STYPE2 | RELA | RUI | SRUI | SAB | SL | RG | DIR | SUPPRESS | CVF |
+----------+----------+--------+-----+----------+----------+--------+--------------------------+------------+------+-----+-----+------+------+----------+------+
| C0013395 | A7570180 | SCUI | RO | C0278701 | A7597390 | SCUI | disease_may_have_finding | R168808278 | NULL | NCI | NCI | NULL | NULL | N | NULL |
+----------+----------+--------+-----+----------+----------+--------+--------------------------+------------+------+-----+-----+------+------+----------+------+
The edge with the id UMLS:C1519427---NCIT:isa---None---None---None---UMLS:C1553327---umls_source:NCI in KG2.8.6pre shows that South Asian People is a subclass of Sri Lankan, which is factually false. This is the edge that confirmed to us that the edges got inverted.
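A spot check like this one could be turned into a small regression test. The sketch below is only an illustration: the function name `find_inverted_isa`, the list `KNOWN_SUBCLASS_PAIRS`, and the predicate string `biolink:subclass_of` are assumptions made for the example, not names taken from the KG2 code. It expects (subject name, predicate, object name) tuples, such as rows exported from a directed graph query.

```python
# Hypothetical spot-check list: (subclass, superclass) pairs whose direction
# is not in doubt. The single entry comes from the edge discussed above.
KNOWN_SUBCLASS_PAIRS = [
    ("Sri Lankan", "South Asian People"),
]


def find_inverted_isa(edges, known_pairs=KNOWN_SUBCLASS_PAIRS,
                      isa_predicate="biolink:subclass_of"):
    """Return edges that assert 'superclass isa subclass' for a known pair.

    `edges` is an iterable of (subject_name, predicate, object_name) tuples.
    `isa_predicate` is an assumed predicate label; adjust it to whatever the
    graph under test actually uses for NCIT:isa.
    """
    # Flip each known pair so the lookup matches the *wrong* direction.
    inverted_lookup = {(sup, sub) for sub, sup in known_pairs}
    return [
        (subj, pred, obj)
        for subj, pred, obj in edges
        if pred == isa_predicate and (subj, obj) in inverted_lookup
    ]


rows = [
    ("South Asian People", "biolink:subclass_of", "Sri Lankan"),  # inverted
    ("Sri Lankan", "biolink:subclass_of", "South Asian People"),  # correct
]
print(find_inverted_isa(rows))
```

An empty result for a build would mean that none of the listed pairs is inverted. It would not prove that every edge is correct.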
@saramsey can you post the example of the query where we queried all UMLS/NCIT edges, limited to 20, and observed that all the returned edges showed inverted relations between an Asian subgroup of people and their country? This was the final query that confirmed our suspicions.
Which query? Cypher query to the KG2pre Neo4j?
Yes. I believe it was something like this
match (n)-[r {primary_knowledge_source: "infores:ncit"}]->(m) return n.name, r.predicate, m.name limit 100
Thank you @sundareswarpullela.
@acevedol can you please run the above query against the KG2.8.6pre Neo4j and paste the results here?
@saramsey Do you mind reviewing my reasoning here?
In umls_mysql_to_list_jsonl.py, the relations seem to be extracted using
relations_sql_statement = "SELECT DISTINCT CUI1, REL, RELA, DIR, CUI2, SAB FROM MRREL WHERE SAB IN " + sources_where
then
(cui_object, rel, rela, direction, cui_subject, source) = result
which leads me to believe that CUI1 is correctly labeled as cui_object and CUI2 as cui_subject, following the observations above.
However, the relation is added with
relation_type_key = ','.join([str(rel), str(rela), str(direction)])
if source not in cui_source_info[key][relation_key]:
    cui_source_info[key][relation_key][source] = dict()
if relation_type_key not in cui_source_info[key][relation_key][source]:
    cui_source_info[key][relation_key][source][relation_type_key] = list()
cui_source_info[key][relation_key][source][relation_type_key].append(cui_subject)
Then the relations are read into a kg2-edges.jsonl file by umls_list_jsonl_to_kg_jsonl.py.
My understanding is that the key should come from the subject instead of cui_object.
For comparison, in semmeddb_tuplelist_json_to_kg_jsonl.py the subject comes first: key = subject_curie + '-' + predicate + '-' + object_curie
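To make the concern concrete, here is a toy sketch of how a nested mapping keyed by one CUI can produce inverted edges when the code that reads it assumes the opposite convention. This is not the actual KG2 code: `store_relation` and `read_edges` are hypothetical helpers, and the mapping is much simpler than `cui_source_info`.

```python
from collections import defaultdict


def store_relation(info, cui_object, rela, cui_subject):
    """Writer: key the mapping by the object CUI and list the subject CUIs."""
    info[cui_object][rela].append(cui_subject)


def read_edges(info, key_is_subject):
    """Reader: turn the mapping back into (subject, rela, object) triples.

    If `key_is_subject` disagrees with the writer's convention, every
    triple comes out inverted.
    """
    triples = []
    for key, by_rela in info.items():
        for rela, others in by_rela.items():
            for other in others:
                if key_is_subject:
                    triples.append((key, rela, other))
                else:
                    triples.append((other, rela, key))
    return triples


info = defaultdict(lambda: defaultdict(list))
# MRREL row from this thread: CUI1=C0013395 (object), CUI2=C0278701 (subject).
store_relation(info, "C0013395", "disease_may_have_finding", "C0278701")

print(read_edges(info, key_is_subject=False))  # agrees with the writer
print(read_edges(info, key_is_subject=True))   # inverted
```

The point of the sketch is that the data structure itself carries no direction. The direction exists only in the convention shared by writer and reader, so a mismatch on either side flips every edge from that source.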
@acevedol confirms that the latest code pertinent to this issue is in master.
I have attached the complete result of the query above (export.csv). In the screenshots, RaceAsian is a subclass of Bhutanese. I'm not sure whether this is inherited from the raw UMLS database or is due to the inverted edges.
(CUI1) C0013395: dyspepsia (CUI2) C0278701: gastric adenocarcinoma
The RELA for this row is disease_may_have_finding, and the disease here is gastric adenocarcinoma. So CUI2 is the subject, and CUI1 is the object.
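As a sketch of that convention, the following Python builds an edge from the fields of one MRREL row, with CUI2 as the subject and CUI1 as the object. The function name `mrrel_row_to_edge` and its parameters are made up for illustration. The id layout (seven fields joined by `---`) is copied from the edge ids quoted in this thread, not from the build code.

```python
def mrrel_row_to_edge(cui1, rela, cui2, sab_prefix, source_curie):
    """Build a minimal edge dict, treating CUI2 as subject and CUI1 as object.

    `sab_prefix` is the CURIE prefix used for the source predicate
    (e.g. "NCIT"), and `source_curie` is the trailing source field of the
    edge id (e.g. "umls_source:NCI"). Both are passed in rather than derived.
    """
    subject = f"UMLS:{cui2}"
    obj = f"UMLS:{cui1}"
    source_predicate = f"{sab_prefix}:{rela}"
    # The three "None" fields stand for the unset qualifier slots in the id.
    edge_id = "---".join(
        [subject, source_predicate, "None", "None", "None", obj, source_curie]
    )
    return {
        "id": edge_id,
        "subject": subject,
        "object": obj,
        "source_predicate": source_predicate,
    }


edge = mrrel_row_to_edge(
    cui1="C0013395",  # dyspepsia
    rela="disease_may_have_finding",
    cui2="C0278701",  # gastric adenocarcinoma
    sab_prefix="NCIT",
    source_curie="umls_source:NCI",
)
print(edge["id"])
```

With this convention the disease (gastric adenocarcinoma) is the subject of disease_may_have_finding and the finding (dyspepsia) is the object.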
I suspect the issue may be here: https://github.com/RTXteam/RTX-KG2/blob/11eeb4d825497619012b245af96ede3b00881926/umls_util.py#L129
i.e., that the variable names subject_id on L129 and object_id on L140 should be swapped. Tagging @acevedol.
On the test build of KG2.9.0, I did not find the problematic edge "UMLS:C0013395---NCIT:disease_may_have_finding---None---None---None---UMLS:C0278701---umls_source:NCI".
The reversed edge is present:
{"domain_range_exclusion": false, "id": "UMLS:C0278701---NCIT:disease_may_have_finding---None---None---None---UMLS:C0013395---umls_source:NCI", "negated": false, "object": "UMLS:C0013395", "predicate": "biolink:has_phenotype", "predicate_label": "disease_may_have_finding", "primary_knowledge_source": "infores:ncit", "publications": [], "publications_info": {}, "qualified_object_aspect": null, "qualified_object_direction": null, "qualified_predicate": null, "relation_label": "disease_may_have_finding", "source_predicate": "NCIT:disease_may_have_finding", "subject": "UMLS:C0278701", "update_date": "2023"}
From @sundareswarpullela: Bug found in ARAX Pytest https://github.com/RTXteam/RTX/blob/4c776b4a27d96e1462173abe7963792608d8b879/code/ARAX/test/test_ARAX_expand.py#L247-L257C1
Example:
{'nodes': {'n00': {'ids': ['MONDO:0001280', 'MONDO:0008542', 'MONDO:0005036'], 'categories': ['biolink:DiseaseOrPhenotypicFeature', 'biolink:PhenotypicFeature', 'biolink:Disease'], 'is_set': True, 'constraints': [], 'option_group_id': None}, 'n01': {'ids': None, 'categories': ['biolink:DiseaseOrPhenotypicFeature', 'biolink:PhenotypicFeature', 'biolink:Disease'], 'is_set': True, 'constraints': [], 'option_group_id': None}}, 'edges': {'e00': {'knowledge_type': None, 'predicates': ['biolink:has_phenotype'], 'subject': 'n00', 'object': 'n01', 'attribute_constraints': [], 'qualifier_constraints': [], 'exclude': None, 'option_group_id': None}}}