RTXteam / RTX-KG2

Build system for the RTX-KG2 biomedical knowledge graph, part of the ARAX reasoning system (https://github.com/RTXTeam/RTX)
MIT License

Nonsensical core_predicates in KG2.8.2pre #269

Closed: amykglen closed this issue 1 year ago

amykglen commented 1 year ago

in trying to figure out which property biolink predicates are stored under in KG2.8.2pre, I found that the first edge I looked at is very strange:

[Screenshot: Screen Shot 2023-05-09 at 5 49 13 PM]

it has a core_predicate of biolink:gene_associated_with_condition but a source_predicate of biolink:same_as... that doesn't make sense.

and there are many such edges - all of these edges have a source_predicate of biolink:same_as:

match (n)-[e]->(m) where e.source_predicate="biolink:same_as" return e.core_predicate, count(distinct e) order by count(distinct e) desc
e.core_predicate count(distinct e)
"biolink:treats" 110222
"biolink:has_participant" 109731
"biolink:related_to" 48679
"biolink:gene_associated_with_condition" 28875
"biolink:located_in" 15798
"biolink:interacts_with" 952
"biolink:subclass_of" 243
"biolink:occurs_in" 34
"biolink:physically_interacts_with" 1

I haven't checked whether edges with source_predicates other than biolink:same_as have this issue, where the source_predicate conflicts with the core_predicate.
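
The same tally can be reproduced outside Neo4j. Here is a minimal Python sketch, assuming the edges are available as a list of dicts with `source_predicate` and `core_predicate` keys and that each edge appears only once in the list. The function name `tally_core_predicates` and the sample edges are made up for illustration; they are not part of the KG2 build code.

```python
from collections import Counter

def tally_core_predicates(edges, source_value="biolink:same_as"):
    """Count core_predicate values among edges with the given source_predicate."""
    counts = Counter(
        edge.get("core_predicate")
        for edge in edges
        if edge.get("source_predicate") == source_value
    )
    # most_common() sorts by descending count, like ORDER BY ... DESC in the query
    return counts.most_common()

# Hypothetical sample edges, for illustration only (not real KG2 data)
sample_edges = [
    {"source_predicate": "biolink:same_as", "core_predicate": "biolink:treats"},
    {"source_predicate": "biolink:same_as", "core_predicate": "biolink:treats"},
    {"source_predicate": "biolink:same_as", "core_predicate": "biolink:located_in"},
    {"source_predicate": "biolink:treats", "core_predicate": "biolink:affects"},
]

for core_predicate, count in tally_core_predicates(sample_edges):
    print(core_predicate, count)
```

Any core_predicate other than biolink:same_as in the result is a candidate conflict for this source_predicate.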

amykglen commented 1 year ago

this query seems to find more instances of conflicting source_predicates vs. core_predicates:

match (n)-[e]->(m) where e.source_predicate starts with "biolink:" return e.source_predicate, e.core_predicate, count(distinct e) order by count(distinct e) desc
e.source_predicate e.core_predicate count(distinct e)
"biolink:has_participant" "biolink:subclass_of" 1345388
"biolink:in_taxon" "biolink:subclass_of" 508681
"biolink:related_to" "biolink:same_as" 394196
"biolink:gene_associated_with_condition" "biolink:physically_interacts_with" 343268
"biolink:transcribed_from" "biolink:subclass_of" 269941
"biolink:same_as" "biolink:treats" 110222
"biolink:same_as" "biolink:has_participant" 109731
"biolink:same_as" "biolink:related_to" 48679
"biolink:translates_to" "biolink:subclass_of" 48376
"biolink:treats" "biolink:affects" 45164
"biolink:gene_product_of" "biolink:subclass_of" 39585
"biolink:physically_interacts_with" "biolink:subclass_of" 34629
"biolink:same_as" "biolink:gene_associated_with_condition" 28875
"biolink:related_to" "biolink:subclass_of" 22944
"biolink:same_as" "biolink:located_in" 15798
"biolink:part_of" "biolink:has_part" 7449
"biolink:physically_interacts_with" "biolink:affects" 5641
"biolink:related_to" "biolink:affects" 5614
"biolink:gene_product_of" "biolink:same_as" 2111
"biolink:gene_product_of" "biolink:occurs_in" 2059
"biolink:has_metabolite" "biolink:affects" 1680
"biolink:same_as" "biolink:interacts_with" 952
"biolink:physically_interacts_with" "biolink:physically_interacts_with" 370
"biolink:related_to" "biolink:physically_interacts_with" 370
"biolink:same_as" "biolink:subclass_of" 243
"biolink:physically_interacts_with" "biolink:regulates" 183
"biolink:related_to" "biolink:regulates" 182
"biolink:physically_interacts_with" "biolink:disrupts" 39
"biolink:related_to" "biolink:disrupts" 39
"biolink:same_as" "biolink:occurs_in" 34
"biolink:subclass_of" "biolink:has_part" 29
"biolink:physically_interacts_with" "biolink:related_to" 8
"biolink:related_to" "biolink:related_to" 8
"biolink:same_as" "biolink:physically_interacts_with" 1

not all of these appear conflicting, but many do. for instance:

amykglen commented 1 year ago

realized #265 is probably related (possibly due to the same problem?)

saramsey commented 1 year ago

I'm investigating, looking at artifacts on buildkg2.rtx.ai

saramsey commented 1 year ago

Confirmed, on buildkg2.rtx.ai, which is the build system for KG2.8.2pre, I am seeing a bad edge in /home/ubuntu/kg2-build/TSV/edges.tsv:

ubuntu@ip-172-31-50-177:~/kg2-build/TSV$ grep 'DrugCentral:4423---biolink:same_as---None---None---None---UMLS:C3652692---DrugCentral:' edges.tsv
biolink:gene_associated_with_condition  DrugCentral:4423---biolink:same_as---None---None---None---UMLS:C3652692---DrugCentral:  False   UMLS:C3652692   same_as infores:drugcentral     {}              same_as biolink:same_as DrugCentral:4423    2022-08-22 13:25:53.607 biolink:same_as DrugCentral:4423    UMLS:C3652692
ubuntu@ip-172-31-50-177:~/kg2-build/TSV$ cat edges_header.tsv
core_predicate  id  negated :END_ID predicate_label primary_knowledge_source    publications:string[]   publications_info   qualified_object_aspect qualified_object_direction  qualified_predicate relation_label  source_predicate    :START_ID   update_date predicate:TYPE  subject object

saramsey commented 1 year ago

Running this command now, to peek at that edge in the kg2-simplified.json file:

jq . kg2-simplified.json | grep -C 100 'DrugCentral:4423---biolink:same_as---None---None---None---UMLS:C3652692---DrugCentral:' > badedge.txt

saramsey commented 1 year ago

Per this resource, the edge property that we intended to call core_predicate should evidently be called predicate: https://github.com/biolink/biolink-model/blob/master/guidelines/association-examples-with-qualifiers.md

(as was speculated during the AHM today)

saramsey commented 1 year ago

Filtering to find a specific edge object in kg2-simplified.json with a specific combination of subject and object CURIEs:

jq '.edges|map(select(.subject=="DrugCentral:4423" and .object=="UMLS:C3652692"))' kg2-simplified.json > test.json

saramsey commented 1 year ago

So, it appears that this issue also affected KG2.8.1pre:

[Screenshot: Screen Shot 2023-05-10 at 11 37 17 AM]

saramsey commented 1 year ago

The issue is in fact upstream of the TSV export step. Running this command:

jq '.edges|map(select(.subject=="DrugCentral:4423" and .object=="UMLS:C3652692"))' kg2-simplified.json > test.json

produces:

cat test.json
[
  {
    "subject": "DrugCentral:4423",
    "object": "UMLS:C3652692",
    "relation_label": "same_as",
    "source_predicate": "biolink:same_as",
    "qualified_predicate": null,
    "qualified_object_aspect": null,
    "qualified_object_direction": null,
    "negated": false,
    "publications": [],
    "publications_info": {},
    "update_date": "2022-08-22 13:25:53.607",
    "id": "DrugCentral:4423---biolink:same_as---None---None---None---UMLS:C3652692---DrugCentral:",
    "core_predicate": "biolink:gene_associated_with_condition",
    "predicate_label": "same_as",
    "primary_knowledge_source": "infores:drugcentral"
  }
]
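
For anyone without jq handy, the same selection can be sketched in Python. This assumes the file has a top-level "edges" array, as the jq filter above implies. The helper name `find_edges` is hypothetical, and the small in-memory KG below is a trimmed stand-in: the first edge copies fields from the output above, the second is made up.

```python
import json

def find_edges(kg, subject, obj):
    """Return all edges in the KG dict whose subject and object both match."""
    return [
        edge for edge in kg.get("edges", [])
        if edge.get("subject") == subject and edge.get("object") == obj
    ]

# Trimmed in-memory stand-in for kg2-simplified.json (second edge is made up)
kg = json.loads("""
{"edges": [
  {"subject": "DrugCentral:4423", "object": "UMLS:C3652692",
   "source_predicate": "biolink:same_as",
   "core_predicate": "biolink:gene_associated_with_condition"},
  {"subject": "EX:1", "object": "EX:2",
   "source_predicate": "biolink:treats",
   "core_predicate": "biolink:treats"}
]}
""")

matches = find_edges(kg, "DrugCentral:4423", "UMLS:C3652692")
print(json.dumps(matches, indent=2))
```

Against the real file, `kg` would come from `json.load()` on an open file handle, which reads the whole graph into memory.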

saramsey commented 1 year ago

This command:

jq . kg2-drugcentral.json  | grep -C 20 'DrugCentral:4423---biolink:same_as---None---None---None---UMLS:C3652692---DrugCentral:'

returns:

    {
      "subject": "DrugCentral:4423",
      "object": "UMLS:C3652692",
      "relation_label": "same_as",
      "source_predicate": "biolink:same_as",
      "qualified_predicate": null,
      "qualified_object_aspect": null,
      "qualified_object_direction": null,
      "negated": false,
      "publications": [],
      "publications_info": {},
      "update_date": "2022-08-22 13:25:53.607",
      "knowledge_source": "DrugCentral:",
      "id": "DrugCentral:4423---biolink:same_as---None---None---None---UMLS:C3652692---DrugCentral:"
    },
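
Comparing this edge with the kg2-simplified.json version shown earlier makes it clearer what changes between the two files. A minimal Python sketch of that comparison follows. The helper name `diff_edge` is hypothetical, and the dicts copy only a subset of the fields from the two jq outputs in this thread; the omitted fields are identical in both.

```python
def diff_edge(before, after):
    """Report keys added, removed, or changed between two versions of an edge."""
    added = {k: after[k] for k in after.keys() - before.keys()}
    removed = {k: before[k] for k in before.keys() - after.keys()}
    changed = {
        k: (before[k], after[k])
        for k in before.keys() & after.keys()
        if before[k] != after[k]
    }
    return added, removed, changed

# Edge as it appears in kg2-drugcentral.json (subset of fields)
before = {
    "subject": "DrugCentral:4423",
    "object": "UMLS:C3652692",
    "relation_label": "same_as",
    "source_predicate": "biolink:same_as",
    "knowledge_source": "DrugCentral:",
}

# Same edge as it appears in kg2-simplified.json (subset of fields)
after = {
    "subject": "DrugCentral:4423",
    "object": "UMLS:C3652692",
    "relation_label": "same_as",
    "source_predicate": "biolink:same_as",
    "core_predicate": "biolink:gene_associated_with_condition",
    "predicate_label": "same_as",
    "primary_knowledge_source": "infores:drugcentral",
}

added, removed, changed = diff_edge(before, after)
print("added:", sorted(added))
print("removed:", sorted(removed))
print("changed:", changed)
```

The bad core_predicate shows up among the added keys, which fits the value being introduced somewhere between these two files rather than in the DrugCentral source edge.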

saramsey commented 1 year ago

At this point, @acevedol and I are pretty convinced that this problem is being introduced in the filter_kg_and_remap_predicates.py step of the build process (though we think there is also an unrelated code bug in kg_json_to_tsv.py that we discovered by investigating this issue).

amykglen commented 1 year ago

any update on this or expected timeline for fixing KG2.8.2pre, @acevedol @saramsey?

amykglen commented 1 year ago

from @acevedol at the AHM today: test builds of the fixed KG2.8.2pre have completed successfully and she kicked off a full build today

acevedol commented 1 year ago

I ran the query from above against KG2.8.4:

match (n)-[e]->(m) where e.source_predicate starts with "biolink:" return e.source_predicate, e.core_predicate, count(distinct e) order by count(distinct e) desc

e.source_predicate | e.core_predicate | count(distinct e)
-- | -- | --
"biolink:has_participant" | null | 1345388
"biolink:in_taxon" | null | 559462
"biolink:related_to" | null | 423888
"biolink:gene_associated_with_condition" | null | 342528
"biolink:same_as" | null | 310992
"biolink:transcribed_from" | null | 269941
"biolink:translates_to" | null | 48374
"biolink:treats" | null | 47697
"biolink:gene_product_of" | null | 43649
"biolink:physically_interacts_with" | null | 42187
"biolink:part_of" | null | 7662
"biolink:causes" | null | 6724
"biolink:has_metabolite" | null | 1680
"biolink:subclass_of" | null | 29

I am not sure if the null core predicates are expected behavior.

acevedol commented 1 year ago

I forgot to change "core_predicate" to "predicate" in the query to reflect the current KG2. The corrected query results are:

e.source_predicate | e.predicate | count(distinct e)
-- | -- | --
"biolink:has_participant" | "biolink:has_participant" | 1345388
"biolink:in_taxon" | "biolink:subclass_of" | 559462
"biolink:related_to" | "biolink:related_to" | 423888
"biolink:gene_associated_with_condition" | "biolink:gene_associated_with_condition" | 342528
"biolink:same_as" | "biolink:same_as" | 310992
"biolink:transcribed_from" | "biolink:transcribed_from" | 269941
"biolink:translates_to" | "biolink:translates_to" | 48374
"biolink:treats" | "biolink:treats" | 47697
"biolink:gene_product_of" | "biolink:gene_product_of" | 43649
"biolink:physically_interacts_with" | "biolink:physically_interacts_with" | 42187
"biolink:part_of" | "biolink:has_part" | 7662
"biolink:causes" | "biolink:causes" | 6724
"biolink:has_metabolite" | "biolink:has_metabolite" | 1680
"biolink:subclass_of" | "biolink:has_part" | 29

Which doesn't appear to have any conflicts. Closing this issue.
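
For future builds, a check like this could be scripted rather than read off the table. Below is a minimal Python sketch, with the rows copied from the corrected query results above; the helper name `differing_rows` is hypothetical. A row where the two predicates differ is not necessarily a conflict, since the remapping step may change a predicate on purpose, so this only narrows down which rows to review by hand.

```python
# (source_predicate, predicate, count) rows copied from the corrected results above
rows = [
    ("biolink:has_participant", "biolink:has_participant", 1345388),
    ("biolink:in_taxon", "biolink:subclass_of", 559462),
    ("biolink:related_to", "biolink:related_to", 423888),
    ("biolink:gene_associated_with_condition",
     "biolink:gene_associated_with_condition", 342528),
    ("biolink:same_as", "biolink:same_as", 310992),
    ("biolink:transcribed_from", "biolink:transcribed_from", 269941),
    ("biolink:translates_to", "biolink:translates_to", 48374),
    ("biolink:treats", "biolink:treats", 47697),
    ("biolink:gene_product_of", "biolink:gene_product_of", 43649),
    ("biolink:physically_interacts_with",
     "biolink:physically_interacts_with", 42187),
    ("biolink:part_of", "biolink:has_part", 7662),
    ("biolink:causes", "biolink:causes", 6724),
    ("biolink:has_metabolite", "biolink:has_metabolite", 1680),
    ("biolink:subclass_of", "biolink:has_part", 29),
]

def differing_rows(rows):
    """Return rows whose source predicate and remapped predicate differ."""
    return [row for row in rows if row[0] != row[1]]

for source_predicate, predicate, count in differing_rows(rows):
    print(source_predicate, "->", predicate, count)
```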