Closed amykglen closed 1 year ago
this query seems to find more instances of conflicting source_predicates vs. core_predicates:
match (n)-[e]->(m) where e.source_predicate starts with "biolink:" return e.source_predicate, e.core_predicate, count(distinct e) order by count(distinct e) desc
e.source_predicate | e.core_predicate | count(distinct e) |
---|---|---|
"biolink:has_participant" | "biolink:subclass_of" | 1345388 |
"biolink:in_taxon" | "biolink:subclass_of" | 508681 |
"biolink:related_to" | "biolink:same_as" | 394196 |
"biolink:gene_associated_with_condition" | "biolink:physically_interacts_with" | 343268 |
"biolink:transcribed_from" | "biolink:subclass_of" | 269941 |
"biolink:same_as" | "biolink:treats" | 110222 |
"biolink:same_as" | "biolink:has_participant" | 109731 |
"biolink:same_as" | "biolink:related_to" | 48679 |
"biolink:translates_to" | "biolink:subclass_of" | 48376 |
"biolink:treats" | "biolink:affects" | 45164 |
"biolink:gene_product_of" | "biolink:subclass_of" | 39585 |
"biolink:physically_interacts_with" | "biolink:subclass_of" | 34629 |
"biolink:same_as" | "biolink:gene_associated_with_condition" | 28875 |
"biolink:related_to" | "biolink:subclass_of" | 22944 |
"biolink:same_as" | "biolink:located_in" | 15798 |
"biolink:part_of" | "biolink:has_part" | 7449 |
"biolink:physically_interacts_with" | "biolink:affects" | 5641 |
"biolink:related_to" | "biolink:affects" | 5614 |
"biolink:gene_product_of" | "biolink:same_as" | 2111 |
"biolink:gene_product_of" | "biolink:occurs_in" | 2059 |
"biolink:has_metabolite" | "biolink:affects" | 1680 |
"biolink:same_as" | "biolink:interacts_with" | 952 |
"biolink:physically_interacts_with" | "biolink:physically_interacts_with" | 370 |
"biolink:related_to" | "biolink:physically_interacts_with" | 370 |
"biolink:same_as" | "biolink:subclass_of" | 243 |
"biolink:physically_interacts_with" | "biolink:regulates" | 183 |
"biolink:related_to" | "biolink:regulates" | 182 |
"biolink:physically_interacts_with" | "biolink:disrupts" | 39 |
"biolink:related_to" | "biolink:disrupts" | 39 |
"biolink:same_as" | "biolink:occurs_in" | 34 |
"biolink:subclass_of" | "biolink:has_part" | 29 |
"biolink:physically_interacts_with" | "biolink:related_to" | 8 |
"biolink:related_to" | "biolink:related_to" | 8 |
"biolink:same_as" | "biolink:physically_interacts_with" | 1 |
not all of these appear conflicting, but many do. for instance:
realized #265 is probably related (possibly due to the same problem?)
I'm investigating, looking at artifacts on buildkg2.rtx.ai
Confirmed: on buildkg2.rtx.ai, which is the build system for KG2.8.2pre, I am seeing a bad edge in /home/ubuntu/kg2-build/TSV/edges.tsv:
ubuntu@ip-172-31-50-177:~/kg2-build/TSV$ grep 'DrugCentral:4423---biolink:same_as---None---None---None---UMLS:C3652692---DrugCentral:' edges.tsv
biolink:gene_associated_with_condition DrugCentral:4423---biolink:same_as---None---None---None---UMLS:C3652692---DrugCentral: False UMLS:C3652692 same_as infores:drugcentral {} same_as biolink:same_as DrugCentral:4423 2022-08-22 13:25:53.607 biolink:same_as DrugCentral:4423 UMLS:C3652692
ubuntu@ip-172-31-50-177:~/kg2-build/TSV$ cat edges_header.tsv
core_predicate id negated :END_ID predicate_label primary_knowledge_source publications:string[] publications_info qualified_object_aspect qualified_object_direction qualified_predicate relation_label source_predicate :START_ID update_date predicate:TYPE subject object
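The grep hit is hard to read without its column names, so here is a minimal Python sketch that pairs the header with the row. It assumes the file is tab-separated and that the columns missing from the pasted line (publications and the three qualifier columns) are empty strings; that reconstruction is an assumption, not something shown in the terminal output. The helper name `label_row` is hypothetical.

```python
import csv
import io

# Header as printed by `cat edges_header.tsv` (assumed tab-separated on disk).
HEADER = "\t".join([
    "core_predicate", "id", "negated", ":END_ID", "predicate_label",
    "primary_knowledge_source", "publications:string[]", "publications_info",
    "qualified_object_aspect", "qualified_object_direction",
    "qualified_predicate", "relation_label", "source_predicate", ":START_ID",
    "update_date", "predicate:TYPE", "subject", "object",
])

EDGE_ID = ("DrugCentral:4423---biolink:same_as---None---None---None---"
           "UMLS:C3652692---DrugCentral:")

# The grep hit from above. ASSUMPTION: the four "" entries stand for columns
# that were empty in the file and therefore invisible in the pasted output.
ROW = "\t".join([
    "biolink:gene_associated_with_condition", EDGE_ID, "False",
    "UMLS:C3652692", "same_as", "infores:drugcentral", "", "{}", "", "", "",
    "same_as", "biolink:same_as", "DrugCentral:4423",
    "2022-08-22 13:25:53.607", "biolink:same_as", "DrugCentral:4423",
    "UMLS:C3652692",
])


def label_row(header_line, row_line):
    """Return a dict mapping each header column to the row's value."""
    header = next(csv.reader(io.StringIO(header_line), delimiter="\t"))
    row = next(csv.reader(io.StringIO(row_line), delimiter="\t"))
    if len(header) != len(row):
        raise ValueError(f"{len(header)} columns but {len(row)} values")
    return dict(zip(header, row))


if __name__ == "__main__":
    edge = label_row(HEADER, ROW)
    for column in ("core_predicate", "source_predicate", "predicate:TYPE"):
        print(f"{column}: {edge[column]}")
```

Under that assumption, core_predicate is the only one of the three predicate columns that disagrees with biolink:same_as for this edge.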
Running this command now, to peek at that edge in the kg2-simplified.json file:
jq . kg2-simplified.json | grep -C 100 'DrugCentral:4423---biolink:same_as---None---None---None---UMLS:C3652692---DrugCentral:' > badedge.txt
Per this resource, evidently, the edge property that we were intending to call core predicate should be called predicate:
https://github.com/biolink/biolink-model/blob/master/guidelines/association-examples-with-qualifiers.md
(as was speculated during the AHM today)
Filtering to find a specific edge object in kg2-simplified.json with a specific combination of subject and object CURIEs:
jq '.edges|map(select(.subject=="DrugCentral:4423" and .object=="UMLS:C3652692"))' kg2-simplified.json > test.json
So, it appears that this issue also affected KG2.8.1pre:
The issue is in fact upstream of the TSV export step. Running this command:
jq '.edges|map(select(.subject=="DrugCentral:4423" and .object=="UMLS:C3652692"))' kg2-simplified.json > test.json
produces:
cat test.json
[
{
"subject": "DrugCentral:4423",
"object": "UMLS:C3652692",
"relation_label": "same_as",
"source_predicate": "biolink:same_as",
"qualified_predicate": null,
"qualified_object_aspect": null,
"qualified_object_direction": null,
"negated": false,
"publications": [],
"publications_info": {},
"update_date": "2022-08-22 13:25:53.607",
"id": "DrugCentral:4423---biolink:same_as---None---None---None---UMLS:C3652692---DrugCentral:",
"core_predicate": "biolink:gene_associated_with_condition",
"predicate_label": "same_as",
"primary_knowledge_source": "infores:drugcentral"
}
]
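For anyone without jq at hand, the same subject/object filter can be sketched in Python. This is only an illustration: the function names `find_edges` and `predicates_differ` are hypothetical, the sample edge is trimmed to four fields from the output above, and loading the real kg2-simplified.json with `json.load` is assumed to fit in memory on the build machine.

```python
import json

# The edge printed above, trimmed to the fields this check needs.
SAMPLE_KG = {
    "edges": [
        {
            "subject": "DrugCentral:4423",
            "object": "UMLS:C3652692",
            "source_predicate": "biolink:same_as",
            "core_predicate": "biolink:gene_associated_with_condition",
        }
    ]
}


def find_edges(kg, subject, obj):
    """Return all edges in kg["edges"] with the given subject and object CURIEs."""
    return [
        edge for edge in kg.get("edges", [])
        if edge.get("subject") == subject and edge.get("object") == obj
    ]


def predicates_differ(edge, mapped_key="core_predicate"):
    """True when the source predicate and the mapped predicate are not equal."""
    return edge.get("source_predicate") != edge.get(mapped_key)


if __name__ == "__main__":
    # For the real file (assumption: enough RAM to hold it):
    #   with open("kg2-simplified.json") as f:
    #       kg = json.load(f)
    for edge in find_edges(SAMPLE_KG, "DrugCentral:4423", "UMLS:C3652692"):
        print(json.dumps(edge, indent=2))
        print("predicates differ:", predicates_differ(edge))
```

A difference alone is not proof of a bug, since some predicates are remapped on purpose, but a same_as edge mapped to gene_associated_with_condition is clearly wrong.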
This command:
jq . kg2-drugcentral.json | grep -C 20 'DrugCentral:4423---biolink:same_as---None---None---None---UMLS:C3652692---DrugCentral:'
returns:
{
"subject": "DrugCentral:4423",
"object": "UMLS:C3652692",
"relation_label": "same_as",
"source_predicate": "biolink:same_as",
"qualified_predicate": null,
"qualified_object_aspect": null,
"qualified_object_direction": null,
"negated": false,
"publications": [],
"publications_info": {},
"update_date": "2022-08-22 13:25:53.607",
"knowledge_source": "DrugCentral:",
"id": "DrugCentral:4423---biolink:same_as---None---None---None---UMLS:C3652692---DrugCentral:"
},
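To make the comparison between the two files explicit, here is a small Python sketch that diffs the two edge records shown above. The field values are copied from the two outputs; the helper name `diff_keys` is hypothetical and this is not part of the KG2 build code.

```python
# Fields that appear with identical values in both outputs above.
COMMON = {
    "subject": "DrugCentral:4423",
    "object": "UMLS:C3652692",
    "relation_label": "same_as",
    "source_predicate": "biolink:same_as",
    "qualified_predicate": None,
    "qualified_object_aspect": None,
    "qualified_object_direction": None,
    "negated": False,
    "publications": [],
    "publications_info": {},
    "update_date": "2022-08-22 13:25:53.607",
    "id": ("DrugCentral:4423---biolink:same_as---None---None---None---"
           "UMLS:C3652692---DrugCentral:"),
}

# Edge as it appears in kg2-drugcentral.json.
drugcentral_edge = {**COMMON, "knowledge_source": "DrugCentral:"}

# Edge as it appears in kg2-simplified.json.
simplified_edge = {
    **COMMON,
    "core_predicate": "biolink:gene_associated_with_condition",
    "predicate_label": "same_as",
    "primary_knowledge_source": "infores:drugcentral",
}


def diff_keys(before, after):
    """Return (added, removed, changed) property names between two edge dicts."""
    added = sorted(set(after) - set(before))
    removed = sorted(set(before) - set(after))
    changed = sorted(k for k in set(before) & set(after) if before[k] != after[k])
    return added, removed, changed


if __name__ == "__main__":
    added, removed, changed = diff_keys(drugcentral_edge, simplified_edge)
    print("added:  ", added)
    print("removed:", removed)
    print("changed:", changed)
```

The bad core_predicate value is among the properties that exist only in the simplified record, so it must be written somewhere between the two files.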
At this point, @acevedol and I are pretty convinced that this problem is being introduced in the filter_kg_and_remap_predicates.py step of the build process (though we think there is also an unrelated code bug in kg_json_to_tsv.py that we discovered by investigating this issue).
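A check like the Cypher query at the top of this issue could also be run directly on the edge list before anything is loaded into Neo4j. The sketch below is a rough Python equivalent under some assumptions: edges are plain dicts, each list entry is counted once (standing in for count(distinct e)), and the function name `count_predicate_pairs` and the three demo edges are hypothetical.

```python
from collections import Counter


def count_predicate_pairs(edges, mapped_key="core_predicate", prefix="biolink:"):
    """Count (source_predicate, mapped predicate) pairs, most frequent first.

    Only edges whose source_predicate starts with `prefix` are counted,
    mirroring the WHERE clause of the Cypher query.
    """
    counts = Counter()
    for edge in edges:
        source = edge.get("source_predicate") or ""
        if source.startswith(prefix):
            counts[(source, edge.get(mapped_key))] += 1
    return counts.most_common()


if __name__ == "__main__":
    # Hypothetical demo edges, not real KG2 data.
    demo_edges = [
        {"source_predicate": "biolink:same_as", "core_predicate": "biolink:treats"},
        {"source_predicate": "biolink:same_as", "core_predicate": "biolink:treats"},
        {"source_predicate": "biolink:part_of", "core_predicate": "biolink:has_part"},
        {"source_predicate": "RO:0002200", "core_predicate": "biolink:has_phenotype"},
    ]
    for (source, mapped), n in count_predicate_pairs(demo_edges):
        print(f"{source} | {mapped} | {n}")
```

Passing `mapped_key="predicate"` would cover builds where the property has been renamed.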
any update on this or expected timeline for fixing KG2.8.2pre, @acevedol @saramsey?
from @acevedol at the AHM today: test builds of the fixed KG2.8.2pre have completed successfully and she kicked off a full build today
I ran the query from above against kg2.8.4
match (n)-[e]->(m) where e.source_predicate starts with "biolink:" return e.source_predicate, e.core_predicate, count(distinct e) order by count(distinct e) desc
I am not sure if the null core predicates are expected behavior.
I forgot to correct the "core predicate" to "predicate" in the query to reflect the current KG2. The query results are
e.source_predicate | e.predicate | count(distinct e) |
---|---|---|
"biolink:has_participant" | "biolink:has_participant" | 1345388 |
"biolink:in_taxon" | "biolink:subclass_of" | 559462 |
"biolink:related_to" | "biolink:related_to" | 423888 |
"biolink:gene_associated_with_condition" | "biolink:gene_associated_with_condition" | 342528 |
"biolink:same_as" | "biolink:same_as" | 310992 |
"biolink:transcribed_from" | "biolink:transcribed_from" | 269941 |
"biolink:translates_to" | "biolink:translates_to" | 48374 |
"biolink:treats" | "biolink:treats" | 47697 |
"biolink:gene_product_of" | "biolink:gene_product_of" | 43649 |
"biolink:physically_interacts_with" | "biolink:physically_interacts_with" | 42187 |
"biolink:part_of" | "biolink:has_part" | 7662 |
"biolink:causes" | "biolink:causes" | 6724 |
"biolink:has_metabolite" | "biolink:has_metabolite" | 1680 |
"biolink:subclass_of" | "biolink:has_part" | 29 |
Which doesn't appear to have any conflicts. Closing this issue.
in trying to figure out what property biolink predicates are stored under in KG2.8.2pre, the first edge I looked at is very strange:
it has a core_predicate of biolink:gene_associated_with_condition but a source_predicate of biolink:same_as ... that doesn't make sense.
and there are many such edges - all of these edges have a source_predicate of biolink:same_as:
I haven't checked whether edges with source_predicates other than biolink:same_as have this issue, where the source_predicate conflicts with the core_predicate.