Open saramsey opened 3 years ago
Hi @saramsey, thanks for opening this issue and paying attention to it. Besides biolink:treats, do we also need to check whether other predicates have a similar problem, where the subject category or object category of a predicate doesn't fully comply with the Biolink model? Thanks!
This ties into #281 (though that is more limited in scope and will occur at a different place in the build process). Overall, it would probably be good to consider the domain and range on the different predicates in the Biolink model.
One option, which is similar to what Steve suggested, is to have a hierarchy (tree) of allowed predicates for each subject-object pairing in KG2. Based on the assigned predicate, it could then assign the most specific predicate that fits.
As of KG2.8.3, there are 1554 subject-object category pairs and 10439 subject-predicate-object pairings.
Here are the results of running these queries on KG2.8.3:
match (n)-[]->(m) return distinct n.category, m.category
match (n)-[e]->(m) return distinct n.category, e.predicate, m.category
Attachment: edge_category_predicate_pairings.csv
Hi Chunyu, in general, I think the answer is "yes": there are many predicates in Biolink for which there are restrictions on the allowed categories for subject and/or object. I think, in fact, it's the dominant case. I suspect most predicates in the biolink-model.yaml file have domain and/or range constraints (which show up as domain or range keys under the predicate's entry in biolink-model.yaml). It's just that, for the most part, the KG2pre build system pays no attention to those constraints. Right, @acevedol and @ecwood?
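To make the domain and range keys concrete, here is a minimal sketch of how one might list the predicates that declare either key. It assumes the slots section of biolink-model.yaml has already been parsed into a dict (for example with `yaml.safe_load` from PyYAML). The `slots` dict below is a toy stand-in, and its entries and the function name `constrained_predicates` are illustrative assumptions, not copied from any Biolink release or from the KG2pre code.

```python
# Toy stand-in for the parsed "slots" section of biolink-model.yaml.
# These entries are illustrative only; the real file is far larger and
# its values may differ.
slots = {
    "related to": {"domain": "named thing", "range": "named thing"},
    "affects": {"is_a": "related to"},
    "treats": {
        "is_a": "affects",
        "domain": "chemical or drug or treatment",
        "range": "disease or phenotypic feature",
    },
}


def constrained_predicates(slots):
    """Return {slot name: (domain, range)} for slots declaring either key."""
    result = {}
    for name, spec in slots.items():
        spec = spec or {}  # a slot with no attributes may parse as None
        domain = spec.get("domain")
        range_ = spec.get("range")
        if domain is not None or range_ is not None:
            result[name] = (domain, range_)
    return result


constraints = constrained_predicates(slots)
for name, (domain, range_) in sorted(constraints.items()):
    print(f"{name}: domain={domain!r}, range={range_!r}")
```

With the real file, the slots section would likely also contain entries that are not predicates, so an actual check would probably need to filter those out first.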
Hi @saramsey, yes, the KG2pre build system currently ignores those domain and range constraints. Our Biolink validation is limited in scope overall, and especially with regard to domain and range.
Thank you for confirming, @ecwood. That's a good long-term goal for the RTX-KG2 project (to more systematically enforce those constraints).
@chunyuma reported the following issue:
Right now, semmeddb:treats is mapped to biolink:treats without regard to the subject or object categories. That will have to change, to bring KG2 into full Biolink compliance.
I imagine we could update filter_kg_and_remap_predicates.py to load the Biolink model and compile a map whose keys are combinations of subject-category, predicate, and object-category (and whose values are indexes of edges corresponding to each such combination). There may be perhaps ten thousand such combinations, but that's OK. Then, it could go through each such combination and check the Biolink rules to see if the subject category and object category are "allowed" for the predicate type. If not, the next parent predicate in the Biolink hierarchy could be checked, and so on, until (if necessary) we get to biolink:related_to, whose subject category and object category are the general-purpose named thing. A warning could be issued for each predicate that violates Biolink rules. Then go through and update any edge predicates that need to be updated.