RTXteam / RTX-KG2

Build system for the RTX-KG2 biomedical knowledge graph, part of the ARAX reasoning system (https://github.com/RTXTeam/RTX)
MIT License

need to improve filter_kg_and_remap_predicates.py to check subject/object category rules #125

Open saramsey opened 3 years ago

saramsey commented 3 years ago

@chunyuma reported the following issue:

I have a question regarding the predicate “biolink:treats” in our KG2. I found that “biolink:treats” does not seem to be restricted to relations between “drug” entities (e.g., “biolink:Drug”, “biolink:SmallMolecule”, “biolink:ChemicalEntity”) and “disease” entities (e.g., “biolink:Disease”, “biolink:PhenotypicFeature”, “biolink:DiseaseOrPhenotypicFeature”). It may also exist between “biolink:Procedure” and “biolink:OrganismTaxon”; “biolink:Activity” and “biolink:GrossAnatomicalStructure”; “biolink:NamedThing” and “biolink:Cell”; and so on. Most of the triples containing this relation seem to come from SEMMEDDB.

So I have the following two questions regarding this observation:

  1. Can I assume that the predicate “biolink:treats” as used in our KG2 represents a general semantic meaning of “treatment”, rather than specifically the treatment of a disease by a drug?

  2. Do we have any specific rules for assigning the “biolink:treats” label to a relation when we process the raw data sources and integrate them into KG2? Or do we just depend on the original “treats” relation category used in the raw data?

Thanks in advance for your time in helping me answer these questions.

Best, Chunyu

Right now, semmeddb:treats is mapped to biolink:treats without regard to the subject or object categories. That will have to change, to bring KG2 into full Biolink compliance.

I imagine we could update filter_kg_and_remap_predicates.py to load the Biolink model and compile a map whose keys are combinations of subject category, predicate, and object category (and whose values are the indexes of the edges corresponding to each such combination). There may be perhaps ten thousand such combinations, but that's OK. Then, it could go through each combination and check the Biolink rules to see if the subject category and object category are "allowed" for the predicate type. If not, the next parent predicate in the Biolink hierarchy could be checked, and so on, until (if necessary) we get to biolink:related_to, whose subject category and object category are the general-purpose biolink:NamedThing. A warning could be issued for each combination that violates the Biolink rules. Finally, it could go through and update any edge predicates that need to be updated.
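The approach described above can be sketched roughly as follows. This is only a minimal illustration under stated assumptions, not the actual logic of filter_kg_and_remap_predicates.py: the category hierarchy, the predicate rules, the edge field names (`subject_category`, `object_category`), and the function names are all hypothetical toy stand-ins for what would be loaded from the Biolink model and the KG2 edge data.

```python
from collections import defaultdict

# Toy stand-in for the Biolink category hierarchy (child -> parent).
# Hypothetical and heavily simplified; the real hierarchy is much larger.
CATEGORY_PARENT = {
    "biolink:NamedThing": None,
    "biolink:ChemicalEntity": "biolink:NamedThing",
    "biolink:SmallMolecule": "biolink:ChemicalEntity",
    "biolink:Disease": "biolink:NamedThing",
    "biolink:Procedure": "biolink:NamedThing",
    "biolink:OrganismTaxon": "biolink:NamedThing",
}

# Toy predicate rules: predicate -> (parent predicate, domain, range).
# Illustrative values only, not copied from any Biolink release.
PREDICATE_RULES = {
    "biolink:related_to": (None, "biolink:NamedThing", "biolink:NamedThing"),
    "biolink:treats": ("biolink:related_to", "biolink:ChemicalEntity", "biolink:Disease"),
}


def is_a(category, ancestor):
    """Return True if `category` equals `ancestor` or descends from it."""
    while category is not None:
        if category == ancestor:
            return True
        category = CATEGORY_PARENT.get(category)
    return False


def most_specific_allowed(subject_cat, predicate, object_cat):
    """Walk up the predicate hierarchy until domain and range both fit."""
    while predicate is not None:
        parent, domain, range_ = PREDICATE_RULES[predicate]
        if is_a(subject_cat, domain) and is_a(object_cat, range_):
            return predicate
        predicate = parent
    raise ValueError(f"no allowed predicate for {subject_cat} / {object_cat}")


def remap_edges(edges):
    """Remap violating edge predicates in place; return warning messages."""
    # Key: (subject category, predicate, object category); value: edge indexes.
    index = defaultdict(list)
    for i, edge in enumerate(edges):
        key = (edge["subject_category"], edge["predicate"], edge["object_category"])
        index[key].append(i)

    warnings = []
    for (subj, pred, obj), edge_indexes in index.items():
        allowed = most_specific_allowed(subj, pred, obj)
        if allowed != pred:
            warnings.append(f"{subj} {pred} {obj} violates domain/range; using {allowed}")
            for i in edge_indexes:
                edges[i]["predicate"] = allowed
    return warnings
```

Because the check runs once per distinct combination rather than once per edge, the cost of consulting the model stays small even for a very large edge set.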

chunyuma commented 3 years ago

Hi @saramsey, thanks for opening this issue and paying attention to it. Besides biolink:treats, do we also need to check whether other predicates have a similar problem, where the subject category or object category of a predicate doesn't fully comply with the Biolink model? Thanks!

ecwood commented 1 year ago

This ties into #281 (though that is more limited in scope and will occur at a different place in the build process). Overall, it would probably be good to consider the domain and range on the different predicates in the Biolink model.

One option, which is similar to what Steve suggested, is to have a hierarchy (tree) of allowed predicates for each subject-object pairing in KG2. Based on the assigned predicate, it could then assign the most specific predicate that fits.

ecwood commented 1 year ago

As of KG2.8.3, there are 1554 subject-object category pairs and 10439 subject-predicate-object pairings.

Here are the results of the following queries on KG2.8.3:

match (n)-[]->(m) return distinct n.category, m.category

edge_category_pairings.csv

match (n)-[e]->(m) return distinct n.category, e.predicate, m.category

edge_category_predicate_pairings.csv
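For tracking these two counts between builds without a running Neo4j instance, the same distinct pairings could be computed from an edge list. The sketch below assumes edges are available as (subject category, predicate, object category) tuples; that input shape and the function name `count_pairings` are hypothetical.

```python
def count_pairings(edges):
    """Count distinct category pairs and distinct category-predicate triples.

    `edges` is an iterable of (subject_category, predicate, object_category)
    tuples; this input shape is an assumption made for illustration.
    """
    pairs = set()
    triples = set()
    for subject_cat, predicate, object_cat in edges:
        pairs.add((subject_cat, object_cat))
        triples.add((subject_cat, predicate, object_cat))
    return len(pairs), len(triples)
```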

saramsey commented 1 year ago

Hi @saramsey, thanks for opening this issue and paying attention to it. Besides biolink:treats, do we also need to check whether other predicates have a similar problem, where the subject category or object category of a predicate doesn't fully comply with the Biolink model? Thanks!

Hi Chunyu, in general, I think the answer is "yes": there are many predicates in Biolink for which there are restrictions on the allowed categories for the subject and/or object. I think, in fact, it's the dominant case. I suspect most predicates in the biolink-model.yaml file have domain and/or range constraints (which show up as domain or range keys under the predicate's entry in biolink-model.yaml). It's just that, for the most part, the KG2pre build system pays no attention to those constraints. Right, @acevedol and @ecwood?
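One way to test that suspicion would be to list the entries that declare a domain or range key. The sketch below works on a toy dictionary shaped like the slots section of a parsed biolink-model.yaml (for example, as returned by PyYAML's `yaml.safe_load`). The entries and their values are illustrative assumptions, not copied from a specific Biolink release, and `constrained_slots` is a hypothetical helper name. Note that in the real file not every slot is a predicate, so an actual check would also need to filter to predicates.

```python
# Toy stand-in for the "slots" section of a parsed biolink-model.yaml.
# All entries and values here are illustrative assumptions.
TOY_SLOTS = {
    "related to": {"domain": "named thing", "range": "named thing"},
    "treats": {
        "is_a": "related to",
        "domain": "chemical or drug or treatment",
        "range": "disease or phenotypic feature",
    },
    "example unconstrained slot": {"is_a": "related to"},
}


def constrained_slots(slots):
    """Return {slot name: (domain, range)} for slots declaring either key."""
    result = {}
    for name, spec in slots.items():
        spec = spec or {}  # tolerate entries with an empty body
        if "domain" in spec or "range" in spec:
            result[name] = (spec.get("domain"), spec.get("range"))
    return result
```

Comparing `len(constrained_slots(slots))` with `len(slots)` would then give a rough measure of how dominant the constrained case is.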

ecwood commented 1 year ago

Hi @saramsey, yes, the KG2pre build system currently ignores those domain and range constraints. Our Biolink validation is limited in scope overall, and especially with regard to domain and range.

saramsey commented 1 year ago

Thank you for confirming, @ecwood. That's a good long-term goal for the RTX-KG2 project (to more systematically enforce those constraints).