biolink / biolink-model

Schema and generated objects for biolink data model and upper ontology
https://biolink.github.io/biolink-model/

Located in / Part of Predicates #53

Closed stuppie closed 4 years ago

stuppie commented 6 years ago

Can we add located in and part of?

located in: http://purl.obolibrary.org/obo/RO_0001025
part of: http://purl.obolibrary.org/obo/BFO_0000050

I'm not quite sure how these should be structured, that is, how they relate to each other and to the existing predicates in biolink (such as coexists with).

mbrush commented 6 years ago

yes, these are slated to go in with a batch of new predicates informed by integrating/aligning with several new knowledge sources. you can see what some of these are here: https://docs.google.com/document/d/1FeJRSUWSg5NzmBy4w3Fn8U7xaUwcqFtJSCEbdkG0hnY/edit#heading=h.1cqyiry28ssz

cmungall commented 6 years ago

Some things to keep in mind

In RO, located-in is strictly IC-IC (not sure what happened to the domain/range axioms). Not sure we want to bring in these very restrictive BFO usages. I estimate people will want a broad relation that can be used for protein-subcellular, disease-anatomy, pathway-cell, etc. How does wikidata do this?

part-of is either continuant-continuant (c-c) or occurrent-occurrent (o-o). I think it's ok to keep this.

stuppie commented 6 years ago

The properties are here: location (P276) and part of (P361).

I think the only usages of these in Wikidata for biomedical items are for GO cellular components. Example: mitochondrial ribosome -> part of -> mitochondrial matrix

mbrush commented 6 years ago

My concern with location and part relations, if not tightly specified, is that the same data types will get captured using different relations.

For example, I see that the 'location' predicate in WD is used to link diseases to the anatomical entities they affect. In other sources (e.g., SemMedDB), this type of association may be captured using predicates like part_of or affects, which would get mapped to different BLM predicates than the 'location' predicate.

I suspect this type of problem will be common in building an uber-KG, given the heterogeneity in and under-specification of predicates across sources. It may be worth considering cases where important associations could be subject to this type of inconsistency, and identifying ways of preventing this. If we rely only on predicate mappings, we are likely to encounter these types of problems, which could have significant negative consequences.
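To make the concern concrete, here is a minimal Python sketch of a context-aware mapping, where the target predicate depends on the subject and object categories as well as the source predicate. Everything in it is an assumption for illustration: the category labels, the rule table, and the choice of `affects` as the shared target are placeholders, not actual BLM mappings.

```python
from typing import NamedTuple, Optional


class Statement(NamedTuple):
    subject_category: str
    source_predicate: str
    object_category: str


# Hypothetical context-aware rules: (subject category, source predicate,
# object category) -> one shared target predicate. Names are placeholders.
CONTEXT_RULES = {
    ("disease", "location", "anatomical entity"): "affects",
    ("disease", "part_of", "anatomical entity"): "affects",
    ("disease", "affects", "anatomical entity"): "affects",
}

# Hypothetical context-naive fallback: source predicate -> target predicate.
NAIVE_MAPPING = {
    "location": "located_in",
    "part_of": "part_of",
    "affects": "affects",
}


def map_predicate(stmt: Statement) -> Optional[str]:
    """Prefer a context-aware rule; fall back to the naive mapping."""
    key = (stmt.subject_category, stmt.source_predicate, stmt.object_category)
    if key in CONTEXT_RULES:
        return CONTEXT_RULES[key]
    # Returns None when the source predicate is unknown.
    return NAIVE_MAPPING.get(stmt.source_predicate)
```

With rules like these, a disease-anatomy statement lands on the same target predicate whether the source called it location, part_of, or affects, while statements outside the listed contexts still get the plain predicate mapping.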

mbrush commented 6 years ago

A related issue here for @RichardBruskiewich - querying SemMedDB derived statements in the KBA I see use of the WD:location property that seems to be in the wrong direction. WD:'location' means 'is_located_in' - so it is used to link the smaller object to the larger object in or on which it is located.
But in the KBA I see statements like "serum - location - EGFR gene", where WD:location is interpreted as "is_location_of". The correct statement should be "EGFR gene - location - serum".
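A rough Python sketch of what a direction fix could look like, assuming the inversion is systematic and that we have some set of terms known to act as containers. The triple layout, the function name, and the `container_terms` set are hypothetical; this is not how the KBA stores statements.

```python
from typing import Iterable, List, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def fix_location_direction(
    triples: Iterable[Triple], container_terms: Set[str]
) -> List[Triple]:
    """Swap subject and object of 'location' triples that point the wrong way.

    Assumption: a term in container_terms can only be the place something is
    located in, so it should appear as the object of 'location'.
    """
    fixed = []
    for subj, pred, obj in triples:
        inverted = (
            pred == "location"
            and subj in container_terms
            and obj not in container_terms
        )
        # Only flip when the subject is a known container and the object is not.
        fixed.append((obj, pred, subj) if inverted else (subj, pred, obj))
    return fixed
```

For example, `fix_location_direction([("serum", "location", "EGFR gene")], {"serum"})` yields `[("EGFR gene", "location", "serum")]`. It only corrects direction; it does not address whether location is the right predicate at all.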

Besides this point, the statement that a gene is located in serum is strange, as serum lacks cells or DNA. Perhaps what is meant is that the gene product is found there? But this may be another example speaking to my point above, as there may be statements about gene expression in tissues from SemMedDB that are structured using a location predicate, whereas other sources would use expressed_in. Clearly this is problematic from an integration perspective, and would not be addressed by the context-naive mappings we have been making so far.

RichardBruskiewich commented 6 years ago

Hi Matt,

Thank you for picking this up.

I've quietly asked Ben Good about this, since he curated these Wikidata property mappings to SemMedDb statements in 2016. I guess we should see if this is a systematic inconsistency in predicate intent across the whole data set.

It could also conceivably be an inconsistency or limitation in the original SemMedDb mappings, although I think that SemMedDb is text mined in a fairly mechanistic fashion from the literature. I wonder if the literature is correctly parsed all the time. BTW, for a given statement, you can look at the "evidence", which is the set of original PubMed citations. Scrutinizing some of these may help.

I don't know if Andrew's team are encountering similar issues with their effort to digest SemMedDb. It might be interesting to compare notes.

As for the EGFR example, that certainly does look weird. We should check the associated PubMed citation to see what may be going on.

If the "location" error is systematic, I'd guess that it would be feasible to fix it in RKB with some creative Cypher. If there is a deeper semantic issue (location versus expression) and some weird mappings, it will be harder to fix (except on a case-by-case basis).

All that said, given that our Knowledge.Bio 3.0 (pre-NCATS) digest of SemMedDb is out of date (SemMedDb June 2016), derived, and was mainly spun off as a proof-of-concept Beacon, it is tempting to consider replacing it with a more up-to-date, cleanly curated SemMedDb dataset sometime soon.

fractaler commented 6 years ago

"mitochondrial ribosome -> part of -> mitochondrial matrix": maybe "mitochondrial ribosome" is a "mitochondrial matrix component"?

cmungall commented 6 years ago

@fractaler this touches on deductive ontology reasoning (see primer). Sometimes you need to follow the is_a relationships to get the inherited properties for any given node. For example, in MONDO we have 'peroxisomal disease -[disrupts]-> peroxisome', but there is no direct link from Zellweger to peroxisome; you need to traverse up the class hierarchy to collect all properties (this is a simplification, but it gets you most of the way there). Perhaps some kind of ontology reasoning service would be useful here @balhoff
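The traversal can be sketched in a few lines of Python. This is a toy under stated assumptions: the two dictionaries are hand-written stand-ins for an ontology, the direct Zellweger-to-peroxisomal-disease link is a simplification, and the function name is made up. A real reasoner does considerably more than this.

```python
from typing import Dict, List, Set, Tuple

Property = Tuple[str, str]  # (predicate, target)

# Toy is_a hierarchy (child -> parents); simplified for illustration.
IS_A: Dict[str, List[str]] = {
    "Zellweger syndrome": ["peroxisomal disease"],
}

# Relationships asserted directly on a class.
ASSERTED: Dict[str, Set[Property]] = {
    "peroxisomal disease": {("disrupts", "peroxisome")},
}


def inherited_properties(
    term: str,
    is_a: Dict[str, List[str]],
    asserted: Dict[str, Set[Property]],
) -> Set[Property]:
    """Collect properties asserted on a term and on all of its ancestors."""
    props: Set[Property] = set()
    seen: Set[str] = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # guard against revisiting shared ancestors or cycles
        seen.add(current)
        props |= asserted.get(current, set())
        stack.extend(is_a.get(current, []))
    return props
```

Here `inherited_properties("Zellweger syndrome", IS_A, ASSERTED)` picks up `("disrupts", "peroxisome")` from the parent class even though nothing is asserted on Zellweger syndrome itself.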

fractaler commented 6 years ago

@cmungall, thank you for the links and example. "Peroxisome" is a homonym: 1) normal peroxisome, 2) disrupted peroxisome. "Peroxisomal disease" is also a homonym: 1) "peroxisomal disease (process)", 2) "peroxisomal disease (result)". If we proceed from the spiral world model "process -> result", then we have: "disrupted peroxisome" is the "result of process" of "peroxisomal disease (process)". "Peroxisomal disease (result)" is an "inherited metabolic disorder (result)", which is the "result of process" of "process of using DNA with damage".

RichardBruskiewich commented 4 years ago

It seems like the two originally requested biolink predicates were actually added at some point.