biolink / biolink-model

Schema and generated objects for biolink data model and upper ontology
https://biolink.github.io/biolink-model/

Gene Disease associations #282

Closed andrewsu closed 2 years ago

andrewsu commented 4 years ago

This is likely a very naive question... I can imagine a few ways that genes could be related to diseases. For example: Gene mutated_in Disease, Gene overexpressed_in Disease, Gene underexpressed_in Disease. Would those types of relationships be in scope of the Biolink model? If so, would they theoretically be added as children of https://biolink.github.io/biolink-model/docs/GeneToDiseaseAssociation.html or somewhere else?

mbrush commented 4 years ago

Hi Andrew, and thanks for moving this question over from our Slack chat, as I think it speaks to a common question about what relationships to create as Associations vs Slots, and how these are related.

First, I do think that the types of relationships you describe would be in scope for the Biolink Model (BLM). W.r.t. where/how to create them in the BLM, a distinction to keep in mind is that between an Association (aka an edge) and a Slot (aka a property).

Associations in the BLM are relationship reifications that provide a schema for representing the semantics of what is stated as true in the association, and also the evidence and provenance (E/P) supporting it. Associations are instantiated in the data (e.g. an edge in a KG) to capture an assertion of domain knowledge, and the E/P supporting it. The schema for a GeneToDiseaseAssociation is here.

Slots/properties, by contrast, are simply named predicates used to capture properties of an Entity or an Association. They are used to capture the semantics of the relationship asserted in an Association instance.

I think for your example, we would want to create new Biolink slots for each gene-disease relationship we think might be useful to describe in a GeneToDiseaseAssociation instance. These slots would descend from the root BLM slot/property 'related to'. More specifically, they might be direct children of the 'gene associated with condition' slot, which is itself a child of 'related to'. Also, these BLM slots could be mapped where possible to properties in the Relation Ontology that may already cover some of the relationships you have in mind (e.g. here).

Then, in the context of a GeneToDiseaseAssociation instance, one of these slots/properties (e.g. 'gene_mutated_in_disease') would be used as the value of the Association's 'relation' slot, to indicate that this is the specific type of gene-disease relationship being asserted. For example:
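As a rough illustration of the slot hierarchy described here, the following Python sketch models slots as a child-to-parent map and walks up to the root. Only 'related to' and 'gene associated with condition' come from this comment; the two leaf slot names are hypothetical, and the dictionary is a stand-in rather than the model's actual representation.

```python
from typing import Dict, List, Optional

# Child slot -> parent slot. Only 'related to' and 'gene associated with
# condition' come from the discussion; the two leaf slots are hypothetical
# names used for illustration.
SLOT_PARENTS: Dict[str, Optional[str]] = {
    "related to": None,
    "gene associated with condition": "related to",
    "gene mutated in disease": "gene associated with condition",
    "gene overexpressed in disease": "gene associated with condition",
}


def ancestors(slot: str) -> List[str]:
    """Return the chain of parents from `slot` up to the root slot."""
    chain: List[str] = []
    parent = SLOT_PARENTS[slot]
    while parent is not None:
        chain.append(parent)
        parent = SLOT_PARENTS[parent]
    return chain


print(ancestors("gene mutated in disease"))
# ['gene associated with condition', 'related to']
```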

ex:associationinstance001
  - association_type: biolink:GeneToDiseaseAssociation
  - subject: ex:gene001
  - relation: 'gene_mutated_in_disease'
  - object: ex:disease001
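
The same instance could be held in code roughly as follows. This is a minimal sketch, assuming a hand-rolled `Association` dataclass (a hypothetical name, not a class generated from the Biolink Model) whose fields mirror the four slots shown above.

```python
from dataclasses import dataclass


@dataclass
class Association:
    """A reified edge: the assertion type plus its subject, relation, object."""
    id: str
    association_type: str
    subject: str
    relation: str
    object: str


# Values copied from the example instance; the class itself is an assumption.
edge = Association(
    id="ex:associationinstance001",
    association_type="biolink:GeneToDiseaseAssociation",
    subject="ex:gene001",
    relation="gene_mutated_in_disease",
    object="ex:disease001",
)

print(edge.subject, edge.relation, edge.object)
# ex:gene001 gene_mutated_in_disease ex:disease001
```

The association type stays generic, and the `relation` value carries the specific gene-disease relationship being asserted.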

When useful, a more generic Association type such as GeneToDiseaseAssociation can be specialized to provide a schema for representing a more specific type of gene-disease relationship (e.g. https://biolink.github.io/biolink-model/docs/GeneHasVariantThatContributesToDiseaseAssociation.html). But in general I think the granularity that BLM advocates for Association type specialization is not as fine as that for slots. The idea is that Associations define general categories of associations (e.g. GeneToDiseaseAssociation), and the 'relation' slot of such associations can take one of possibly many slots/properties that describe a more specific relationship asserted in the association instance (e.g. gene_mutations_causal_for, gene_overexpressed_in_disease, gene_mutations_increase_susceptibility_to, etc.).

andrewsu commented 4 years ago

Hi Matt, thanks, very helpful! I think I understand more, but will cogitate more and probably follow up with more questions.

But before that, I have a very practical suggestion for your consideration. I think for many people (including some in Translator), the somewhat theoretical explanation above might still pose a mental roadblock to adopting BLM. I think if some people can't find the right relationship type to use in their graph right off the bat, then they're going to put BLM on the back burner. So my suggestion would be for the BLM team to more proactively create the most common relationships that are likely to be used (in Translator, for example).

As one possible model for how to do this, the table below shows all 49 "meta-edges" from SemMedDB that are used at least 100k times. What about prospectively evaluating these for inclusion in BLM? (Seems like it might be a good "Segment 1" activity for the Standards team for the current FOA?) Or do many/most of the meta-edges below already exist and I'm just missing them?

| start_label | type | end_label | count |
|---|---|---|---|
| Anatomy | LOCATION_OF | Chemicals & Drugs | 9E+05 |
| Anatomy | LOCATION_OF | Disorders | 5E+05 |
| Anatomy | LOCATION_OF | Genes & Molecular Sequences | 4E+05 |
| Anatomy | PART_OF | Anatomy | 2E+05 |
| Anatomy | PART_OF | Living Beings | 2E+05 |
| Anatomy | PRODUCES | Chemicals & Drugs | 1E+05 |
| Chemicals & Drugs | AFFECTS | Anatomy | 2E+05 |
| Chemicals & Drugs | AFFECTS | Disorders | 2E+05 |
| Chemicals & Drugs | AFFECTS | Physiology | 4E+05 |
| Chemicals & Drugs | ASSOCIATED_WITH | Disorders | 4E+05 |
| Chemicals & Drugs | AUGMENTS | Anatomy | 2E+05 |
| Chemicals & Drugs | AUGMENTS | Disorders | 1E+05 |
| Chemicals & Drugs | AUGMENTS | Physiology | 2E+05 |
| Chemicals & Drugs | CAUSES | Disorders | 3E+05 |
| Chemicals & Drugs | COEXISTS_WITH | Chemicals & Drugs | 6E+05 |
| Chemicals & Drugs | COEXISTS_WITH | Genes & Molecular Sequences | 3E+05 |
| Chemicals & Drugs | DISRUPTS | Physiology | 2E+05 |
| Chemicals & Drugs | INHIBITS | Chemicals & Drugs | 5E+05 |
| Chemicals & Drugs | INHIBITS | Genes & Molecular Sequences | 2E+05 |
| Chemicals & Drugs | INTERACTS_WITH | Chemicals & Drugs | 1E+06 |
| Chemicals & Drugs | ISA | Chemicals & Drugs | 1E+05 |
| Chemicals & Drugs | PART_OF | Anatomy | 3E+05 |
| Chemicals & Drugs | PART_OF | Chemicals & Drugs | 1E+05 |
| Chemicals & Drugs | PART_OF | Genes & Molecular Sequences | 1E+05 |
| Chemicals & Drugs | PART_OF | Living Beings | 2E+05 |
| Chemicals & Drugs | PREVENTS | Disorders | 1E+05 |
| Chemicals & Drugs | STIMULATES | Chemicals & Drugs | 5E+05 |
| Chemicals & Drugs | TREATS | Disorders | 5E+05 |
| Chemicals & Drugs | TREATS | Living Beings | 1E+05 |
| Chemicals & Drugs | compared_with | Chemicals & Drugs | 2E+05 |
| Disorders | CAUSES | Disorders | 2E+05 |
| Disorders | COEXISTS_WITH | Disorders | 6E+05 |
| Disorders | PROCESS_OF | Living Beings | 6E+05 |
| Genes & Molecular Sequences | AFFECTS | Physiology | 2E+05 |
| Genes & Molecular Sequences | ASSOCIATED_WITH | Disorders | 3E+05 |
| Genes & Molecular Sequences | COEXISTS_WITH | Genes & Molecular Sequences | 1E+05 |
| Genes & Molecular Sequences | INTERACTS_WITH | Chemicals & Drugs | 4E+05 |
| Genes & Molecular Sequences | INTERACTS_WITH | Genes & Molecular Sequences | 2E+05 |
| Genes & Molecular Sequences | PART_OF | Anatomy | 2E+05 |
| Genes & Molecular Sequences | PART_OF | Living Beings | 1E+05 |
| Genes & Molecular Sequences | STIMULATES | Chemicals & Drugs | 3E+05 |
| Genes & Molecular Sequences | STIMULATES | Genes & Molecular Sequences | 1E+05 |
| Living Beings | LOCATION_OF | Chemicals & Drugs | 4E+05 |
| Living Beings | LOCATION_OF | Genes & Molecular Sequences | 2E+05 |
| Physiology | PROCESS_OF | Living Beings | 1E+05 |
| Procedures | DIAGNOSES | Disorders | 2E+05 |
| Procedures | METHOD_OF | Procedures | 2E+05 |
| Procedures | TREATS | Disorders | 4E+05 |
| Procedures | USES | Chemicals & Drugs | 3E+05 |

mbrush commented 4 years ago

Hi Andrew. I agree that in general we should come up with some documentation aimed at helping specific users (e.g. Translator KG developers) navigate the BLM to find the best slots/properties to use in their KGs (and match these slots with the most appropriate association type that provides a schema/context in which to use this slot to capture domain knowledge). I think there are features of the BLM that may help address some of your concerns, but they may not be well advertised or presented in documentation in the most accessible way.

For example, the BLM does provide an in_subset slot that can be used with the value "translator_minimal" to tag slots for use by Translator e.g. here. There are over 100 slots currently tagged in this way in the yaml file - but I don’t think there is a great interface for BLM users to see this.
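
To make the subset tagging concrete, here is a small Python sketch that lists the slots carrying a given `in_subset` value. The `slots` dictionary is a hand-written stand-in for the `slots` section of the yaml file; the slot names and tags in it are illustrative assumptions, not copied from the model, and `slots_in_subset` is a hypothetical helper.

```python
from typing import Dict, List

# Hand-written stand-in for the `slots` section of the model's yaml file.
# Slot names and tags here are illustrative, not copied from the model.
slots: Dict[str, dict] = {
    "related to": {"in_subset": ["translator_minimal"]},
    "interacts with": {"is_a": "related to", "in_subset": ["translator_minimal"]},
    "some untagged slot": {"is_a": "related to"},
}


def slots_in_subset(slot_defs: Dict[str, dict], subset: str) -> List[str]:
    """Return, in sorted order, the names of slots tagged with `subset`."""
    return sorted(
        name
        for name, spec in slot_defs.items()
        if subset in spec.get("in_subset", [])
    )


print(slots_in_subset(slots, "translator_minimal"))
# ['interacts with', 'related to']
```

A listing like this is one possible way to give users the view of tagged slots that the yaml file alone does not offer.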

As for the SemMedDB predicate set you provide, a quick check of BLM mappings revealed that nearly all of these (the exceptions being AUGMENTS, DIAGNOSES, METHOD_OF, COMPARED_WITH) are currently mapped to an existing BLM slot (e.g. the mapping for INTERACTS_WITH is here). At present we use a fairly generic mappings slot to capture these, but we are working toward providing a larger set of mappings slots with more precise semantics (e.g. exact, narrower, broader matches).
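
A check of this kind can be sketched in Python by inverting the generic `mappings` lists into a lookup from external predicate to BLM slot. Everything in the `slots` dictionary below (slot names, the `SEMMEDDB:` prefix, and which predicates are listed) is an assumption made for illustration, and `build_mapping_index` is a hypothetical helper.

```python
from typing import Dict, Optional

# Illustrative slot definitions with a generic `mappings` list. The slot
# names and mapped identifiers are assumptions, not model content.
slots: Dict[str, dict] = {
    "interacts with": {"mappings": ["SEMMEDDB:INTERACTS_WITH"]},
    "treats": {"mappings": ["SEMMEDDB:TREATS"]},
    "related to": {},
}


def build_mapping_index(slot_defs: Dict[str, dict]) -> Dict[str, str]:
    """Invert slot -> mappings into mapped identifier -> slot name."""
    index: Dict[str, str] = {}
    for name, spec in slot_defs.items():
        for mapped in spec.get("mappings", []):
            index[mapped] = name
    return index


index = build_mapping_index(slots)

# Predicates with no entry come back as None, i.e. candidates for new slots.
for predicate in ["INTERACTS_WITH", "AUGMENTS"]:
    slot: Optional[str] = index.get(f"SEMMEDDB:{predicate}")
    print(predicate, "->", slot)
```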

Finally, I have yet to look at the different activities proposed in the Translator FOA, but agree that these types of enhancements to the BLM and its documentation should be pursued. Accounts of challenges and feedback that you and others provide will be very useful to inform these efforts. Thanks!

andrewsu commented 4 years ago

Ahh right, thanks for the reminder that those mappings exist under 'related to'. That also clarifies in my mind your previous comment/suggestion about 'gene associated with condition'. It also reminds me of this other related ticket (https://github.com/biolink/biolink-model/issues/251), which proposes exactly this: improving how the BLM subsets can be viewed.

I think I also got a bit distracted by steering this down the SemMedDB path (which you point out is already reasonably well covered). My original motivation for this ticket was in the context of this nascent database of drug mechanisms (https://zenodo.org/record/3515487#.XbcvNehKguV and https://github.com/SuLab/indication_moa_db/tree/1.0, created by @mmayers12). I've pasted in the table of the most common meta-edges in that resource. At some point in the next few months, can we get your help figuring out which of these relations already exist in BLM and which need to be created? We will be expanding this resource over time, and would like to make sure we're aligned with BLM before we go too much further...

| metaedge | count |
|---|---|
| Drug - INHIBITS - Protein | 67 |
| Protein - INVOLVED_IN - Biological Process | 38 |
| Biological Process - CAUSES - Disease | 29 |
| Taxon - CAUSES - Disease | 26 |
| Drug - ACTIVATES - Protein | 24 |
| Biological Process - REQUIRED_FOR - Taxon | 24 |
| Protein - UP_REGULATES - Biological Process | 14 |
| Biological Process - ELEVATED_IN - Disease | 11 |
| Biological Process - DISRUPTED_IN - Disease | 11 |
| Protein - PRODUCES - Compound Class | 9 |
| Protein - DOWN_REGULATES - Protein | 9 |
| Compound - PART_OF - Biological Process | 8 |
| Biological Process - ASSOCIATED_WITH - Disease | 7 |
| Protein - UP_REGULATES - Protein | 7 |
| Protein - DOWN_REGULATES - Biological Process | 7 |
| Drug - INCREASES - Compound | 6 |
| Protein - PRODUCES - Compound | 5 |
| Biological Process - REDUCES - Disease | 5 |

nlharris commented 3 years ago

not sure if this is related to work the EPC or Predicates WGs are doing

sierra-moxon commented 2 years ago

@andrewsu - from your last post, I think most/all of these predicates are in the model (especially with PR #844). Please reopen if I mischaracterized the doneness of this. I think we can definitely add predicates to the predicate hierarchy to help folks find things they need in the model.