Gene Disease associations

andrewsu commented 4 years ago

This is likely a very naive question... I can imagine a few ways that genes could be related to diseases. For example: Gene mutated_in Disease, Gene overexpressed_in Disease, Gene underexpressed_in Disease. Would those types of relationships be in scope of the Biolink model? If so, would they theoretically be added as children of https://biolink.github.io/biolink-model/docs/GeneToDiseaseAssociation.html or somewhere else?

mbrush commented 4 years ago

Hi Andrew, and thanks for moving this question over from our Slack chat - I think it speaks to a common question about which relationships to create as Associations vs Slots, and how these are related.

First, I do think that the types of relationships you describe would be in scope for the Biolink Model (BLM). W.r.t. where/how to create them in the BLM, a distinction to keep in mind is that between an Association (aka an edge) and a Slot (aka a property).

Associations in the BLM are relationship reifications that provide a schema for representing the semantics of what is stated as true in the association, and also the evidence/provenance (E/P) supporting it. Associations are instantiated in the data (e.g. as an edge in a KG) to capture an assertion of domain knowledge, and the E/P supporting it. The schema for a GeneToDiseaseAssociation is here.

Slots/properties, by contrast, are simply named predicates used to capture properties of an Entity or an Association. They are used to capture the semantics of the relationship asserted in an Association instance.

I think for your example, we would want to create new Biolink slots for each gene-disease relationship we think might be useful to describe in a GeneToDiseaseAssociation instance. These slots would descend from the root BLM slot/property 'related to'. More specifically, they might be direct children of the 'gene associated with condition' slot, which is itself a child of 'related to'. Also, these BLM slots could be mapped where possible to properties in the Relation Ontology that may already cover some of the relationships you have in mind (e.g. here).
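A rough sketch of the slot hierarchy described here, written as a plain Python parent map rather than as actual BLM YAML. Only 'related to' and 'gene associated with condition' come from the model as discussed; the three leaf slot names are hypothetical proposals, and the helper functions are illustrative only.

```python
# Hypothetical slot hierarchy: child slot -> parent slot.
# 'related to' and 'gene associated with condition' are named in the
# discussion; the three leaf slots are illustrative proposals only.
SLOT_PARENT = {
    "gene associated with condition": "related to",
    "gene mutated in disease": "gene associated with condition",
    "gene overexpressed in disease": "gene associated with condition",
    "gene underexpressed in disease": "gene associated with condition",
}


def ancestors(slot: str) -> list:
    """Return the chain of parent slots, nearest first, up to the root."""
    chain = []
    while slot in SLOT_PARENT:
        slot = SLOT_PARENT[slot]
        chain.append(slot)
    return chain


def is_a(slot: str, ancestor: str) -> bool:
    """True if `slot` is `ancestor` itself or descends from it."""
    return slot == ancestor or ancestor in ancestors(slot)
```

With this map, `ancestors("gene mutated in disease")` walks up through 'gene associated with condition' to 'related to', so a query for the generic slot can still find edges that use a more specific child.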

Then, in the context of a GeneToDiseaseAssociation instance, one of these slots/properties (e.g. 'gene_mutated_in_disease') would be used as the value of the Association's 'relation' slot - to indicate that this is the specific type of gene-disease relationship being asserted. For example:

ex:associationinstance001
  - association_type: biolink:GeneToDiseaseAssociation
  - subject: ex:gene001
  - relation: 'gene_mutated_in_disease'
  - object: ex:disease001

When useful, a more generic Association type such as GeneToDiseaseAssociation can be specialized to provide a schema for representing a more specific type of gene-disease relationship (e.g. https://biolink.github.io/biolink-model/docs/GeneHasVariantThatContributesToDiseaseAssociation.html). But in general I think the granularity that BLM advocates for Association type specialization is not as fine as that for slots. The idea is that Associations define general categories of associations (e.g. GeneToDiseaseAssociation), and the 'relation' slot of such associations can take one of possibly many slots/properties that describe the more specific relationship asserted in the association instance (e.g. gene_mutations_causal_for, gene_overexpressed_in_disease, gene_mutations_increase_susceptibility_to, etc.).
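To make the split between a general Association type and a specific 'relation' value concrete, here is a minimal Python sketch. It is not the BLM's own generated code: the class, the set of allowed relation names, and the validation step are all assumptions for illustration, reusing the identifiers from the example instance above.

```python
from dataclasses import dataclass

# Hypothetical relation slots allowed on a gene-to-disease association.
# None of these names is confirmed to exist in the BLM.
GENE_DISEASE_RELATIONS = {
    "gene_mutated_in_disease",
    "gene_overexpressed_in_disease",
    "gene_underexpressed_in_disease",
}


@dataclass
class GeneToDiseaseAssociation:
    """One general association type; the specifics live in `relation`."""

    subject: str
    relation: str
    object: str
    association_type: str = "biolink:GeneToDiseaseAssociation"

    def __post_init__(self):
        # Reject relation values outside the (assumed) allowed set.
        if self.relation not in GENE_DISEASE_RELATIONS:
            raise ValueError(f"unsupported relation: {self.relation!r}")


edge = GeneToDiseaseAssociation(
    subject="ex:gene001",
    relation="gene_mutated_in_disease",
    object="ex:disease001",
)
```

The design point is that adding a new kind of gene-disease relationship only means adding a name to the relation set, not defining a new association class.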

andrewsu commented 4 years ago

Hi Matt, thanks, very helpful! I think I understand more, but will cogitate more and probably follow up with more questions.

But before that, I have a very practical suggestion for your consideration. I think for many people (including some in Translator), the somewhat theoretical explanation above might still pose a mental roadblock to adopting BLM. I think if some people can't find the right relationship type to use in their graph right off the bat, then they're going to put BLM on the back burner. So my suggestion would be for the BLM team to more proactively create the most common relationships that are likely to be used (in Translator, for example).

As one possible model for how to do this, the table below shows all 49 "meta-edges" from SemMedDB that are used at least 100k times. What about prospectively evaluating these for inclusion in BLM? (Seems like it might be a good "Segment 1" activity for the Standards team for the current FOA?) Or do many/most of the meta-edges below already exist and I'm just missing them?

start_label	type	end_label	count
Anatomy	LOCATION_OF	Chemicals & Drugs	9E+05
Anatomy	LOCATION_OF	Disorders	5E+05
Anatomy	LOCATION_OF	Genes & Molecular Sequences	4E+05
Anatomy	PART_OF	Anatomy	2E+05
Anatomy	PART_OF	Living Beings	2E+05
Anatomy	PRODUCES	Chemicals & Drugs	1E+05
Chemicals & Drugs	AFFECTS	Anatomy	2E+05
Chemicals & Drugs	AFFECTS	Disorders	2E+05
Chemicals & Drugs	AFFECTS	Physiology	4E+05
Chemicals & Drugs	ASSOCIATED_WITH	Disorders	4E+05
Chemicals & Drugs	AUGMENTS	Anatomy	2E+05
Chemicals & Drugs	AUGMENTS	Disorders	1E+05
Chemicals & Drugs	AUGMENTS	Physiology	2E+05
Chemicals & Drugs	CAUSES	Disorders	3E+05
Chemicals & Drugs	COEXISTS_WITH	Chemicals & Drugs	6E+05
Chemicals & Drugs	COEXISTS_WITH	Genes & Molecular Sequences	3E+05
Chemicals & Drugs	DISRUPTS	Physiology	2E+05
Chemicals & Drugs	INHIBITS	Chemicals & Drugs	5E+05
Chemicals & Drugs	INHIBITS	Genes & Molecular Sequences	2E+05
Chemicals & Drugs	INTERACTS_WITH	Chemicals & Drugs	1E+06
Chemicals & Drugs	ISA	Chemicals & Drugs	1E+05
Chemicals & Drugs	PART_OF	Anatomy	3E+05
Chemicals & Drugs	PART_OF	Chemicals & Drugs	1E+05
Chemicals & Drugs	PART_OF	Genes & Molecular Sequences	1E+05
Chemicals & Drugs	PART_OF	Living Beings	2E+05
Chemicals & Drugs	PREVENTS	Disorders	1E+05
Chemicals & Drugs	STIMULATES	Chemicals & Drugs	5E+05
Chemicals & Drugs	TREATS	Disorders	5E+05
Chemicals & Drugs	TREATS	Living Beings	1E+05
Chemicals & Drugs	compared_with	Chemicals & Drugs	2E+05
Disorders	CAUSES	Disorders	2E+05
Disorders	COEXISTS_WITH	Disorders	6E+05
Disorders	PROCESS_OF	Living Beings	6E+05
Genes & Molecular Sequences	AFFECTS	Physiology	2E+05
Genes & Molecular Sequences	ASSOCIATED_WITH	Disorders	3E+05
Genes & Molecular Sequences	COEXISTS_WITH	Genes & Molecular Sequences	1E+05
Genes & Molecular Sequences	INTERACTS_WITH	Chemicals & Drugs	4E+05
Genes & Molecular Sequences	INTERACTS_WITH	Genes & Molecular Sequences	2E+05
Genes & Molecular Sequences	PART_OF	Anatomy	2E+05
Genes & Molecular Sequences	PART_OF	Living Beings	1E+05
Genes & Molecular Sequences	STIMULATES	Chemicals & Drugs	3E+05
Genes & Molecular Sequences	STIMULATES	Genes & Molecular Sequences	1E+05
Living Beings	LOCATION_OF	Chemicals & Drugs	4E+05
Living Beings	LOCATION_OF	Genes & Molecular Sequences	2E+05
Physiology	PROCESS_OF	Living Beings	1E+05
Procedures	DIAGNOSES	Disorders	2E+05
Procedures	METHOD_OF	Procedures	2E+05
Procedures	TREATS	Disorders	4E+05
Procedures	USES	Chemicals & Drugs	3E+05

mbrush commented 4 years ago

Hi Andrew. I agree that in general we should come up with some documentation aimed at helping specific users (e.g. Translator KG developers) navigate the BLM to find the best slots/properties to use in their KGs (and match these slots with the most appropriate association type that provides a schema/context in which to use this slot to capture domain knowledge). I think there are features of the BLM that may help address some of your concerns, but they may not be well advertised or presented in the documentation in the most accessible way.

For example, the BLM does provide an in_subset slot that can be used with the value "translator_minimal" to tag slots for use by Translator e.g. here. There are over 100 slots currently tagged in this way in the yaml file - but I don’t think there is a great interface for BLM users to see this.
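As a stopgap for the missing interface, the tagged slots could be listed with a few lines of code. The sketch below assumes the slot definitions have already been loaded from the yaml file into a dict; the three entries and their tags are made up for illustration and are not copied from the model.

```python
# Hypothetical excerpt of slot definitions, shaped like a parsed YAML mapping.
# The slot names and their in_subset tags here are illustrative assumptions.
SLOTS = {
    "related to": {"in_subset": ["translator_minimal"]},
    "interacts with": {"is_a": "related to", "in_subset": ["translator_minimal"]},
    "some internal slot": {"is_a": "related to"},
}


def slots_in_subset(slots: dict, subset: str) -> list:
    """Return the names of all slots tagged with `subset`, sorted by name."""
    return sorted(
        name
        for name, definition in slots.items()
        if subset in definition.get("in_subset", [])
    )


translator_slots = slots_in_subset(SLOTS, "translator_minimal")
```

Slots without an `in_subset` key are simply skipped, so the same function works on the full slot list.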

As for the SemMedDB predicate set you provide, a quick check of BLM mappings revealed that nearly all of these (the exceptions being AUGMENTS, DIAGNOSES, METHOD_OF, COMPARED_WITH) are currently mapped to an existing BLM slot (e.g. the mapping for INTERACTS_WITH is here). At present we use a fairly generic mappings slot to capture these, but we are working toward providing a larger set of mappings slots with more precise semantics (e.g. exact, narrower, broader matches).
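The kind of check described here can be sketched as an inverted index over the mappings. Everything in the sample data is an assumption: the "SEMMEDDB:" CURIE prefix, the three slot names, and their mappings stand in for whatever the yaml file actually contains.

```python
# Hypothetical excerpt: BLM slot -> external predicates listed under `mappings`.
# The prefix and entries are illustrative, not copied from the model.
SLOT_MAPPINGS = {
    "interacts with": ["SEMMEDDB:INTERACTS_WITH"],
    "treats": ["SEMMEDDB:TREATS"],
    "causes": ["SEMMEDDB:CAUSES"],
}


def invert_mappings(slot_mappings: dict) -> dict:
    """Build an index from external predicate CURIE to BLM slot name."""
    index = {}
    for slot, external_predicates in slot_mappings.items():
        for curie in external_predicates:
            index[curie] = slot
    return index


def find_unmapped(predicates: list, index: dict, prefix: str = "SEMMEDDB:") -> list:
    """Return the predicates that have no BLM slot in the index."""
    return [p for p in predicates if prefix + p not in index]


index = invert_mappings(SLOT_MAPPINGS)
unmapped = find_unmapped(["TREATS", "AUGMENTS", "CAUSES", "DIAGNOSES"], index)
```

Run against the full predicate list, `find_unmapped` would return the predicates that still need a slot or a mapping.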

Finally, I have yet to look at the different activities proposed in the Translator FOA, but agree that these types of enhancements to the BLM and its documentation should be pursued. Accounts of challenges and feedback that you and others provide will be very useful to inform these efforts. Thanks!

andrewsu commented 4 years ago

Ahh right, thanks for the reminder that those mappings exist under related to. That also clarifies in my mind your previous comment/suggestion about gene associated with condition. It also reminds me of this other related ticket, https://github.com/biolink/biolink-model/issues/251, which proposes exactly this: improving how the BLM subsets can be viewed.

I think I also got a bit distracted by steering this down the semmeddb path (which you point out is already reasonably well covered). My original motivation for this ticket was in the context of this nascent database of drug mechanisms (https://zenodo.org/record/3515487#.XbcvNehKguV and https://github.com/SuLab/indication_moa_db/tree/1.0, created by @mmayers12). I've pasted in the table of the most common meta-edges in that resource. At some point in the next few months, can we get your help figuring out which of these relations already exists in BLM and which need to be created? We will be expanding this resource over time, and would like to make sure we're aligned with BLM before we go too much further...

metaedge	count
Drug - INHIBITS - Protein	67
Protein - INVOLVED_IN - Biological Process	38
Biological Process - CAUSES - Disease	29
Taxon - CAUSES - Disease	26
Drug - ACTIVATES - Protein	24
Biological Process - REQUIRED_FOR - Taxon	24
Protein - UP_REGULATES - Biological Process	14
Biological Process - ELEVATED_IN - Disease	11
Biological Process - DISRUPTED_IN - Disease	11
Protein - PRODUCES - Compound Class	9
Protein - DOWN_REGULATES - Protein	9
Compound - PART_OF - Biological Process	8
Biological Process - ASSOCIATED_WITH - Disease	7
Protein - UP_REGULATES - Protein	7
Protein - DOWN_REGULATES - Biological Process	7
Drug - INCREASES - Compound	6
Protein - PRODUCES - Compound	5
Biological Process - REDUCES - Disease	5
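In case it helps with the comparison, here is a small sketch of how the rows above could be split into subject type, predicate, and object type and then totalled per predicate, so that each distinct predicate only needs to be checked against BLM once. The function names are hypothetical and the three sample rows are taken from the table.

```python
from collections import defaultdict


def parse_metaedge(text: str) -> tuple:
    """Split 'Subject - PREDICATE - Object' into its three parts."""
    parts = [part.strip() for part in text.split(" - ")]
    if len(parts) != 3:
        raise ValueError(f"expected 3 parts, got {len(parts)}: {text!r}")
    return tuple(parts)


def counts_by_predicate(rows: list) -> dict:
    """Sum meta-edge counts per predicate, ignoring node types."""
    totals = defaultdict(int)
    for metaedge, count in rows:
        _, predicate, _ = parse_metaedge(metaedge)
        totals[predicate] += count
    return dict(totals)


rows = [
    ("Drug - INHIBITS - Protein", 67),
    ("Protein - UP_REGULATES - Biological Process", 14),
    ("Protein - UP_REGULATES - Protein", 7),
]
totals = counts_by_predicate(rows)
```

Splitting on " - " with surrounding spaces keeps multi-word node types such as "Biological Process" and underscore-joined predicates intact.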

nlharris commented 3 years ago

Not sure if this is related to work the EPC or Predicates WGs are doing.

sierra-moxon commented 2 years ago

@andrewsu - from your last post, I think most/all of these predicates are in the model (especially with PR #844). Please reopen if I mischaracterized the doneness of this. I think we can definitely add predicates to the predicate hierarchy to help folks find things they need in the model.

biolink / biolink-model

Gene Disease associations #282