With PR #211, we now include taxa in the clique and synonym information. In order to fully implement #155, we will need to refactor compendium and synonym files to support multiple properties side-by-side.
Compendium files
I propose that Compendium files should have a `p` slot for each identifier and a `properties` slot for the entire clique. Here is an example:
{
  "type": "biolink:Gene",
  "identifiers": [
    { "i": "NCBIGene:5367", "l": "PMCH", "p": { "in_taxon": ["NCBITaxon:9606"] }},
    { "i": "ENSEMBL:ENSG00000183395", "p": {}},
    { "i": "HGNC:9109", "l": "PMCH", "p": {}},
    { "i": "OMIM:176795", "p": { "description": "The melanin-concentrating hormone (MCH) is a cyclic neuropeptide isolated initially from salmon pituitary gland and later from rat hypothalamus. In mammals, MCH perikarya are confined largely to the lateral hypothalamus and zona incerta area with extensive neuronal projections throughout the brain, including the neurohypophysis. The anatomic distribution suggests a neurotransmitter or neuromodulator role for MCH in a broad array of neuronal functions directed toward the regulation of goal-directed behavior, such as food intake, and general arousal. MCH and 2 other putative neuropeptides, NEI and NGE, are encoded by the same precursor and appear colocalized in nerve cells and in many instances within the projections. The precursor is designated pro-melanin-concentrating hormone (PMCH) (summary by [Nahon et al., 1992](https://www.omim.org/entry/176795#3))." }},
    { "i": "UMLS:C1418669", "l": "PMCH gene", "p": {}}
  ],
  "properties": {
    "description": "The melanin-concentrating hormone (MCH) is a cyclic neuropeptide isolated initially from salmon pituitary gland and later from rat hypothalamus. In mammals, MCH perikarya are confined largely to the lateral hypothalamus and zona incerta area with extensive neuronal projections throughout the brain, including the neurohypophysis. The anatomic distribution suggests a neurotransmitter or neuromodulator role for MCH in a broad array of neuronal functions directed toward the regulation of goal-directed behavior, such as food intake, and general arousal. MCH and 2 other putative neuropeptides, NEI and NGE, are encoded by the same precursor and appear colocalized in nerve cells and in many instances within the projections. The precursor is designated pro-melanin-concentrating hormone (PMCH) (summary by [Nahon et al., 1992](https://www.omim.org/entry/176795#3)).",
    "in_taxon": ["NCBITaxon:9606"],
    "information_content": "100"
  }
}
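To make the proposed layout concrete, here is a minimal Python sketch of how a consumer might read such a record. It assumes that each record is stored as one JSON object per line and that the first identifier is the clique's primary identifier; neither assumption is stated in this proposal, and `summarize_clique` is a hypothetical helper name.

```python
import json

# Shortened copy of the example record above, serialized as one line of JSON.
LINE = json.dumps({
    "type": "biolink:Gene",
    "identifiers": [
        {"i": "NCBIGene:5367", "l": "PMCH", "p": {"in_taxon": ["NCBITaxon:9606"]}},
        {"i": "ENSEMBL:ENSG00000183395", "p": {}},
        {"i": "HGNC:9109", "l": "PMCH", "p": {}},
    ],
    "properties": {"in_taxon": ["NCBITaxon:9606"], "information_content": "100"},
})


def summarize_clique(line):
    """Split a compendium record into clique-level and per-identifier properties."""
    record = json.loads(line)
    identifiers = record["identifiers"]
    return {
        # Assumption: the first identifier is the clique's primary identifier.
        "primary_id": identifiers[0]["i"],
        "clique_properties": record.get("properties", {}),
        # Keep only the identifiers whose "p" slot is non-empty.
        "identifier_properties": {
            entry["i"]: entry["p"] for entry in identifiers if entry.get("p")
        },
    }


summary = summarize_clique(LINE)
print(summary["primary_id"])
print(summary["identifier_properties"])
```

Keeping the per-identifier `p` slots alongside the clique-level `properties` lets a reader see which source identifier contributed a given value.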
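The property keys should be documented in Babel, but each property should be mappable to an RDF property for exports:
- `description` = `biolink:description`, for a description of the concept.
- `in_taxon` = `biolink:in_taxon`, to indicate the taxa that the concept is found in.
- `information_content` = `vocab:normalizedInformationContent`, as the information content of the clique.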
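As a rough illustration of what "mappable to an RDF property" could mean in practice, the sketch below expands a properties dictionary into subject/predicate/object tuples using the three proposed mappings (`description`, `in_taxon`, `information_content`). The function name `properties_to_triples` and the choice to skip unmapped keys are assumptions made for this example, not decisions taken in this issue.

```python
# Property key -> RDF property, as proposed in this issue.
PROPERTY_TO_RDF = {
    "description": "biolink:description",
    "in_taxon": "biolink:in_taxon",
    "information_content": "vocab:normalizedInformationContent",
}


def properties_to_triples(subject, properties):
    """Expand a properties dict into (subject, predicate, object) tuples."""
    triples = []
    for key, value in properties.items():
        predicate = PROPERTY_TO_RDF.get(key)
        if predicate is None:
            # Assumption: keys without a documented mapping are skipped.
            continue
        # Multi-valued properties (lists) produce one triple per value.
        values = value if isinstance(value, list) else [value]
        for item in values:
            triples.append((subject, predicate, item))
    return triples


triples = properties_to_triples(
    "NCBIGene:5367",
    {"in_taxon": ["NCBITaxon:9606"], "information_content": "100", "unmapped": 1},
)
for triple in triples:
    print(triple)
```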
In Redis
We currently store descriptions and information content in ideqids. We should move them into a separate database that stores the properties as a JSON object indexed by the primary identifier. That will let us look the properties up quickly once we have resolved an identifier, so that we can return them to the user.
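The lookup described here might look roughly like the sketch below. To keep it runnable without a server, it uses a small in-memory stand-in (`DictStore`, a hypothetical class) that exposes only `get` and `set`; a redis-py client offers methods with the same names, so it could presumably be passed in instead. The helper names `save_properties` and `load_properties` are also assumptions.

```python
import json


class DictStore:
    """In-memory stand-in exposing the get/set subset of a Redis client."""

    def __init__(self):
        self._data = {}

    def set(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)


def save_properties(store, primary_id, properties):
    # One JSON document per clique, keyed by the primary identifier.
    store.set(primary_id, json.dumps(properties))


def load_properties(store, primary_id):
    raw = store.get(primary_id)
    # A missing key means no properties were recorded for this clique.
    return json.loads(raw) if raw is not None else {}


store = DictStore()
save_properties(
    store,
    "NCBIGene:5367",
    {"in_taxon": ["NCBITaxon:9606"], "information_content": "100"},
)
print(load_properties(store, "NCBIGene:5367"))
print(load_properties(store, "NCBIGene:0"))
```

Storing the whole dictionary under one key means a single read per resolved identifier, at the cost of rewriting the full JSON document whenever one property changes.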
Synonym files
Similarly, synonym files should represent properties in the same way as Compendium files: a single `properties` key holding a dictionary of key/value pairs. Since synonym files drop the per-identifier information within a clique, we only store clique-level properties here.
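A minimal sketch of deriving a synonym record from a compendium record under this proposal follows. The output field names `curie` and `names`, the helper name `to_synonym_record`, and the choice of the first identifier as the record's CURIE are illustrative assumptions; only the single `properties` key comes from the proposal itself.

```python
def to_synonym_record(compendium_record):
    """Build a synonym-style record that keeps only clique-level properties."""
    identifiers = compendium_record["identifiers"]
    names = []
    for entry in identifiers:
        label = entry.get("l")
        # Collect each distinct label once, in order of first appearance.
        if label and label not in names:
            names.append(label)
    return {
        # Assumption: the first identifier is the clique's primary identifier.
        "curie": identifiers[0]["i"],
        "names": names,
        # Per-identifier "p" slots are dropped; clique-level properties stay.
        "properties": dict(compendium_record.get("properties", {})),
    }


compendium_record = {
    "type": "biolink:Gene",
    "identifiers": [
        {"i": "NCBIGene:5367", "l": "PMCH", "p": {"in_taxon": ["NCBITaxon:9606"]}},
        {"i": "HGNC:9109", "l": "PMCH", "p": {}},
        {"i": "UMLS:C1418669", "l": "PMCH gene", "p": {}},
    ],
    "properties": {"in_taxon": ["NCBITaxon:9606"], "information_content": "100"},
}
synonym_record = to_synonym_record(compendium_record)
print(synonym_record)
```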
- [ ] Need to figure out if we can index `properties.curie_suffix` in Solr in the same way that we currently index `curie_suffix`.
Other exports
- In KGX, we can export this as node properties (e.g. `{"id": "NCBIGene:5367", "name": "PMCH", "category": "biolink:Gene", "equivalent_identifiers": ["NCBIGene:5367", ...], "in_taxa": ["NCBITaxon:9606"], ...}`).
- We will need to handle properties that don't have Biolink equivalents, like information content.
- In SSSOM, this information can be stored as pipe-delimited key-value pairs in the `other` slot.
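Both exports can be sketched in a few lines of Python. Several details here are assumptions rather than settled design: the `in_taxon` to `in_taxa` rename simply follows the KGX example given in this issue, properties without a Biolink equivalent are copied through under their own key, and the SSSOM string uses `=` between key and value and `,` between list items with no escaping of `|` inside values. The function names are hypothetical.

```python
# Assumption: follows the KGX example in this issue, which uses "in_taxa".
KGX_KEY_OVERRIDES = {"in_taxon": "in_taxa"}


def to_kgx_node(primary_id, name, category, equivalent_ids, properties):
    """Flatten clique-level properties into a KGX-style node dictionary."""
    node = {
        "id": primary_id,
        "name": name,
        "category": category,
        "equivalent_identifiers": list(equivalent_ids),
    }
    for key, value in properties.items():
        # Keys without an override (e.g. information_content) pass through as-is.
        node[KGX_KEY_OVERRIDES.get(key, key)] = value
    return node


def to_sssom_other(properties):
    """Serialize properties as pipe-delimited key=value pairs."""
    parts = []
    for key, value in properties.items():
        if isinstance(value, list):
            text = ",".join(str(item) for item in value)
        else:
            text = str(value)
        parts.append(f"{key}={text}")
    return "|".join(parts)


properties = {"in_taxon": ["NCBITaxon:9606"], "information_content": "100"}
node = to_kgx_node(
    "NCBIGene:5367", "PMCH", "biolink:Gene",
    ["NCBIGene:5367", "HGNC:9109"], properties,
)
print(node)
print(to_sssom_other(properties))
```

Long free-text values such as descriptions may contain `|` or `=`, so a real exporter would need an escaping rule or would have to leave such properties out of the `other` slot.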