TranslatorSRI / Babel

Babel creates cliques of equivalent identifiers across many biomedical vocabularies.
MIT License

Refactor compendium/synonym format to support additional properties #237

Open gaurav opened 9 months ago

gaurav commented 9 months ago

With PR #211, we now include taxa in the clique and synonym information. In order to fully implement #155, we will need to refactor compendium and synonym files to support multiple properties side-by-side.

Compendium files

I propose that Compendium files should have a p slot on each identifier for identifier-level properties and a properties slot for the entire clique. Here is an example:

{
  "type": "biolink:Gene",
  "identifiers": [
    { "i": "NCBIGene:5367", "l": "PMCH", "p": { "in_taxon": ["NCBITaxon:9606"] }},
    { "i": "ENSEMBL:ENSG00000183395", "p": {}},
    { "i": "HGNC:9109", "l": "PMCH", "p": {}},
    { "i": "OMIM:176795", "p": { "description": "The melanin-concentrating hormone (MCH) is a cyclic neuropeptide isolated initially from salmon pituitary gland and later from rat hypothalamus. In mammals, MCH perikarya are confined largely to the lateral hypothalamus and zona incerta area with extensive neuronal projections throughout the brain, including the neurohypophysis. The anatomic distribution suggests a neurotransmitter or neuromodulator role for MCH in a broad array of neuronal functions directed toward the regulation of goal-directed behavior, such as food intake, and general arousal. MCH and 2 other putative neuropeptides, NEI and NGE, are encoded by the same precursor and appear colocalized in nerve cells and in many instances within the projections. The precursor is designated pro-melanin-concentrating hormone (PMCH) (summary by [Nahon et al., 1992](https://www.omim.org/entry/176795#3))." }},
    { "i": "UMLS:C1418669", "l": "PMCH gene", "p": {}}
  ],
  "properties": {
    "description": "The melanin-concentrating hormone (MCH) is a cyclic neuropeptide isolated initially from salmon pituitary gland and later from rat hypothalamus. In mammals, MCH perikarya are confined largely to the lateral hypothalamus and zona incerta area with extensive neuronal projections throughout the brain, including the neurohypophysis. The anatomic distribution suggests a neurotransmitter or neuromodulator role for MCH in a broad array of neuronal functions directed toward the regulation of goal-directed behavior, such as food intake, and general arousal. MCH and 2 other putative neuropeptides, NEI and NGE, are encoded by the same precursor and appear colocalized in nerve cells and in many instances within the projections. The precursor is designated pro-melanin-concentrating hormone (PMCH) (summary by [Nahon et al., 1992](https://www.omim.org/entry/176795#3)).",
    "in_taxon": ["NCBITaxon:9606"],
    "information_content": "100"
  }
}
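To illustrate how the two levels could relate, here is a minimal Python sketch that derives clique-level properties from the per-identifier p slots. The merge rule (union list values, keep the first scalar value seen in identifier order) and the function name merge_clique_properties are assumptions for illustration, not something this proposal specifies. Note that information_content does not come from any identifier in the example above, so it would have to be added separately.

```python
import json


def merge_clique_properties(identifiers):
    """Hypothetical merge rule: union list values across identifiers and
    keep the first scalar value seen (identifier order decides ties)."""
    merged = {}
    for ident in identifiers:
        for key, value in ident.get("p", {}).items():
            if isinstance(value, list):
                existing = merged.setdefault(key, [])
                if not isinstance(existing, list):
                    continue  # an earlier scalar value already won
                for item in value:
                    if item not in existing:
                        existing.append(item)
            else:
                merged.setdefault(key, value)
    return merged


# One compendium line in the proposed format (description abbreviated here).
line = json.dumps({
    "type": "biolink:Gene",
    "identifiers": [
        {"i": "NCBIGene:5367", "l": "PMCH", "p": {"in_taxon": ["NCBITaxon:9606"]}},
        {"i": "ENSEMBL:ENSG00000183395", "p": {}},
        {"i": "HGNC:9109", "l": "PMCH", "p": {}},
        {"i": "OMIM:176795", "p": {"description": "The melanin-concentrating hormone (MCH) is a cyclic neuropeptide ..."}},
        {"i": "UMLS:C1418669", "l": "PMCH gene", "p": {}},
    ],
})

clique = json.loads(line)
clique_properties = merge_clique_properties(clique["identifiers"])
print(json.dumps(clique_properties, indent=2))
```

With this rule, the first identifier in the clique that carries a scalar property such as description decides its clique-level value, so identifier ordering would matter.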

The property keys should be documented in Babel, and each property should be mappable to an RDF property for exports.

In Redis

We currently store descriptions and information content in ideqids. We should move them into a separate database that stores the properties as a JSON object indexed by the primary identifier. That will allow us to look the properties up quickly once we have resolved an identifier, so that we can return them to the user.
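A rough sketch of that lookup, assuming one JSON string per primary identifier. FakeRedis is an in-memory stand-in so the example runs without a server; it only mimics the get/set calls that a redis-py client also provides. The helper names store_properties and load_properties are hypothetical.

```python
import json


class FakeRedis:
    """In-memory stand-in exposing only get/set, for illustration."""

    def __init__(self):
        self._data = {}

    def set(self, name, value):
        self._data[name] = value
        return True

    def get(self, name):
        return self._data.get(name)


def store_properties(db, primary_id, properties):
    # Serialise the whole properties dictionary as one JSON value.
    db.set(primary_id, json.dumps(properties))


def load_properties(db, primary_id):
    # A real Redis client returns None for a missing key; treat that as "no properties".
    raw = db.get(primary_id)
    return json.loads(raw) if raw is not None else {}


properties_db = FakeRedis()
store_properties(
    properties_db,
    "NCBIGene:5367",
    {"in_taxon": ["NCBITaxon:9606"], "information_content": "100"},
)

looked_up = load_properties(properties_db, "NCBIGene:5367")
missing = load_properties(properties_db, "HGNC:0")  # made-up key with no entry
print(looked_up, missing)
```

Storing a single JSON value per clique keeps the lookup to one round trip after identifier resolution, at the cost of rewriting the whole object whenever one property changes.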

Synonym files

Similarly, synonym files should represent properties in the same way as Compendium files: a single properties key that holds a dictionary of key/value pairs. Since synonym files discard per-identifier information within the clique, we only store clique-level properties here.

{
  "curie": "NCBIGene:5367",
  "preferred_name": "PMCH", 
  "names": ["MCH", "PMCH", "ppMCH", "pro-MCH", "PMCH gene", "prepro-MCH", "MELANIN-CONCENTRATING HORMONE", "pro-melanin concentrating hormone", "pro-melanin-concentrating hormone", "PRO-MELANIN-CONCENTRATING HORMONE", "prepro-melanin-concentrating hormone"],
  "types": ["Gene", "GeneOrGeneProduct", "GenomicEntity", "ChemicalEntityOrGeneOrGeneProduct", "PhysicalEssence", "OntologyClass", "BiologicalEntity", "ThingWithTaxon", "NamedThing", "Entity", "PhysicalEssenceOrOccurrent", "MacromolecularMachineMixin"],
  "properties": {
    "shortest_name_length": 3,
    "curie_suffix": 5367,
    "in_taxa": ["NCBITaxon:9606"]
  }
}
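The properties block in the example above could be assembled along these lines. This is only a sketch: the function name synonym_properties is hypothetical, and the rules it applies (shortest name length taken over all names, curie_suffix included only when the CURIE suffix is an integer, clique-level in_taxon copied to in_taxa) are assumptions inferred from the two examples in this issue.

```python
def synonym_properties(curie, names, clique_properties):
    """Build the proposed synonym-file properties dictionary (assumed rules)."""
    props = {
        # Length of the shortest name; 0 if the clique has no names.
        "shortest_name_length": min((len(name) for name in names), default=0),
    }

    # Only numeric CURIE suffixes are stored, as integers.
    _, _, suffix = curie.partition(":")
    if suffix.isascii() and suffix.isdigit():
        props["curie_suffix"] = int(suffix)

    # Assumption: clique-level in_taxon becomes in_taxa in synonym files.
    if "in_taxon" in clique_properties:
        props["in_taxa"] = list(clique_properties["in_taxon"])

    return props


names = ["MCH", "PMCH", "ppMCH", "pro-MCH", "PMCH gene", "prepro-MCH"]
props = synonym_properties(
    "NCBIGene:5367", names, {"in_taxon": ["NCBITaxon:9606"]}
)
print(props)
```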

Other exports

gaurav commented 5 days ago

Duplicate of https://github.com/TranslatorSRI/Babel/issues/155.