biopragmatics / bioregistry

📮 An integrative registry of biological databases, ontologies, and nomenclatures.
https://bioregistry.io
MIT License

A general solution to the databases as ontologies problem in bioregistry #1104

Open cmungall opened 2 months ago

cmungall commented 2 months ago

There is frequently a need to represent entities from a database as an ontology

See:

There are a lot of factors to condense here but some key points

I propose that the bioregistry data model be extended to include inlined sub-records for ontology or KG translations of databases. These sub-records would carry additional metadata indicating the source (3rd party vs. official vs. quasi-official).

One case would be a 3rd party ontology rendering with reminted prefixed IDs:

ncbitaxon:
  url: <official NCBI URL>
  renderings:
    - provider: obo
      type: ontology
      documentation: ...
      subset: COMPLETE
      download_url: <OBO ontology PURL>
      prefixmap:
        NCBITaxon: <OBO PURL>
    - provider: umls
      ...
ncit:
  url: <official NCIT URL>
  renderings:
    - provider: obo
      type: ontology
      documentation: ...
      subset: COMPLETE
      download_url: <OBO ontology PURL>
      prefixmap:
        NCIT: <OBO PURL>

These renderings could even be first-class entries as far as the bioregistry UI is concerned, e.g. obo$NCBITaxon (but obviously this wouldn't be used as a prefix).
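To make the shape of the proposed sub-records concrete, here is a minimal Python sketch. The class names `Rendering` and `Record`, the `rendering_keys` helper, and the `example.org` URLs are all assumptions for illustration; none of them exist in the bioregistry codebase, and the field names simply mirror the YAML above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Rendering:
    """One ontology or KG translation of a database (hypothetical model)."""

    provider: str
    type: str
    subset: str
    documentation: Optional[str] = None
    download_url: Optional[str] = None
    # Bespoke prefix map; empty means "use the standard one for the record"
    prefixmap: Dict[str, str] = field(default_factory=dict)
    # Optional link to another existing registry entry
    bioregistry_entry: Optional[str] = None


@dataclass
class Record:
    """A registry entry with inlined renderings (hypothetical model)."""

    prefix: str
    url: str
    renderings: List[Rendering] = field(default_factory=list)

    def rendering_keys(self) -> List[str]:
        """Build display keys of the form provider$Prefix for a UI listing."""
        keys = []
        for rendering in self.renderings:
            # Fall back to the record's own prefix if no bespoke prefix is minted
            names = list(rendering.prefixmap) or [self.prefix]
            keys.extend(f"{rendering.provider}${name}" for name in names)
        return keys


# Placeholder URLs only; the real ones are elided in the proposal
ncbitaxon = Record(
    prefix="ncbitaxon",
    url="https://example.org/ncbitaxon",
    renderings=[
        Rendering(
            provider="obo",
            type="ontology",
            subset="COMPLETE",
            prefixmap={"NCBITaxon": "https://example.org/obo/NCBITaxon_"},
        )
    ],
)
print(ncbitaxon.rendering_keys())
```

The display key is derived rather than stored, so it cannot be mistaken for a prefix in its own right.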

Another would be 3rd party ontology renderings where the same prefixes and URL expansions are used:

rhea:
  url: <official RHEA URL>
  renderings:
    - provider: biopragmatics
      type: ontology
      documentation: currently this includes all annotations but this is under discussion https://github.com/biopragmatics/pyobo/issues/170
      subset: COMPLETE

Here there is no bespoke prefixmap, so the standard RHEA one would be used.
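The fallback could look roughly like the sketch below: a rendering's own prefixmap wins when it defines the prefix, otherwise the standard expansion applies. The `expand` function, the `STANDARD_PREFIXMAP` dict, and the `example.org` URLs are hypothetical stand-ins, not existing bioregistry API.

```python
from typing import Dict, Optional

# Hypothetical standard prefix map; the URL is a placeholder
STANDARD_PREFIXMAP: Dict[str, str] = {"rhea": "https://example.org/rhea/"}


def expand(curie: str, rendering_prefixmap: Optional[Dict[str, str]] = None) -> str:
    """Expand a CURIE, preferring the rendering's bespoke prefix map if any."""
    prefix, local_id = curie.split(":", 1)
    bespoke = rendering_prefixmap or {}
    if prefix in bespoke:
        return bespoke[prefix] + local_id
    # No bespoke mapping for this prefix: use the standard expansion
    return STANDARD_PREFIXMAP[prefix] + local_id


# The rhea rendering has no prefixmap, so the standard expansion is used
print(expand("rhea:12345"))
```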

perhaps controversially:

uniprotkb:
  url: <official uniprot URL>
  renderings:
    - provider: pr
      type: ontology
      documentation: PRO classes at "species-gene" level generally use same local ID as uniprotkb
      subset: OVERLAP
      bioregistry_entry: pr

Here this would be a link between two existing, overlapping bioregistry entries.
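Because `bioregistry_entry` points at another entry, a consistency check seems useful. The following is only a sketch over a trimmed-down, hypothetical registry dict; `iter_links` is an invented helper name, not something bioregistry provides.

```python
from typing import Dict, Iterator, Tuple

# Hypothetical, trimmed-down registry mirroring the uniprotkb example
registry: Dict[str, dict] = {
    "uniprotkb": {
        "renderings": [
            {
                "provider": "pr",
                "type": "ontology",
                "subset": "OVERLAP",
                "bioregistry_entry": "pr",
            }
        ]
    },
    "pr": {},
}


def iter_links(reg: Dict[str, dict]) -> Iterator[Tuple[str, str, str]]:
    """Yield (source prefix, linked prefix, subset) for each cross-entry link."""
    for prefix, record in reg.items():
        for rendering in record.get("renderings", []):
            target = rendering.get("bioregistry_entry")
            if target is None:
                continue  # rendering is not a link to another entry
            if target not in reg:
                raise KeyError(f"{prefix} links to unknown entry {target!r}")
            yield prefix, target, rendering["subset"]


print(list(iter_links(registry)))
```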

This scheme could also be used for KG renderings of databases in formats that are more suited than OWL (e.g. kgx, rdfstar with owlstar semantics)

Note that for entries that are "born" ontologies we would not curate this info; it would be considered a reflexive relation.

matentzn commented 2 months ago

I have not absorbed your proposal quite yet, but

bioregistry conflates URLs for humans with semantic URIs

While this is mostly true, it's not quite true conceptually:

"goche": {
    "contributor": {
      "email": "cthoyt@gmail.com",
      "github": "cthoyt",
      "name": "Charles Tapley Hoyt",
      "orcid": "0000-0003-4423-4370"
    },
    "description": "Represent chemical entities having particular CHEBI roles",
    "download_owl": "https://raw.githubusercontent.com/geneontology/go-ontology/master/src/ontology/imports/chebi_roles.owl",
    "example": "25512",
    "homepage": "https://github.com/geneontology/go-ontology",
    "name": "GO Chemicals",
    "pattern": "^\\d+$",
    "preferred_prefix": "GOCHE",
    "rdf_uri_format": "http://purl.obolibrary.org/obo/GOCHE_$1",
    "references": [
      "https://obo-communitygroup.slack.com/archives/C023P0Z304T/p1638472847049400",
      "https://github.com/geneontology/go-ontology/issues/19535"
    ],
    "repository": "https://github.com/geneontology/go-ontology",
    "synonyms": [
      "go.chebi",
      "go.chemical",
      "go.chemicals"
    ],
    "uri_format": "https://biopragmatics.github.io/providers/goche/$1"
  },

Check rdf_uri_format.

This does not entirely change the issue; it just adds an additional layer.
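To illustrate the two layers, here is a small sketch that expands the same local ID using either field of the goche record above. The `expand` helper is hypothetical (not bioregistry's API); the field values are copied from the record, and `$1` is treated as the placeholder for the local ID.

```python
import re

# Fields copied from the goche record above
goche = {
    "example": "25512",
    "pattern": "^\\d+$",
    "rdf_uri_format": "http://purl.obolibrary.org/obo/GOCHE_$1",
    "uri_format": "https://biopragmatics.github.io/providers/goche/$1",
}


def expand(record: dict, local_id: str, semantic: bool = False) -> str:
    """Expand a local ID to a human-facing URL or, if semantic, an RDF URI."""
    if not re.fullmatch(record["pattern"], local_id):
        raise ValueError(f"{local_id!r} does not match {record['pattern']}")
    key = "rdf_uri_format" if semantic else "uri_format"
    return record[key].replace("$1", local_id)


print(expand(goche, goche["example"]))                 # URL for humans
print(expand(goche, goche["example"], semantic=True))  # semantic URI
```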