biolink / biolink-model

Schema and generated objects for biolink data model and upper ontology
https://biolink.github.io/biolink-model/

persistent URLs #347

Closed saramsey closed 4 years ago

saramsey commented 4 years ago

I'm not sure where I should post this issue, so I will post it here and hope that it gets taken up in a future meeting of the Translator Data Modeling group.

For KG2 development, we are attempting to use persistent URIs wherever possible for identifying concepts. The Biolink model does not seem to define which persistent URI systems are preferred. That is OK; it is not obvious that any single persistent URI system or registry would work for all situations.

After a bunch of empirical testing, we have settled on a hierarchy of persistent URI remapping services, listed here from most to least preferred:

  1. identifiers.org (air, bao, bto, chebi, cl, clinicaltrials, doid, ecogene, efo, eo, fma, foodon, go, hgnc, hp, iao, icd (ICD9), icd10, ido, kegg.disease, kegg.pathway, ma, meddra, medgen, mesh, mgi, mod, mp, ncbigene, ncit, obi, oborel, omit, orphanet, pato, po, pombase, pr, rgd, sgd, snomedct, so, taxonomy, uberon, umls, uo, wb, wikidata, zfin)
  2. w3id.org (biolink)
  3. purl.obolibrary.org (bfo, bspo, caro, clo, cp, ddanat, ecto, envo, exo, fao, fbbt, fbdv, gard, mf, mfoem, mfomd, mondo, mpath, nbo, oba, ogms, omim, omimps, oncotree, opl, to, zea, zfa)
  4. purl.org (dc, oban, oborel)
  5. no registry known to us (cgnc, foaf, oio, omop, owl, sio, skos)

Any thoughts on this? Interestingly, some of the registries overlap but do not always agree on the CURIE prefix or CURIE identifier format. We have opted to use CURIE prefixes from the above sources in the above priority order, since identifiers.org seems to be (by far) the most complete and we have found it to be easy to search and its documentation fairly intuitive. Anyhow, we are wondering what other Translator teams are using for their persistent URI mapping needs.
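
A minimal sketch of the priority rule described above, assuming a prefix is looked up case-insensitively and the first registry that knows it wins. The membership sets are abbreviated from the lists in this post, and the function name `preferred_registry` is hypothetical:

```python
# Registries in descending priority order, each with a few of the
# prefixes listed above (abbreviated for illustration).
REGISTRY_PRIORITY = [
    ("identifiers.org", {"hgnc", "kegg.pathway", "mesh", "oborel"}),
    ("w3id.org", {"biolink"}),
    ("purl.obolibrary.org", {"bfo", "envo", "mondo"}),
    ("purl.org", {"dc", "oban", "oborel"}),
]


def preferred_registry(prefix):
    """Return the highest-priority registry that knows the prefix, or None."""
    key = prefix.lower()
    for registry, prefixes in REGISTRY_PRIORITY:
        if key in prefixes:
            return registry
    return None


# "oborel" appears under both identifiers.org and purl.org; the earlier entry wins.
print(preferred_registry("oborel"))
```

Prefixes with no known registry (group 5 above) fall through to `None` and would need a hand-maintained base URL.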

balhoff commented 4 years ago

For any terms that are "semantic web native", I would definitely use the URI given by the creators. For any OBO library ontology, a term URI starts with http://purl.obolibrary.org/obo/. To me, it creates confusion to change the URIs of these terms. From your examples, it would be incorrect for Oncotree or OMIM terms to have OBO PURLs, since these aren't part of the OBO library. But there are several OBO namespaces you listed under identifiers.org which should use OBO PURLs.

Similarly for foaf, owl, dc, and skos; these are published under specific prefixes.

The Biolink model has a file specifying prefix expansions: https://github.com/biolink/biolink-model/blob/master/context.jsonld

It's incomplete and needs some attention, but I think it is a good place to start.
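
To make the OBO PURL convention concrete, here is a small sketch that converts between a CURIE and an OBO PURL. It assumes the usual `PREFIX_LOCALID` shape after the base URL and that the prefix itself contains no underscore; the function names are hypothetical:

```python
OBO_BASE = "http://purl.obolibrary.org/obo/"


def curie_to_obo_purl(curie):
    """Turn 'ExO:0000004' into 'http://purl.obolibrary.org/obo/ExO_0000004'."""
    prefix, local_id = curie.split(":", 1)
    return f"{OBO_BASE}{prefix}_{local_id}"


def obo_purl_to_curie(uri):
    """Inverse of curie_to_obo_purl; rejects URIs outside the OBO namespace."""
    if not uri.startswith(OBO_BASE):
        raise ValueError(f"not an OBO PURL: {uri}")
    # Split on the first underscore: prefix on the left, local ID on the right.
    prefix, sep, local_id = uri[len(OBO_BASE):].partition("_")
    if not sep:
        raise ValueError(f"no PREFIX_ID part in: {uri}")
    return f"{prefix}:{local_id}"


print(curie_to_obo_purl("ExO:0000004"))
```

This only covers term URIs; other resources under the same base (for example, the `envo#` subset tags) do not follow this pattern.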

saramsey commented 4 years ago

Thanks for your reply! I appreciate your suggestions, as I am a newbie with semantic web stuff. Regarding OncoTree and OMIM, I should note that I found this in mondo.owl:

<owl:equivalentClass rdf:resource="http://purl.obolibrary.org/obo/ONCOTREE_GINET"/>

similarly, I found this in mondo.owl:

 <owl:equivalentClass rdf:resource="http://purl.obolibrary.org/obo/OMIM_424500"/>

I am not assigning these identifiers OBO PURLs; they seem to come to me that way when I import MONDO.

balhoff commented 4 years ago

Regarding OncoTree and OMIM, I should note that I found this in mondo.owl:

Thanks for pointing that out! This is a bug :-)

https://github.com/monarch-initiative/mondo-ingest/issues/199

saramsey commented 4 years ago

Here is a line from mondo.owl

    <owl:Class rdf:about="http://purl.obolibrary.org/obo/KEGG_05215">

containing a persistent URL, http://purl.obolibrary.org/obo/KEGG_05215, that does not seem to work (takes me to the dreaded http://ontologies.berkeleybop.org/ page). That, plus the fact that this URL doesn't make explicit that the KEGG term is a KEGG pathway (and not, say, a KEGG disease or other semantic type) is why I have opted to use identifiers.org for KEGG pathways; in this case I feel that their CURIE prefix, kegg.pathway, is more informative.

saramsey commented 4 years ago

Thank you for suggesting the context.jsonld file; very helpful and I will study it. Right off the bat, there are a few items that I do not quite understand. For example on line 27,

"ExO": "http://example.org/UNKNOWN/ExO/",

I would have thought that the best purl for an ExO term would be something like

http://purl.obolibrary.org/obo/ExO_0000004

which does in fact work (resolves to the expected page on Ontobee).

saramsey commented 4 years ago

In context.jsonld, on line 34,

      "HGNC": "http://www.genenames.org/cgi-bin/gene_symbol_report?hgnc_id=",

I cannot get that base URL to work (probably a PEBKAC issue). I wonder if it would be preferable to use:

    "hgnc": "https://identifiers.org/hgnc:"

which does work; at least, this URL

https://identifiers.org/hgnc:9967

resolves to what I think is the current page:

https://www.genenames.org/data/gene-symbol-report/#!/hgnc_id/9967

balhoff commented 4 years ago

I agree about ExO; I'm not sure why it's in there. It's some sort of placeholder, and I think all the other entries in the file that include "UNKNOWN" are placeholders too. I'm not sure what to prefer for HGNC; for many resources, identifiers.org may be the best choice if they don't publish any preferred URI form. Hopefully some other folks can weigh in on some of the non-OBO, bio database stuff.

saramsey commented 4 years ago

For any OBO library ontology, a term URI starts with http://purl.obolibrary.org/obo/.

Thanks for the suggestion. I would like to look into following this advice. But I confess I am at a bit of a loss as to how to know if a particular ontology is an "OBO library" ontology or not. Is there a definitive list? I know that OBO Foundry has a list http://www.obofoundry.org/ but I am not sure if that is what you mean by "OBO library" ontology. On the other hand, I guess I could look for any "OBO namespace" URI appearing in one of the many OWL ontologies that I am loading, but that seems fraught since many of the URIs don't resolve. Perhaps I could look at the YAML files in https://github.com/OBOFoundry/purl.obolibrary.org/tree/master/config, but I am not sure if that is the right thing to do either, since there are ontologies in there (e.g., NCIt) that I do not load from OBO but rather from the UMLS distribution. Please pardon my ignorance!

balhoff commented 4 years ago

@saramsey the official OBO registry source is here: https://github.com/OBOFoundry/OBOFoundry.github.io/tree/master/ontology

(the other repo is redirect configs, but it's not exactly the same)

Those are compiled into these files which can be used in software: https://github.com/OBOFoundry/OBOFoundry.github.io/tree/master/registry

NCIT is there because there is an "NCIT OBO edition" that has all the same terms but is structured more in line with OBO conventions. There are also related files that provide some deeper integration with Uberon and other OBOs.

saramsey commented 4 years ago

Thank you, @balhoff! OK, I have already switched to using purl.obolibrary.org (instead of identifiers.org) for the CL and UBERON ontologies and I'm in the process of switching the others.

saramsey commented 4 years ago

OK, here is the list of CURIE prefix to URL mappings that I am now using (it is a YAML file). If I have failed to identify any OBO ontologies that should be mapped to obolibrary PURLs, I'd be grateful for a heads-up.

use_for_bidirectional_mapping:
  -
    AIR: https://identifiers.org/umls/AIR/
  -
    bao: "https://identifiers.org/bao:"
  -
    BFO: http://purl.obolibrary.org/obo/BFO_
  -
    BSPO: http://purl.obolibrary.org/obo/BSPO_
  -
    BTO: http://purl.obolibrary.org/obo/BTO_
  -
    biolink: https://w3id.org/biolink/
  -
    CARO: http://purl.obolibrary.org/obo/CARO_
  -
    CGNC: "http://birdgenenames.org/cgnc/GeneReport?id="
  -
    CHEBI: http://purl.obolibrary.org/obo/CHEBI_
  -
    CL: http://purl.obolibrary.org/obo/CL_
  -
    clinicaltrials: "https://identifiers.org/clinicaltrials:"
  -
    CLO: http://purl.obolibrary.org/obo/CLO_
  -
    CP: http://purl.obolibrary.org/obo/CP_
  -
    dbpedia: http://dbpedia.org/resource/
  -
    dc: http://purl.org/dc/elements/1.1/
  -
    DDANAT: http://purl.obolibrary.org/obo/DDANAT_
  -
    DOID: http://purl.obolibrary.org/obo/DOID_
  -
    ecogene: "https://identifiers.org/ecogene:"
  -
    ECTO: http://purl.obolibrary.org/obo/ECTO_
  -
    efo: "https://identifiers.org/efo:"
  -
    EnsemblGenomes: http://www.ensemblgenomes.org/id/
  -
    ENVO: http://purl.obolibrary.org/obo/envo#
  -
    EO: http://purl.obolibrary.org/obo/EO_
  -
    ExO: http://purl.obolibrary.org/obo/ExO_
  -
    FAO: http://purl.obolibrary.org/obo/FAO_
  -
    FBbt: http://purl.obolibrary.org/obo/FBbt_
  -
    FBgn: http://flybase.org/reports/FBgn
  -
    FBdv: http://purl.obolibrary.org/obo/FBdv_
  -
    FMA: http://purl.obolibrary.org/obo/FMA_
  -
    foaf: http://xmlns.com/foaf/0.1/
  -
    FOODON: http://purl.obolibrary.org/obo/FOODON_
  -
    GARD: http://purl.obolibrary.org/obo/GARD_
  -
    GO: http://purl.obolibrary.org/obo/GO_
  -
    hgnc: "https://identifiers.org/hgnc:"
  -
    HP: http://purl.obolibrary.org/obo/HP_
  -
    iao: http://purl.obolibrary.org/obo/IAO_
  -
    icd: "https://identifiers.org/icd:"
  -
    ICD9: http://purl.obolibrary.org/obo/ICD9_
  -
    identifiers_org_registry: "https://identifiers.org/registry/"
  -
    IDO: http://purl.obolibrary.org/obo/IDO_
  -
    kegg.disease: "https://identifiers.org/kegg.disease:H"
  -
    kegg.pathway: "https://identifiers.org/kegg.pathway:hsa"
  -
    MA: http://purl.obolibrary.org/obo/MA_
  -
    meddra: "https://identifiers.org/meddra:"
  -
    medgen: "https://identifiers.org/medgen:"
  -
    mesh: "https://identifiers.org/mesh:"
  -
    MF: http://purl.obolibrary.org/obo/MF_
  -
    MFOEM: http://purl.obolibrary.org/obo/MFOEM_
  -
    MFOMD: http://purl.obolibrary.org/obo/MFOMD_
  -
    MGI: "https://identifiers.org/MGI:"
  -
    MOD: http://purl.obolibrary.org/obo/MOD_
  -
    MONDO: http://purl.obolibrary.org/obo/MONDO_
  -
    MP: http://purl.obolibrary.org/obo/MP_
  -
    MPATH: http://purl.obolibrary.org/obo/MPATH_
  -
    NBO: http://purl.obolibrary.org/obo/NBO_
  -
    ncbigene: "https://identifiers.org/ncbigene:"
  -
    ncit: "https://identifiers.org/ncit:"
  -
    NPO: http://purl.obolibrary.org/obo/NPO_
  -
    OBA: http://purl.obolibrary.org/obo/OBA_
  -
    OBAN: http://purl.org/oban/
  -
    OBI: http://purl.obolibrary.org/obo/OBI_
  -
    OBO: http://purl.obolibrary.org/obo/
  -
    OBOREL: "http://purl.org/obo/owl/OBO_REL#"
  -
    OGMS: http://purl.obolibrary.org/obo/OGMS_
  -
    OIO: http://www.geneontology.org/formats/oboInOwl#
  -
    OMIM: http://purl.obolibrary.org/obo/OMIM_
  -
    OMIMDiseaseCluster: http://purl.obolibrary.org/obo/DC_
  -
    OMIMPS: http://purl.obolibrary.org/obo/OMIMPS_
  -
    OMIT: http://purl.obolibrary.org/obo/OMIT_
  -
    OMOP: https://athena.ohdsi.org/search-terms/terms/
  -
    OncoTree: http://purl.obolibrary.org/obo/ONCOTREE_
  -
    OPL: http://purl.obolibrary.org/obo/OPL_
  -
    orphanet: "https://identifiers.org/orphanet:"
  -
    owl: http://www.w3.org/2002/07/owl#
  -
    PATO: http://purl.obolibrary.org/obo/PATO_
  -
    PO: http://purl.obolibrary.org/obo/PO_
  -
    pombase: "https://identifiers.org/pombase:"
  -
    PR: http://purl.obolibrary.org/obo/PR_
  -
    rdf: https://www.w3.org/TR/2004/REC-owl-guide-20040210/#
  -
    rdfs: http://www.w3.org/2000/01/rdf-schema#
  -
    REPODB: http://apps.chiragjpgroup.org/repoDB#
  -
    RGD: "https://identifiers.org/rgd:"
  -
    RO: http://purl.obolibrary.org/obo/RO_
  -
    RTX: http://rtx.ai/identifiers#
  -
    RTXKG1: http://arax.rtx.ai/
  -
    sgd: "https://identifiers.org/sgd:"
  -
    SIO: http://semanticscience.org/resource/SIO_
  -
    skos: http://www.w3.org/2004/02/skos/core#
  -
    snomedct: "https://identifiers.org/snomedct:"
  -
    SO: http://purl.obolibrary.org/obo/SO_
  -
    NCBITaxon: http://purl.obolibrary.org/obo/NCBITaxon_
  -
    TO: http://purl.obolibrary.org/obo/TO_
  -
    TUI: https://identifiers.org/umls/STY/
  -
    UBERON: http://purl.obolibrary.org/obo/UBERON_
  -
    umls: "https://identifiers.org/umls:"
  -
    UMLS: https://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/
  -
    UO: http://purl.obolibrary.org/obo/UO_
  -
    wb: "https://identifiers.org/wb:"
  -
    wikidata: "https://identifiers.org/wikidata:"
  -
    ZEA: http://purl.obolibrary.org/obo/ZEA_
  -
    ZFA: http://purl.obolibrary.org/obo/ZFA_
  -
    zfin: "https://identifiers.org/zfin:"
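
A sketch of how a mapping shaped like the one above could be used in both directions. It assumes a YAML loader would yield a list of one-key dicts for this file; a small Python literal with three entries copied from the list stands in for the loaded data, and the helper names are hypothetical:

```python
# Stand-in for the structure a YAML loader would return for the file above.
raw = {
    "use_for_bidirectional_mapping": [
        {"CL": "http://purl.obolibrary.org/obo/CL_"},
        {"OBO": "http://purl.obolibrary.org/obo/"},
        {"hgnc": "https://identifiers.org/hgnc:"},
    ]
}


def flatten(entries):
    """Collapse the list of one-key dicts into a single prefix -> base URL dict."""
    prefix_to_base = {}
    for entry in entries:
        for prefix, base in entry.items():
            if prefix in prefix_to_base:
                raise ValueError(f"duplicate prefix: {prefix}")
            prefix_to_base[prefix] = base
    return prefix_to_base


def expand(curie, prefix_to_base):
    """CURIE -> URL by simple concatenation."""
    prefix, local_id = curie.split(":", 1)
    return prefix_to_base[prefix] + local_id


def contract(url, prefix_to_base):
    """URL -> CURIE; the longest matching base wins, so CL_ beats the generic OBO base."""
    matches = [(base, prefix) for prefix, base in prefix_to_base.items()
               if url.startswith(base)]
    if not matches:
        raise KeyError(url)
    base, prefix = max(matches, key=lambda pair: len(pair[0]))
    return f"{prefix}:{url[len(base):]}"


mapping = flatten(raw["use_for_bidirectional_mapping"])
print(expand("hgnc:9967", mapping))
print(contract("http://purl.obolibrary.org/obo/CL_0000540", mapping))
```

Longest-match contraction matters here because the catch-all `OBO` base is a prefix of every OBO term URL.
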
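
balhoff commented 4 years ago

@saramsey looking good! I have a few comments:

  • ENVO: http://purl.obolibrary.org/obo/envo# should be http://purl.obolibrary.org/obo/ENVO_ (but I wouldn't be surprised if there are a few resources inside ENVO that start with http://purl.obolibrary.org/obo/envo#, usually those are ontology subset tags)
  • I don't think GARD is an OBO ontology
  • MeSH provides an RDF version and uses mesh: http://id.nlm.nih.gov/mesh/
  • rdf should be http://www.w3.org/1999/02/22-rdf-syntax-ns#
  • Have you come across OBOREL terms? Just curious, because this was the predecessor of RO and I'm wondering if these IDs are still being used.

Should your list and the biolink-model list be coordinated going forward? Just wondering if there are cases where you need to diverge, or if the biolink one shouldn't just be updated. There are clearly errors to be fixed in the biolink context.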

cmungall commented 4 years ago

Everything @balhoff says is correct, but it's not necessary for you to go to N different registries.

See https://biolink.github.io/biolink-model/#identifiers

We provide a jsonld context that you can use: https://biolink.github.io/biolink-model/context.jsonld

I see there are a few issues with some of them; @kshefchek is working on this.

balhoff commented 4 years ago

@cmungall does that documentation mean that we should not make edits to https://github.com/biolink/biolink-model/blob/master/context.jsonld, but rather this has to be done at prefixcommons?

cmungall commented 4 years ago

Yes, the context.jsonld is entirely derived. Ultimately the upstream registries are the authorities. But we can prioritize one authority over another if there is a clash.

https://github.com/biolink/biolink-model/blob/ee4f4cb0930ff90ee9eb2ee0d8049b9f3c62f38f/biolink-model.yaml#L20-L24

And we can override and plug gaps directly:

https://github.com/biolink/biolink-model/blob/ee4f4cb0930ff90ee9eb2ee0d8049b9f3c62f38f/biolink-model.yaml#L6-L15
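
The prioritize-then-override behavior described above could be sketched roughly as follows. This is not the actual biolink-model build code; `build_context` is a hypothetical helper, and the prefix maps hold illustrative values rather than real registry contents:

```python
def build_context(registries_by_priority, overrides):
    """Merge prefix maps: earlier registries win clashes, overrides win over all."""
    context = {}
    # Walk from lowest to highest priority so higher-priority entries overwrite.
    for prefix_map in reversed(registries_by_priority):
        context.update(prefix_map)
    # Direct overrides are applied last, which also plugs gaps.
    context.update(overrides)
    return context


# Illustrative values only.
obo = {"GO": "http://purl.obolibrary.org/obo/GO_"}
other = {"GO": "https://example.org/go/", "HGNC": "https://identifiers.org/hgnc:"}
overrides = {"ExO": "http://purl.obolibrary.org/obo/ExO_"}

context = build_context([obo, other], overrides)
print(context["GO"])
```

Here the clash on `GO` is resolved in favor of the first-listed registry, while `ExO`, which neither registry supplies, comes from the override map.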

cmungall commented 4 years ago

We should probably have the jsonld context display in a more human friendly form in the derived documentation

TomConlin commented 4 years ago

Noting a need to be aware of the concepts of "external-base-uri" and "internal-base-uri", where KGs, ontologies, and reasoning all benefit greatly from the nice, regular, consistent forms provided by third-party resolvers, which I am collectively referring to as "internal-base-uri".

In sometimes stark contrast are the irregular, messy native "external-base-uri" forms, which

  • our data sources actually produce and maintain
  • the wider population (nontologists) expect to see
  • mashed-up/aggregated data from the wild uses and will continue to use

The way forward I see (for publicly interfacing aspects) is to maintain both internal and external mappings for a common set of curie-prefixes and convert from and to as required.

Typically that means converting from external to internal to make life easier, and from internal to external when publicly publishing results, so as not to alienate our sources.

  • N.B. Harold states he has been sued for changing identifier URLs.

These are the mappings in common with the dipper curie_map.yaml that catch my eye as different. However, both dipper's input and output are 100% public, and as such it may be a different use case than a reasoner. But converging on common curie-prefixes is important in any case.

IAO http://purl.obolibrary.org/obo/IAO_

MGI http://www.informatics.jax.org/accession/MGI:

MUGEN   http://bioit.fleming.gr/mugen/Controller?workflow=ViewModel&expand_all=true&name_begins=model.block&eid=                                                                     

OMIM    http://omim.org/entry/

OMIMPS  http://www.omim.org/phenotypicSeries/

rdf http://www.w3.org/1999/02/22-rdf-syntax-ns#

RGD http://rgd.mcw.edu/rgdweb/report/gene/main.html?id=

skos    https://www.w3.org/TR/skos-reference/#

SNOMED  http://purl.obolibrary.org/obo/SNOMED_
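
A rough sketch of the two-map idea: keep an internal and an external base URL for each shared curie-prefix and rebase a URL from one form to the other. The external bases are taken from the table above and the internal ones from the earlier YAML list; the helper name `rebase` and the local IDs in the example are hypothetical:

```python
# One entry per shared curie-prefix in each map.
INTERNAL = {
    "MGI": "https://identifiers.org/MGI:",
    "RGD": "https://identifiers.org/rgd:",
}
EXTERNAL = {
    "MGI": "http://www.informatics.jax.org/accession/MGI:",
    "RGD": "http://rgd.mcw.edu/rgdweb/report/gene/main.html?id=",
}


def rebase(url, source, target):
    """Swap the base URL of `url` from the source map to the target map."""
    for prefix, base in source.items():
        if url.startswith(base) and prefix in target:
            return target[prefix] + url[len(base):]
    raise ValueError(f"no shared prefix covers: {url}")


# External -> internal on ingest, internal -> external on publish.
internal = rebase("http://www.informatics.jax.org/accession/MGI:12345", EXTERNAL, INTERNAL)
print(internal)
print(rebase(internal, INTERNAL, EXTERNAL))
```

Because both maps are keyed by the same prefixes, the conversion round-trips as long as the local ID format is identical in both forms.
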

saramsey commented 4 years ago

@saramsey looking good! I have a few comments:

  • ENVO: http://purl.obolibrary.org/obo/envo# should be http://purl.obolibrary.org/obo/ENVO_ (but I wouldn't be surprised if there are a few resources inside ENVO that start with http://purl.obolibrary.org/obo/envo#, usually those are ontology subset tags)
  • I don't think GARD is an OBO ontology
  • MeSH provides an RDF version and uses mesh: http://id.nlm.nih.gov/mesh/
  • rdf should be http://www.w3.org/1999/02/22-rdf-syntax-ns#

Thank you so much! I am going to fix these issues. If I find a suitable purl registry for GARD concepts, I will post it here.

  • Have you come across OBOREL terms? Just curious, because this was the predecessor of RO and I'm wondering if these IDs are still being used.

An example, from efo.owl:

<owl:onProperty rdf:resource="http://purl.org/obo/owl/OBO_REL#role_of"/>

and from hp.owl:

<dc:source>http://www.obofoundry.org/ro/#OBO_REL:preceded_by</dc:source>

Should your list and the biolink-model list be coordinated going forward? Just wondering if there are cases where you need to diverge, or if the biolink one shouldn't just be updated. There are clearly errors to be fixed in the biolink context.

Yes, coordination makes sense. I'm not aware of any places where we have to diverge; just some changes we would like to propose that would trigger updates to the biolink context.jsonld. If any true use-cases for divergence arise, I will definitely post here.

saramsey commented 4 years ago

FWIW, here is the latest version of the CURIE<->URL mappings that we are using for ARAX KG2 (the first section, use_for_bidirectional_mapping, is the one to look at; the other sections are for cleaning up incorrect or messed-up URLs or CURIE prefixes that somehow arose from upstream sources or during our ingestion processes):

https://github.com/RTXteam/RTX/blob/kg2-curie-refactoring/code/kg2/curies-to-urls-map.yaml

saramsey commented 4 years ago

Noting a need to be aware of the concepts of "external-base-uri" and "internal-base-uri" where KG's, ontologies and reasoning all benefit greatly from the nice regular consistent forms provided by third party resolvers which I am collectively referring to as "internal-base-uri".

In sometimes stark contrast are the irregular messy native "external-base-uri" which

  • our data sources actually produce and maintain
  • the wider population (nontologists) expect to see.
  • mashed-up/aggregated data from the wild is and will continue to use

The way forward I see (for publicly interfacing aspects) is to maintain both internal and external mapping for a common set of curie-prefixes and convert from and to as required.

Where required is typically to internal from external to make life easier and from internal to external for publicly publishing results without alienating our sources.

  • N.B. Harold states he has been sued for changing identifier urls.

These are the mappings in common with the dipper curie_map.yaml that catch my eye as different. However, both dipper's input and output are 100% public, and as such it may be a different use case than a reasoner. But converging on common curie-prefixes is important in any case.

IAO   http://purl.obolibrary.org/obo/IAO_

MGI   http://www.informatics.jax.org/accession/MGI:

MUGEN http://bioit.fleming.gr/mugen/Controller?workflow=ViewModel&expand_all=true&name_begins=model.block&eid=                                                                     

OMIM  http://omim.org/entry/

OMIMPS    http://www.omim.org/phenotypicSeries/

rdf   http://www.w3.org/1999/02/22-rdf-syntax-ns#

RGD   http://rgd.mcw.edu/rgdweb/report/gene/main.html?id=

skos  https://www.w3.org/TR/skos-reference/#

SNOMED    http://purl.obolibrary.org/obo/SNOMED_

Thank you, @TomConlin! I will specifically check these entries in the KG2 curies-to-urls-map.yaml file. I note that a URL built from the SNOMED base above does not seem to work in my hands:

http://purl.obolibrary.org/obo/SNOMED_106562006

but I can confirm that SNOMED CT concepts are available in purl.bioontology.org:

http://purl.bioontology.org/ontology/SNOMEDCT/106562006

FWIW, I am using the following sources to resolve URLs for identifiers with the above-referenced CURIE prefixes:

saramsey commented 4 years ago

Where required is typically to internal from external to make life easier and from internal to external for publicly publishing results without alienating our sources.

N.B. Harold states he has been sued for changing identifier urls.

Wow! I was not aware of that. It's somewhat astounding that an upstream source would sue (as opposed to sending a C&D letter) an individual developer over using an 'internal URL' over an 'external URL'.

hsolbrig commented 4 years ago

To be clear - threatened with suit. I'll not name the organization, but we assigned every organization an OID in the HL7 OID registry. So it was a C&D letter.

saramsey commented 4 years ago

To be clear - threatened with suit. I'll not name the organization, but we assigned every organization an OID in the HL7 OID registry. So it was a C&D letter.

Very helpful. Thank you.

cmungall commented 4 years ago

Closing this now, as I believe it's well understood that the canonical ID-to-URI expansion is here: https://biolink.github.io/biolink-model/#identifiers

Open another issue if further clarification is required!