biopragmatics / bioregistry

📮 An integrative registry of biological databases, ontologies, and nomenclatures.
https://bioregistry.io
MIT License

Confused about how Namespace in Pattern works #191

Closed dhimmel closed 2 years ago

dhimmel commented 3 years ago

An example Human Disease Ontology ID is DOID:0110974 and the prefix is doid, hence I agree that Namespace in Pattern is true at http://bioregistry.io/registry/doid

[image]

Why is the example identifier a bare 0110974 rather than one that includes the DOID: prefix?

Now looking at EFO at http://bioregistry.io/registry/efo:

[image]

Namespace in Pattern is false. This surprises me because I never see bare EFO IDs like 0005147, but always EFO:0005147. How is EFO different from Disease Ontology such that Namespace in Pattern differs?

bgyori commented 3 years ago

As far as I know, and can tell based on providers, EFO IDs do not contain the namespace, they are just bare IDs. I do tend to agree that - despite the fact that Bioregistry can correctly resolve IDs even if the embedded namespace is not given - the Example Identifier should contain the namespace when it's part of the pattern.

dhimmel commented 3 years ago

> As far as I know, and can tell based on providers, EFO IDs do not contain the namespace, they are just bare IDs.

Can you comment more on this? Does "providers" refer to identifiers.org?

If we look in the EFO OWL (which imports terms from other ontologies), I see the following snippets providing IDs to terms:

<oboInOwl:id>EFO:0003885</oboInOwl:id>
<oboInOwl:id>MONDO:0012237</oboInOwl:id>
<oboInOwl:id>DOID:7551</oboInOwl:id>

Now perhaps all of these IDs are in CURIE format, so this is not very informative? But I'm left wondering what the distinction is between DOID (namespace in pattern) and MONDO/EFO (namespace not in pattern). In my experience, they are all similar in that they are always provided with the prefix and never in bare form.

bgyori commented 3 years ago

Yes, I meant, e.g., identifiers.org; you can see the difference in the ID patterns:

[image] vs [image]

In terms of the OWL, EFO: is likely there explicitly because the EFO ontology contains elements from other ontologies like BFO. However, for consistency, one might expect to see DOID:DOID:7551 since DO IDs are supposed to have the namespace embedded...

dhimmel commented 3 years ago

> Yes, I meant e.g., identifiers.org, you can see the difference in the ID patterns,

I am entertaining the possibility that identifiers.org (and the other registries) are wrong. Although perhaps they are so authoritative that their wrongs become future rights... and that it would be a mistake to deviate. Although I see bioregistry does sometimes deviate from identifiers.org, like adding ICD revisions (e.g. icd9).

@matentzn as someone who is involved in MONDO, is the MONDO: prefix part of the local identifier? @zoependlington, how about for EFO?

> However, for consistency, one might expect to see DOID:DOID:7551 since DO IDs are supposed to have the namespace embedded...

I thought the whole purpose of the "Namespace in Pattern" field is that you can just use id as the CURIE rather than prefix:id. Is that because DOID:DOID:7551 is something bad that we want to avoid? Although perhaps it is not bad, just something you never see in the wild, like in ontology xrefs.

matentzn commented 3 years ago

I think this question raises an important issue about the naming of "things" in the process of dealing with prefixes, CURIEs, IDs, URI namespace prefixes, local IDs, etc. No one would ever refer to 0000001 to denote "Disease or Disorder", but I would still say that it is somehow the "local ID". Once @cthoyt has a PR ready for the glossary, I will give more details.

cthoyt commented 3 years ago

I will return with a (hopefully) satisfying answer later today. Short version is that in OBO world, they consider entire CURIEs as local identifiers in the OBO namespace, and this nonsense has propagated far and wide, causing chaos and confusion for everyone (even reaching outside of the OBO Foundry to the EFO)

matentzn commented 3 years ago

Don't forget that by far the majority of the tools use something even worse, which is called short_form (add to glossary). For example, instead of using MONDO:0000001 (CURIE), they use MONDO_0000001 (short form). Tough! But we need facilities like regex checking to avoid this, so that MONDO:MONDO_0000001 does not crop up in our databases.
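The kind of regex check mentioned above could look roughly like the following sketch. The pattern table is hand-written for illustration (the seven-digit MONDO pattern is an assumption, not taken from any registry), and `is_valid_curie` is a hypothetical helper, not a Bioregistry function.

```python
import re

# Hypothetical, hand-written patterns for *local* identifiers, keyed by prefix.
# The seven-digit MONDO pattern is an assumption made for this example.
LUI_PATTERNS = {
    "MONDO": re.compile(r"\d{7}"),
}


def is_valid_curie(curie: str) -> bool:
    """Return True if the CURIE has a known prefix and a well-formed local identifier."""
    prefix, sep, local_id = curie.partition(":")
    if not sep:
        # No colon at all, e.g. the short form MONDO_0000001
        return False
    pattern = LUI_PATTERNS.get(prefix)
    # fullmatch rejects anything beyond the bare local identifier,
    # so a repeated prefix such as MONDO:MONDO_0000001 fails here
    return pattern is not None and pattern.fullmatch(local_id) is not None
```

Because the check runs against the local identifier only, both the short form and the doubled prefix are rejected before they reach a database.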

cthoyt commented 3 years ago

I'm not really sure what the difference between a short form and a CURIE is, other than the choice of a non-W3C-standard delimiter of _ instead of :.

Anyway, thanks for being patient while I've been writing. I've finished a first draft of a glossary and added a section about this at https://cthoyt.com/2021/10/07/biopragmatics-glossary.html#open-biomedical-ontologies-curies. I would be keen to continue arguing back and forth so I can improve this section as much as possible. I am pretty against the whole namespace embedded in LUI thing and will do my best to strengthen the arguments against it to begin the movement away.

cthoyt commented 3 years ago

Btw, one of the other main reasons for confusion is that a lot of people just write "identifiers" when they mean CURIEs, not local identifiers. So if you live in a world where someone has a column called ID and you see EFO:123457 in it, it means this column has a CURIE in it, and is just confusingly named. Same thing goes for all of the OBO in OWL - there's a deep misnomer there that an "identifier" is either a URI or CURIE.

I am in the camp that identifiers.org is wrong to have these regular expressions looking like this. It's super dangerous for people who don't know the historical OBO reasons for this and will lead to lots of bad choices, so perhaps the best thing to do on the bioregistry site is to show both the imported regular expression, and the normalized regular expression for each namespace, along with an explanation.

Sorry about the mishmash of answers down here; it's not so great since GitHub doesn't have first-class threaded discussion.

matentzn commented 3 years ago

There is a technical difference between short_form and CURIE as well. Short form is usually used to denote the remainder of a URL after the URL namespace has been cut off. So while MONDO_0000001 is the short form for http://purl.obolibrary.org/obo/MONDO_0000001, license is the short form of http://purl.org/dc/terms/license. It's basically not anything: it's a construct that was used for convenience, simply meaning "the last part of the path or the URL". The only reason to mention it is to explicitly forbid its use anywhere.
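As a rough illustration of why a short form carries no prefix information, here is a minimal sketch of one way tools might compute it. The function name `short_form` and the choice of `#` and `/` as cut points are assumptions for this example, not a standard definition.

```python
def short_form(uri: str) -> str:
    """Return the text after the last '#' (or, failing that, the last '/') in a URI."""
    for sep in ("#", "/"):
        _head, found, tail = uri.rpartition(sep)
        if found:
            # Everything after the separator; the namespace is simply thrown away
            return tail
    # No separator at all: the whole string is its own "short form"
    return uri
```

Both `http://purl.obolibrary.org/obo/MONDO_0000001` and `http://purl.org/dc/terms/license` go through the same code path, which shows that the result is just a string fragment: nothing in it says which part, if any, is a prefix.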

dhimmel commented 3 years ago

> I am pretty against the whole namespace embedded in LUI

So based on @cthoyt's arguments here, it seems like one solution would be to:

  1. switch namespace in pattern to False for DOID, such that it matches EFO and MONDO.
  2. update the pattern for DOID to remove the DOID: prefix

Of course, this should be done in a consistent way across all records where namespace in pattern is True.

@cthoyt are there any vocabularies where you think setting namespace in pattern to True is justified, or is it always a superfluous complexity?

cthoyt commented 3 years ago

My opinion is the whole namespace in pattern is only a quirk introduced by Identifiers.org, and doesn't really provide a meaningful description of the world. It's due to a conflation of three things:

  1. What it means to be a local identifier inside an ID space
  2. What it means to be an IRI or URI which in the semantic web and ontology world is the only kind of valid "identifier"
  3. What it means to be a compact URI (CURIE), which for some reason people also sometimes call a compact identifier since it can somehow correspond to a URI given a prefix map

The namespace in LUI concept arises when people accidentally write a regular expression for matching CURIEs from a given ID space when what they actually want is to match local identifiers in a given namespace. I've made a PR to update the website to better reflect this in #214:

[screenshot]
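To make the distinction concrete, the sketch below rewrites a CURIE-matching pattern into a local-identifier pattern by dropping the embedded namespace. The example pattern `^CHEBI:\d+$` is the one quoted for ChEBI later in this thread; the helper `strip_namespace_from_pattern` is hypothetical and is not how the Bioregistry itself stores or rewrites patterns.

```python
import re


def strip_namespace_from_pattern(pattern: str, namespace: str) -> str:
    """Rewrite a pattern that matches 'NAMESPACE:local' so it matches only 'local'."""
    embedded = "^" + namespace + ":"
    if pattern.startswith(embedded):
        # Keep the start anchor, drop the literal namespace and colon
        return "^" + pattern[len(embedded):]
    # Patterns without an embedded namespace are returned unchanged
    return pattern


curie_pattern = r"^CHEBI:\d+$"
lui_pattern = strip_namespace_from_pattern(curie_pattern, "CHEBI")

# The original pattern accepts a CURIE; the rewritten one accepts only the local identifier
matches_curie = re.match(curie_pattern, "CHEBI:138488") is not None
matches_lui = re.match(lui_pattern, "138488") is not None
rejects_curie = re.match(lui_pattern, "CHEBI:138488") is None
```

This simple prefix-stripping only handles patterns that begin with a literal `^NAMESPACE:`; patterns where the namespace is optional or written differently would need manual curation.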

It's still my opinion that the namespace embedded in LUI concept is completely redundant, and my recommendation would be to remove it completely. Additionally, the patterns from Identifiers.org that correspond to CURIEs need to be updated to correspond to local identifiers, which I've done in proposed PR #213. I won't merge this PR until I get a consensus. I would of course love to petition Identifiers.org to update this, but 1) they are non-responsive to most requests and 2) there's a bigger issue to consider:

The only counterargument that I'd consider valid is "other people are relying on this wrong, misleading definition, and we want to make sure our stuff matches up to what exists already". I get this, since you have to invest a huge amount of effort to convince people to change (as I am doing now, and already getting some fatigue) in addition to having to work very hard across many external places to enact the change you want to see. This argument basically keeps any large community effort from making improvements and moving forwards once there are external stakeholders and people who rely on its bugs as features. See relevant XKCD: https://xkcd.com/1172/.

All of this being said, I think the Bioregistry is an opportunity to right some of the wrongs (which I'll stress all had good intentions at the time) of the last 20 years of biomedical semantics. I will very strongly stress this point at the workshop on the 29th of October (anyone who might be reading this that wants to join, please reach out if I didn't send you an email about it already). Hopefully we can iron out all of the vocabulary we use to talk about this stuff and then convince everyone that we can move forwards and pull the trigger on #213.

That all being said, I think the Bioregistry is also able to support doing it the more correct way and the old way at the same time due to the tight coupling of code and data. There are two respective functions that let you create a "correct" local identifier and also a "legacy" local identifier. Neither of these is directly exposed through the top-level interface, but if people really want to be able to use them, then it would be pretty trivial to make two functions bioregistry.normalize_identifier and bioregistry.legacize_identifier (excuse the awful reification of "legacy").
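A minimal sketch of what such a pair of functions might do is given below. The function names come from the proposal above, but the bodies and the `EMBEDDED_NAMESPACES` table are assumptions made for illustration; this is not the Bioregistry's implementation.

```python
# Hypothetical table: prefix -> the redundant text that the legacy style
# embeds at the start of the local identifier.
EMBEDDED_NAMESPACES = {
    "doid": "DOID:",
    "chebi": "CHEBI:",
}


def normalize_identifier(prefix: str, identifier: str) -> str:
    """Return the bare local identifier, stripping an embedded namespace if present."""
    embedded = EMBEDDED_NAMESPACES.get(prefix.lower())
    if embedded and identifier.upper().startswith(embedded):
        return identifier[len(embedded):]
    return identifier


def legacize_identifier(prefix: str, identifier: str) -> str:
    """Return the legacy-style identifier, adding the embedded namespace if missing."""
    embedded = EMBEDDED_NAMESPACES.get(prefix.lower())
    if embedded and not identifier.upper().startswith(embedded):
        return embedded + identifier
    return identifier
```

Both functions are idempotent, so callers can apply them to identifiers of unknown style: `normalize_identifier("doid", "DOID:0110974")` and `normalize_identifier("doid", "0110974")` give the same bare result, and prefixes without an entry in the table (such as `efo`) pass through unchanged.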

bgyori commented 3 years ago

I'm not sure that there should be pressure to "right some kind of wrong". At the end of the day, the key is for there to be a machine-readable standard shared by a large number of systems. From the perspective of software systems the style of identifiers really doesn't matter as long as there is a formal pattern that can be validated against. Most systems have taken the MIRIAM standard as reference (at least for prefixes covered there) and I believe it's difficult and disruptive to abandon that (obviously, Bioregistry significantly expands on the space of prefixes and fixes obvious mistakes but a stylistic change in the standard is not the same thing). I can certainly say that it would be a major effort and disruption to change identifier patterns in running code and persisted artifacts in the systems that I've developed, and at the end of the day there would be zero value created by the whole exercise.

bgyori commented 3 years ago

As a follow-up, what about identifier patterns where the primary provider of the identifier (not some third-party aggregator) clearly makes the namespace part of all their patterns, like CHEBI (see screenshot)? It doesn't seem to me like one should be in the business of arbitrarily overriding the identifiers as provided by the original source due to stylistic preferences.

cthoyt commented 3 years ago

Across EBI resources, ID is often used to mean "compact identifiers", which in EBI vocabulary is synonymous with compact URIs (this is also what they call CURIEs on the Identifiers.org site). The fact that they're not using an unambiguous term, like local unique identifier (LUI), leaves this up to interpretation.

Looking deeper in ChEBI's database flatfile dumps: it uses only the integer-like identifiers (see https://ftp.ebi.ac.uk/pub/databases/chebi/Flat_file_tab_delimited/compounds.tsv.gz, https://ftp.ebi.ac.uk/pub/databases/chebi/Flat_file_tab_delimited/names.tsv.gz, or others in https://ftp.ebi.ac.uk/pub/databases/chebi/Flat_file_tab_delimited). Example from the names (after I gunzipped it):

[screenshot]

It also seems to be the case that they're using an integer datatype in the SQL database schema for ChEBI local identifiers (see https://ftp.ebi.ac.uk/pub/databases/chebi/generic_dumps/pgsql_create_tables.sql). However, I didn't look directly into this to see if that primary key matches for each row to the actual ChEBI local identifier that the record represents, or if it's hiding in some other table.

In the OBO text file format for ontologies, the id: element refers to a compact URI, where the prefix is either assumed to be a valid prefix in the OBO PURL system or defined with an idspace: entry in the header of the OBO file. In the ChEBI OBO file, it uses CURIEs that look like CHEBI:138488, meaning that CHEBI is the prefix and 138488 is the local identifier. If the local identifier itself were CHEBI:138488, then the line would be expected to read id: CHEBI:CHEBI:138488.

[screenshot]
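The reading of the `id:` line described above can be sketched as follows. The helper `parse_obo_id_line` is hypothetical and handles only this one tag; it is not a full OBO parser.

```python
from typing import Tuple


def parse_obo_id_line(line: str) -> Tuple[str, str]:
    """Split an OBO 'id: PREFIX:LOCAL' line into (prefix, local identifier)."""
    tag, _, value = line.partition(":")
    if tag.strip() != "id":
        raise ValueError(f"not an id line: {line!r}")
    # Split the CURIE at its first colon: the left side is the prefix,
    # everything to the right is the local identifier
    prefix, sep, local_id = value.strip().partition(":")
    if not sep:
        raise ValueError(f"value is not a CURIE: {value!r}")
    return prefix, local_id
```

Applied to `id: CHEBI:138488` this yields the prefix `CHEBI` and the local identifier `138488`. A local identifier of `CHEBI:138488` would only come out of the doubled line `id: CHEBI:CHEBI:138488`, which is not what the ChEBI OBO file contains.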

Finally, I opened up the lite version of the ontology in OWL format encoded in XML (this one just doesn't have axioms, which are irrelevant for this discussion). You can see that the OBO Library PURL http://purl.obolibrary.org/obo/CHEBI_138488 also is constructed where 138488 is the local identifier and not CHEBI:138488. Further down, you can see the artifacts of the "oboInOwl" schema where CURIEs themselves are represented as strings instead of using native IRIs (note that OWL is almost always using IRIs in the objects of triples, which are encoded in XML as the attribute values such as http://purl.obolibrary.org/obo/CHEBI_23000 in <rdfs:subClassOf rdf:resource="http://purl.obolibrary.org/obo/CHEBI_23000"/>). Barring conversation about how this is not super great, at least these follow the same patterns as mentioned before.

<!-- http://purl.obolibrary.org/obo/CHEBI_138488 -->

    <owl:Class rdf:about="http://purl.obolibrary.org/obo/CHEBI_138488">
        <rdfs:subClassOf rdf:resource="http://purl.obolibrary.org/obo/CHEBI_23000"/>
        <rdfs:subClassOf rdf:resource="http://purl.obolibrary.org/obo/CHEBI_35716"/>
        <rdfs:subClassOf rdf:resource="http://purl.obolibrary.org/obo/CHEBI_38163"/>
        <rdfs:subClassOf>
            <owl:Restriction>
                <owl:onProperty rdf:resource="http://purl.obolibrary.org/obo/RO_0000087"/>
                <owl:someValuesFrom rdf:resource="http://purl.obolibrary.org/obo/CHEBI_35610"/>
            </owl:Restriction>
        </rdfs:subClassOf>
        <rdfs:subClassOf>
            <owl:Restriction>
                <owl:onProperty rdf:resource="http://purl.obolibrary.org/obo/RO_0000087"/>
                <owl:someValuesFrom rdf:resource="http://purl.obolibrary.org/obo/CHEBI_50925"/>
            </owl:Restriction>
        </rdfs:subClassOf>
        <rdfs:subClassOf>
            <owl:Restriction>
                <owl:onProperty rdf:resource="http://purl.obolibrary.org/obo/RO_0000087"/>
                <owl:someValuesFrom rdf:resource="http://purl.obolibrary.org/obo/CHEBI_64947"/>
            </owl:Restriction>
        </rdfs:subClassOf>
        <rdfs:subClassOf>
            <owl:Restriction>
                <owl:onProperty rdf:resource="http://purl.obolibrary.org/obo/RO_0000087"/>
                <owl:someValuesFrom rdf:resource="http://purl.obolibrary.org/obo/CHEBI_68495"/>
            </owl:Restriction>
        </rdfs:subClassOf>
        <rdfs:subClassOf>
            <owl:Restriction>
                <owl:onProperty rdf:resource="http://purl.obolibrary.org/obo/RO_0000087"/>
                <owl:someValuesFrom rdf:resource="http://purl.obolibrary.org/obo/CHEBI_82665"/>
            </owl:Restriction>
        </rdfs:subClassOf>
        <rdfs:subClassOf>
            <owl:Restriction>
                <owl:onProperty rdf:resource="http://purl.obolibrary.org/obo/RO_0000087"/>
                <owl:someValuesFrom rdf:resource="http://purl.obolibrary.org/obo/CHEBI_91092"/>
            </owl:Restriction>
        </rdfs:subClassOf>
        <rdfs:subClassOf>
            <owl:Restriction>
                <owl:onProperty rdf:resource="http://purl.obolibrary.org/obo/chebi#has_functional_parent"/>
                <owl:someValuesFrom rdf:resource="http://purl.obolibrary.org/obo/CHEBI_138487"/>
            </owl:Restriction>
        </rdfs:subClassOf>
        <obo:IAO_0000115 rdf:datatype="http://www.w3.org/2001/XMLSchema#string">An organic heterotetracyclic compound that is 1,3-dihydro-2H-1-benzazepin-2-one which shares its 4-5 bond with the 3-2 bond of 5-nitro-1H-indole.</obo:IAO_0000115>
        <oboInOwl:hasAlternativeId rdf:datatype="http://www.w3.org/2001/XMLSchema#string">CHEBI:40939</oboInOwl:hasAlternativeId>
        <oboInOwl:hasOBONamespace rdf:datatype="http://www.w3.org/2001/XMLSchema#string">chebi_ontology</oboInOwl:hasOBONamespace>
        <oboInOwl:id rdf:datatype="http://www.w3.org/2001/XMLSchema#string">CHEBI:138488</oboInOwl:id>
        <oboInOwl:inSubset rdf:resource="http://purl.obolibrary.org/obo/chebi#3_STAR"/>
        <rdfs:label rdf:datatype="http://www.w3.org/2001/XMLSchema#string">alsterpaullone</rdfs:label>
    </owl:Class>
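The same decomposition can be read off the OBO Library PURLs in the snippet above. The following is a minimal sketch assuming alphabetic prefixes and numeric local identifiers, which holds for the CHEBI and RO terms shown here but is a simplification in general; `parse_obo_purl` and `purl_to_curie` are hypothetical helpers.

```python
import re
from typing import Tuple

OBO_PURL_BASE = "http://purl.obolibrary.org/obo/"


def parse_obo_purl(iri: str) -> Tuple[str, str]:
    """Split an OBO Library PURL into (prefix, local identifier)."""
    if not iri.startswith(OBO_PURL_BASE):
        raise ValueError(f"not an OBO PURL: {iri!r}")
    remainder = iri[len(OBO_PURL_BASE):]
    # Assumption: alphabetic prefix, underscore, numeric local identifier
    match = re.fullmatch(r"([A-Za-z]+)_(\d+)", remainder)
    if match is None:
        raise ValueError(f"unexpected PURL shape: {iri!r}")
    return match.group(1), match.group(2)


def purl_to_curie(iri: str) -> str:
    """Build the CURIE for an OBO PURL from its prefix and local identifier."""
    prefix, local_id = parse_obo_purl(iri)
    return f"{prefix}:{local_id}"
```

For `http://purl.obolibrary.org/obo/CHEBI_138488` this gives `CHEBI:138488`, matching the `oboInOwl:id` value in the snippet, with `138488` as the local identifier. IRIs such as `http://purl.obolibrary.org/obo/chebi#3_STAR` do not fit the term pattern and are rejected.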

All of this is just long form to say: the annotation of the regular expression for the local unique identifiers in ChEBI in Identifiers.org is a misnomer (i.e., the entry on https://registry.identifiers.org/registry/chebi for Local Unique Identifier (LUI) pattern says ^CHEBI:\d+$), and generating CURIEs with double prefixes doesn't make sense. Now I'm wondering how many other people are constructing CURIEs with double prefixes in this style, and how they also came to doing it like this.

cmungall commented 3 years ago

Short version is that in OBO world, they consider entire CURIEs as local identifiers in the OBO namespace, and this nonsense has propagated far and wide, causing chaos and confusion for everyone (even reaching outside of the OBO Foundry to the EFO)

This isn't true!!!

The docs here go back to 2008:

http://wiki.geneontology.org/index.php/Identifiers

We have always tried to be very clear about our nomenclature, distinguishing between a prefixed ID such as GO:0008150, the prefix (GO), and the local part of the ID (0008150), consistent with W3C terminology.

I think a lot of confusion was introduced when ontologies were brought into identifiers.org due to the culture clash between OWL people who had one terminology for talking about identifiers and bioinformatics people who had another.

History aside, I agree with @cthoyt's analysis from 17 days ago.

However, I do think it is useful to have a flag that indicates a preferred human-friendly rendering of an identifier. Some databases have redundancy with the prefix in the local part (e.g. FlyBase gene IDs), and rendering of the local part alone is preferred in many contexts. But an OBO ID would never be rendered with just the local part. I think retaining this stylistic preference will help avoid MGI:MGI:nnn disasters.
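One way to picture such a rendering flag is sketched below. The `Namespace` class, the `render_local_only` field, and the FlyBase-style identifier are all assumptions made for illustration; they do not describe an existing Bioregistry field.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Namespace:
    prefix: str
    # Hypothetical flag: True when the local part alone is the preferred
    # human-friendly rendering
    render_local_only: bool


def render(namespace: Namespace, local_id: str) -> str:
    """Render an identifier for display according to the namespace's preference."""
    if namespace.render_local_only:
        return local_id
    return f"{namespace.prefix}:{local_id}"


go = Namespace(prefix="GO", render_local_only=False)
flybase = Namespace(prefix="FlyBase", render_local_only=True)

go_label = render(go, "0008150")
# Illustrative FlyBase-style gene identifier, used only as an example value
flybase_label = render(flybase, "FBgn0000001")
```

The flag affects display only: the stored local identifier stays the same in both cases, so the rendering preference never has to leak into the validation pattern.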

cthoyt commented 2 years ago

Thanks everyone for all of the great discussion. The ultimate decision was to address this with #213, where all of the regular expressions in MIRIAM with redundant namespace information in the pattern were overridden, and utility functions were renamed to better represent that the Bioregistry is specifically interested in LUIs.