Prefix Policies - Githubissues

cthoyt commented 3 years ago

This is a placeholder issue to discuss prefix orthography and typography, which are fancy words describing how prefixes should look and feel.

Minimum Prefix Requirements

New prefixes are allowed to contain letters [a-z], numbers [0-9], and a single dot . if a subspace is requested.
New prefixes must start with a letter.
New prefixes must be at least two characters. Ideally, prefixes should be three or more characters for legibility.
Subspaces must start with a letter.
Subspaces must be at least two characters. Ideally, subspaces should be three or more characters for legibility.
New prefixes must be lowercase. However, lexical variants can be stored as synonyms for reference (e.g., FBbt).
New prefixes must validate against the following regular expression: ^[a-z][a-z0-9]+(\.[a-z][a-z0-9]+?)$
New prefixes must not be on the Bioregistry's prefix blacklist, which contains a small number of enumerated over-general prefixes (e.g., gene, covid) and a list of prefixes that would cause bugs in the web service (e.g., overview, registry, downloads).
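
The rules above can be checked one at a time without relying on the regular expression. The following is only a minimal sketch: the function name `check_prefix`, the `BLOCKED` set (filled with just the examples named above), and the wording of the messages are illustrative assumptions, not the Bioregistry's actual implementation.

```python
import string

LETTERS = set(string.ascii_lowercase)
ALLOWED = LETTERS | set(string.digits)
# Assumption: only the examples named in the policy, not the real list.
BLOCKED = {"gene", "covid", "overview", "registry", "downloads"}


def check_prefix(prefix: str) -> list:
    """Return the list of violated rules; an empty list means acceptable."""
    problems = []
    if prefix != prefix.lower():
        problems.append("must be lowercase")
    # Check the remaining rules on the lowercased form so that a case
    # problem is reported only once.
    parts = prefix.lower().split(".")
    if len(parts) > 2:
        problems.append("at most one dot is allowed")
    for i, part in enumerate(parts):
        label = "prefix" if i == 0 else "subspace"
        if len(part) < 2:
            problems.append(f"{label} must be at least two characters")
        if part[:1] not in LETTERS:
            problems.append(f"{label} must start with a letter")
        if not set(part) <= ALLOWED:
            problems.append(f"{label} may only contain a-z and 0-9")
    if prefix in BLOCKED:
        problems.append("is on the blocklist")
    return problems


for candidate in ["chebi", "hgnc.gene", "FBbt", "gene", "mesh.2012"]:
    print(candidate, check_prefix(candidate))
```

Returning every violated rule, rather than stopping at the first, lets a reviewer give a requester all the feedback in one pass.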

Subnamespacing

See #133

Handling Collisions

New prefixes must not collide with any canonical prefixes, preferred prefixes, synonyms, or normalized variants thereof.
The Bioregistry tracks collisions caused by heterogeneity among external registries; please open an issue to start a discussion.
If a new contributor wants to register a prefix that is already present in the Bioregistry, then precedence will be given to the already existing prefix and the contributor will be asked to choose a different prefix (i.e., first come, first served).
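
A collision check along these lines could compare normalized forms of every known label. This is a sketch under stated assumptions: the normalization in `norm` (lowercasing and dropping a few separators) is a guess at what "normalized variants" means, and the `REGISTRY` entries, including the hypothetical `exampledb`, are toy data rather than real Bioregistry records.

```python
def norm(label: str) -> str:
    """Lowercase and drop common separators (an assumed normalization)."""
    return "".join(c for c in label.lower() if c not in " .-_/")


# Toy registry: canonical prefix -> preferred prefix and synonyms.
REGISTRY = {
    "fbbt": {"preferred": "FBbt", "synonyms": []},
    "exampledb": {"preferred": "ExampleDB", "synonyms": ["EX-DB"]},
}


def find_collision(candidate: str, registry: dict):
    """Return the canonical prefix the candidate collides with, or None."""
    target = norm(candidate)
    for canonical, entry in registry.items():
        labels = [canonical, entry.get("preferred"), *entry.get("synonyms", [])]
        for label in labels:
            if label and norm(label) == target:
                return canonical
    return None


print(find_collision("ex_db", REGISTRY))  # collides via the synonym EX-DB
print(find_collision("newdb", REGISTRY))  # no collision
```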

Removal of Prefixes

Typically, prefixes should not be removed from the Bioregistry, even if they correspond to subsumed, abandoned, or dead resources, because the Bioregistry is also a historical archive and reference for anyone who might run into legacy prefixes in legacy resources. Prefix collisions have rarely happened. One example is the Gene Expression Omnibus vs. the Geographical Entity Ontology, which are both maintained. Another example is the disease class annotation (a legacy classification from the Disease Ontology) and Dublin Core, where one is obviously more important than the other.

Choosing a Good Prefix

Prefixes should be chosen in a way that minimizes confusion.
Prefixes should correspond to the name of the resource they are minted for. Most commonly, people use acronyms.
Multiple prefixes will not be issued for multiple versions of a resource (e.g., the fact that there is a mesh.2012 and mesh.2013 prefix registered in Identifiers.org was a huge mistake and causes massive confusion)
Prefixes should not be too generic or common entity types like gene or chemical. Reviewers will use their best judgement since it's hard to list all possible generic entity types. For example, gene would be bad while hgnc.gene would be better.
Subspacing should not be used unnecessarily, i.e., when a nomenclature only has one entity type. For example, chebi.chemical would be bad while chebi would be better.
Prefixes should not end in "O" for "Ontology", "T" for "Terminology" or any letters denoting related words about vocabularies

See also the OBO Foundry policy on choosing a prefix (i.e., IDSPACE) at http://obofoundry.org/id-policy.html

Who can request a prefix

A prefix can be requested by anyone, even if it is for a resource they do not themselves maintain. Obviously, it is easier for a nomenclature's maintainer to provide the metadata required to go with a given prefix.

Minimum Metadata Requirements

The required fields are maintained in the new prefix issue template and include the following:

name
description
homepage
example local identifier
regular expression pattern
TODO requester's name and ORCID identifier
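
One useful consistency check on these fields is that the example local identifier matches the regular expression pattern. The sketch below assumes a submission arrives as a plain dictionary; the key names (`example`, `pattern`, and so on) and the `exampledb` record are illustrative, not the actual issue template or schema.

```python
import re

# Assumed key names for the required fields listed above.
REQUIRED = ("name", "description", "homepage", "example", "pattern")


def check_submission(record: dict) -> list:
    """Return problems found in a submission; an empty list means none."""
    problems = [
        f"missing required field: {key}" for key in REQUIRED if not record.get(key)
    ]
    example, pattern = record.get("example"), record.get("pattern")
    if example and pattern:
        try:
            if re.fullmatch(pattern, example) is None:
                problems.append("example does not match pattern")
        except re.error:
            problems.append("pattern is not a valid regular expression")
    return problems


# Hypothetical submission used only for illustration.
submission = {
    "name": "Example Database",
    "description": "A made-up resource.",
    "homepage": "https://example.org",
    "example": "0001234",
    "pattern": r"^\d{7}$",
}
print(check_submission(submission))
```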

Optional Metadata

The Pydantic model bioregistry.schema.struct.Resource lists most fields as optional because it relies on dispatch to other externally mapped data from Identifiers.org, the OBO Foundry, etc.

Contact email/name for resource (highly recommended)
deprecation status (bool)

TODO text descriptions of the rest

bgyori commented 3 years ago

These look great to me! Why do you think the "New prefixes must start with a letter." constraint is necessary? There are some resources whose name happens to start with a number and in those cases I think it's fine if the prefix does too. In terms of removal of prefixes, if there is no way to resolve a prefix since the underlying resource is unavailable, is there a way to curate that status?

cthoyt commented 3 years ago

Why do you think the "New prefixes must start with a letter." constraint is necessary? There are some resources whose name happens to start with a number and in those cases I think it's fine if the prefix does too.

@bgyori you're right, there are already three prefixes that start with a number:

I'm not completely against allowing numbers. I was thinking that in programming languages, an identifier starting with a number wouldn't be valid, but prefixes don't really have the same semantics. However, I think it should remain that a prefix cannot start with a dot.

On the other hand, I don't think there's an issue with imposing new standards on future prefixes.

In terms of removal of prefixes, if there is no way to resolve a prefix since the underlying resource is unavailable, is there a way to curate that status?

There is a field, deprecated, that's optional for all prefixes. I guess I should make an enumeration of optional fields and what they're for to include in the policy here as well.

bgyori commented 3 years ago

I agree with all of that, thanks!

dhimmel commented 3 years ago

Multiple prefixes will not be issued for multiple versions of a resource

What about ICD9 / ICD10 / ICD11 where multiple prefixes make sense because no identifiers are shared between releases? Does this policy need more nuance on what version refers to?

cthoyt commented 3 years ago

Multiple prefixes will not be issued for multiple versions of a resource

What about ICD9 / ICD10 / ICD11 where multiple prefixes make sense because no identifiers are shared between releases? Does this policy need more nuance on what version refers to?

Excellent point - I would go as far as to say ICD9, ICD10, and ICD11 are completely different resources. However, based on the name, it's not fair to assume this is obvious unless people have domain knowledge. What do you think we should call the difference between the ICDs and just different releases of the same database?

Even further, the fact that ICD by itself refers to ICD10 is a bad thing and should be discouraged. This is a remnant of Identifiers.org's curation choices, but it is also pretty consistent throughout resources like EFO and DOID.

dhimmel commented 3 years ago

What do you think we should call the difference between the ICDs and just different releases of the same database?

Maybe "disjoint releases" meaning that no identifiers are shared across releases and then give the example of ICDs.
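
As a rough illustration of that definition, two releases are disjoint when their identifier sets have no element in common. The identifier sets below are toy stand-ins chosen for the sketch, not real release contents, and `are_disjoint` is a hypothetical helper name.

```python
def are_disjoint(release_a, release_b) -> bool:
    """True if no identifier appears in both releases."""
    return set(release_a).isdisjoint(release_b)


# Toy data: two coding schemes with different identifier shapes ...
scheme_old = {"001", "002", "003"}
scheme_new = {"A00", "A01", "A02"}
# ... versus two ordinary releases of one database that mostly overlap.
db_2012 = {"D0001", "D0002"}
db_2013 = {"D0001", "D0002", "D0003"}

print(are_disjoint(scheme_old, scheme_new))  # disjoint: separate prefixes make sense
print(are_disjoint(db_2012, db_2013))        # overlapping: one prefix suffices
```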

the fact that ICD by itself refers to ICD10 is a bad thing and should be discouraged

Agree. Bound to cause big problems when ICD-11 adoption increases.

New prefixes must validate against the following regular expression: ^[a-z][a-z0-9]+(\.[a-z][a-z0-9]+?)$

That pattern is broken. The ? for the optional subspace needs to be moved outside the group (so it applies to the whole group). The . needs to be escaped. How about ^[a-z][a-z0-9]+(\.[a-z][a-z0-9]+)?$?
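
The difference between the two patterns is easy to see by running both against a few candidates. This is a small sketch using Python's standard `re` module; the names `ORIGINAL` and `PROPOSED` and the list of test prefixes are chosen here for illustration.

```python
import re

# Pattern as written in the policy: the subspace group is mandatory and
# the "?" only makes the final "+" lazy.
ORIGINAL = re.compile(r"^[a-z][a-z0-9]+(\.[a-z][a-z0-9]+?)$")
# Pattern proposed above: the "?" applies to the whole subspace group.
PROPOSED = re.compile(r"^[a-z][a-z0-9]+(\.[a-z][a-z0-9]+)?$")


def accepts(pattern: re.Pattern, prefix: str) -> bool:
    """True if the pattern matches the whole prefix."""
    return pattern.match(prefix) is not None


for prefix in ["chebi", "hgnc.gene", "go", "a", "FBbt", "mesh.2012"]:
    print(f"{prefix:10} original={accepts(ORIGINAL, prefix)!s:5} "
          f"proposed={accepts(PROPOSED, prefix)}")
```

With the original pattern, a plain prefix such as chebi is rejected because the dotted subspace is required; the proposed pattern accepts it while still rejecting uppercase letters, one-character prefixes, and subspaces that start with a digit.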

biopragmatics / bioregistry

Prefix Policies #158
