biopragmatics / bioregistry

📮 An integrative registry of biological databases, ontologies, and nomenclatures.
https://bioregistry.io
MIT License

Prefix Policies #158

Open cthoyt opened 3 years ago

cthoyt commented 3 years ago

This is a placeholder issue to discuss prefix orthography and typography, which are fancy words describing how prefixes should look and feel.

Minimum Prefix Requirements

Subnamespacing

See #133

Handling Collisions

Removal of Prefixes

Typically, prefixes should not be removed from the Bioregistry, even if they correspond to subsumed, abandoned, or dead resources, because the Bioregistry also serves as a historical archive and reference for anyone who runs into legacy prefixes in legacy resources. Prefix collisions have rarely occurred. One example is the Gene Expression Omnibus vs. the Geographical Entity Ontology, both of which are maintained. Another is the disease class annotation (a legacy classification from the Disease Ontology) vs. Dublin Core, where one is obviously more important than the other.

Choosing a Good Prefix

See also the OBO Foundry policy on choosing a prefix (i.e., IDSPACE) at http://obofoundry.org/id-policy.html

Who can request a prefix

Minimum Metadata Requirements

The required fields are maintained in the new prefix issue template including the following:

Optional Metadata

The Pydantic model bioregistry.schema.struct.Resource lists most fields as optional because it relies on dispatch to externally mapped data from Identifiers.org, OBO Foundry, etc.
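To illustrate why most fields can stay optional, here is a minimal sketch of the fallback idea. It is not the actual bioregistry.schema.struct.Resource model: the class ResourceSketch, its fields, the get_name method, and the registry key "external_registry" are all hypothetical names chosen for this example, and it uses a plain dataclass rather than Pydantic.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ResourceSketch:
    """Hypothetical stand-in for a registry record (not the real model)."""

    prefix: str
    # Curated values are optional because an external registry may supply them.
    name: Optional[str] = None
    homepage: Optional[str] = None
    # Hypothetical layout: external registry key -> metadata from that registry.
    mappings: dict = field(default_factory=dict)

    def get_name(self) -> Optional[str]:
        """Prefer the curated name, then fall back to any mapped registry."""
        if self.name is not None:
            return self.name
        for external in self.mappings.values():
            if external.get("name"):
                return external["name"]
        return None


# Only the prefix is curated; the name comes from the mapped record.
inherited = ResourceSketch(
    prefix="example",
    mappings={"external_registry": {"name": "Example Database"}},
)

# A curated name takes precedence over the mapped one.
curated = ResourceSketch(
    prefix="example",
    name="Curated Example",
    mappings={"external_registry": {"name": "Example Database"}},
)

print(inherited.get_name())
print(curated.get_name())
```

Under this assumption, a record needs only a prefix to be useful, and curators fill in a field only when they want to override or supplement what the external sources provide.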

TODO text descriptions of the rest

bgyori commented 3 years ago

These look great to me! Why do you think the "New prefixes must start with a letter." constraint is necessary? There are some resources whose name happens to start with a number and in those cases I think it's fine if the prefix does too. In terms of removal of prefixes, if there is no way to resolve a prefix since the underlying resource is unavailable, is there a way to curate that status?

cthoyt commented 3 years ago

Why do you think the "New prefixes must start with a letter." constraint is necessary? There are some resources whose name happens to start with a number and in those cases I think it's fine if the prefix does too.

@bgyori you're right, there are already three prefixes that start with a number:

I'm not completely against allowing numbers. I was thinking that in programming languages, an identifier can't start with a number, but prefixes don't really have the same semantics. However, I think it should remain that a prefix cannot start with a dot.

On the other hand, I don't think there's an issue with imposing new standards on future prefixes.

In terms of removal of prefixes, if there is no way to resolve a prefix since the underlying resource is unavailable, is there a way to curate that status?

There is a field, deprecated, that's optional for all prefixes. I guess I should make an enumeration of optional fields and what they're for to include in the policy here as well.

bgyori commented 3 years ago

I agree with all of that, thanks!

dhimmel commented 3 years ago

Multiple prefixes will not be issued for multiple versions of a resource

What about ICD9 / ICD10 / ICD11 where multiple prefixes make sense because no identifiers are shared between releases? Does this policy need more nuance on what version refers to?

cthoyt commented 3 years ago

Multiple prefixes will not be issued for multiple versions of a resource

What about ICD9 / ICD10 / ICD11 where multiple prefixes make sense because no identifiers are shared between releases? Does this policy need more nuance on what version refers to?

Excellent point - I would go as far as to say ICD9, ICD10, and ICD11 are completely different resources. However, based on the name, it's not fair to assume this is obvious unless people have domain knowledge. What do you think we should call the difference between the ICD's and just different releases of the same database?

Even further, the fact that ICD by itself refers to ICD10 is a bad thing and should be discouraged. This is a remnant of Identifiers.org's curation choices, but it is also pretty consistent across resources like EFO and DOID.

dhimmel commented 3 years ago

What do you think we should call the difference between the ICD's and just different releases of the same database?

Maybe "disjoint releases" meaning that no identifiers are shared across releases and then give the example of ICDs.
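One way to make the "disjoint releases" idea concrete is a pairwise check that no identifier appears in more than one release. This is only a sketch: the function name are_disjoint and all identifiers below are made-up placeholders, not real ICD codes or anything from the Bioregistry.

```python
from itertools import combinations


def are_disjoint(releases):
    """Return True if no identifier is shared between any two releases.

    `releases` maps a release label to the set of identifiers it contains.
    """
    return all(
        first.isdisjoint(second)
        for first, second in combinations(releases.values(), 2)
    )


# Placeholder data: two releases with entirely separate identifier schemes,
# which is the situation described for the ICD family.
disjoint_example = {
    "release-a": {"001.1", "002.5"},
    "release-b": {"A01", "B27"},
}

# Placeholder data: ordinary versions of one database, where later releases
# keep the earlier identifiers and add new ones.
versioned_example = {
    "2020": {"X0001", "X0002"},
    "2021": {"X0001", "X0002", "X0003"},
}

print(are_disjoint(disjoint_example))
print(are_disjoint(versioned_example))
```

With a check like this, separate prefixes would be justified only when it returns True; otherwise the releases are treated as versions of a single resource under one prefix.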

the fact that ICD by itself refers to ICD10 is a bad thing and should be discouraged

Agree. Bound to cause big problems when ICD-11 adoption increases.

  • New prefixes must validate against the following regular expression: ^[a-z][a-z0-9]+(\.[a-z][a-z0-9]+?)$

That pattern is broken. The ? for the optional subspace needs to be moved outside the group (so it applies to the whole group). The . needs to be escaped. How about ^[a-z][a-z0-9]+(\.[a-z][a-z0-9]+)?$?
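The difference between the two patterns can be checked with a short script. This sketch uses Python's re module; the sample strings are illustrative inputs picked for this example ("3dexample" in particular is a made-up prefix), not a statement about what the registry contains.

```python
import re

# Pattern as quoted above: the group is mandatory and the lazy `+?` sits
# inside it, so a prefix without a subspace can never match.
BROKEN = re.compile(r"^[a-z][a-z0-9]+(\.[a-z][a-z0-9]+?)$")

# Proposed fix: the `?` follows the group, making the whole subspace optional.
PROPOSED = re.compile(r"^[a-z][a-z0-9]+(\.[a-z][a-z0-9]+)?$")

# Illustrative inputs: plain prefix, subspaced prefix, leading digit,
# leading dot, and a single character.
SAMPLES = ["chebi", "kegg.compound", "3dexample", ".chebi", "a"]


def is_valid(pattern, prefix):
    """Return True if the prefix matches the given compiled pattern."""
    return pattern.match(prefix) is not None


# Map each sample to (matches broken pattern, matches proposed pattern).
results = {
    prefix: (is_valid(BROKEN, prefix), is_valid(PROPOSED, prefix))
    for prefix in SAMPLES
}

for prefix, (broken_ok, proposed_ok) in results.items():
    print(f"{prefix!r}: broken={broken_ok} proposed={proposed_ok}")
```

The quoted pattern rejects a plain prefix such as "chebi" because it always demands a dot-separated subspace, while the proposed pattern accepts both forms. Both patterns still reject a leading digit or dot, and both require at least two characters per segment.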