Restructure DataCite datacenters (repositories)

EZID DataCite shoulders are associated with a "datacenter" code (e.g., CDL.UCB), which is a requirement for registering identifiers with DataCite. DataCite currently refers to these datacenters as "repositories" and they are used in DataCite APIs and other services to identify DOIs associated with a particular repository, group, project, or organization.

DataCite's notion of datacenter/repository has gone through multiple iterations. Historically, it was more common for an organization like CDL to have a large number of datacenters, each typically associated with an individual user account rather than with any actual repository structure or grouping. When DataCite adjusted its fee model circa 2019 to base pricing primarily on datacenters, CDL consolidated its datacenters and rolled them up to the campus level (11 total, including CDL) to normalize our approach and rationalize the payment structure for our DataCite membership. In 2020/2021, the fee model changed again and CDL became a consortium. As part of this change, DataCite evolved the notion of datacenter into the current repository model, and pricing was no longer determined by repositories.

On the EZID side, we are still operating on a one-repository-per-campus model. This means that DOIs registered by individual repositories are not associated with a specific repository ID in DataCite's APIs and other services, which can inhibit some users' ability to track and identify research outputs. (A possible workaround is to search on the publisher field, affiliation, or prefix.)
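The workaround mentioned above can be sketched as a few search URLs. This is a minimal illustration, assuming the public DataCite REST API `/dois` endpoint and its `prefix`, `client-id`, `query`, and `page[size]` parameters; the prefix `10.5555` and the publisher name are placeholders, and the exact query field names should be checked against DataCite's current documentation. No network request is made.

```python
from urllib.parse import urlencode

# Public DataCite REST API endpoint for DOI searches (assumed stable).
DATACITE_API = "https://api.datacite.org/dois"


def doi_search_url(prefix=None, client_id=None, query=None, page_size=25):
    """Build a DataCite DOI search URL without sending a request."""
    params = {}
    if prefix:
        params["prefix"] = prefix
    if client_id:
        # Repository (datacenter) IDs are lowercased here, e.g. cdl.ucb.
        params["client-id"] = client_id.lower()
    if query:
        params["query"] = query
    params["page[size]"] = page_size
    return f"{DATACITE_API}?{urlencode(params)}"


# Today: everything registered under the campus-level repository.
by_repository = doi_search_url(client_id="CDL.UCB")
# Workarounds: narrow by prefix or by a publisher string (both placeholders).
by_prefix = doi_search_url(prefix="10.5555")
by_publisher = doi_search_url(query='publisher:"Example Research Group"')
```

With one repository per campus, only the first URL maps to a DataCite repository ID; the other two depend on metadata or prefix conventions being applied consistently.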

EZID has yet to "un-consolidate" its datacenters and align with DataCite's current repository-based structure for a few reasons: (1) the list of datacenters is hard-coded in EZID and can't be changed without significant development work, (2) the implications of losing the campus-based identifier need to be fully understood and explored, and (3) all existing DataCite user accounts in EZID need to be reviewed and investigated to determine how they would map to a new repository structure and the appropriate level of granularity (for example, should there be a single repository for NCEAS, or multiple repositories for different types of repositories/projects that NCEAS is involved in). This work is not trivial and there has not been sufficient bandwidth to take it on in recent years without undermining other work.

To summarize the state of affairs:

EZID's setup predates the current DataCite setup and has had to make adjustments along the way while DataCite worked out its pricing model and repository structure.
EZID currently uses one repository (datacenter) per campus. Within each campus, each project/department uses a different prefix.
Under EZID's current setup, we have multiple prefixes (shoulders) associated with each datacenter. A prefix can only have one datacenter, but a datacenter can have multiple prefixes.
DataCite would prefer (and at some point may require) that we move away from the one-repository-per-campus structure.
DataCite's repository concept does not naturally align with current practices across all of UC, so significant mapping/un-consolidation work would be required to adapt to this structure.
Dividing into multiple repositories per campus could have some benefits in terms of tracking outputs, but it could also undermine tracking at the campus level.
Based on DataCite's current pricing model, we can have an unlimited number of repositories per consortium organization without adversely impacting costs.
DataCite would prefer one prefix per repository, but we can still have multiple prefixes for a single repository on a case-by-case basis.
One unexpected limitation of retaining our current structure is that when DataCite reinstated the ability for consortium organizations to allocate prefixes, this change did not apply to us, so we still have to contact DataCite manually to request a new prefix.
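The prefix/datacenter relationship described in this summary can be sketched as a simple lookup. This is an illustration only, not EZID's actual data model: the shoulders and the `CDL.EXAMPLE` code are hypothetical, and only `CDL.UCB` comes from the text above.

```python
from collections import defaultdict

# Hypothetical shoulders; each one maps to exactly one datacenter.
SHOULDER_TO_DATACENTER = {
    "doi:10.5555/A1": "CDL.UCB",
    "doi:10.7777/B2": "CDL.UCB",
    "doi:10.6666/C3": "CDL.EXAMPLE",
}


def shoulders_by_datacenter(mapping):
    """Invert the mapping: one datacenter may own several shoulders."""
    grouped = defaultdict(list)
    for shoulder, datacenter in mapping.items():
        grouped[datacenter].append(shoulder)
    return {dc: sorted(items) for dc, items in grouped.items()}


def multi_prefix_datacenters(mapping):
    """List datacenters that span more than one DOI prefix."""
    prefixes = defaultdict(set)
    for shoulder, datacenter in mapping.items():
        # "doi:10.5555/A1" -> "10.5555"
        prefix = shoulder.partition(":")[2].split("/", 1)[0]
        prefixes[datacenter].add(prefix)
    return sorted(dc for dc, found in prefixes.items() if len(found) > 1)


grouped = shoulders_by_datacenter(SHOULDER_TO_DATACENTER)
needs_review = multi_prefix_datacenters(SHOULDER_TO_DATACENTER)
```

Because the dictionary key is the shoulder, a shoulder can never point at two datacenters, while `needs_review` flags the datacenters that would not fit DataCite's preferred one-prefix-per-repository arrangement.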

More granular datacenters are useful for DataCite Commons and other tracking. DataCite can help move DOIs when we're ready.

Key steps to move this forward

[ ] Summarize benefits and implications of un-consolidating our datacenters
[ ] Review current user activity on each campus and determine how each group/project would be mapped to a new repository-based structure within DataCite
[ ] Map out new repository plan and generate reports of which DOIs would be moved to which repositories
[ ] Change EZID configuration to allow additional datacenters to be created (e.g., via a new management command run when generating a new shoulder), and to allow DOIs to be registered with the new datacenters
[ ] Notify users of a cutoff time to allow DataCite to transfer DOIs to their new repositories
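The mapping and reporting steps above could start from something like the following sketch. It assumes a hand-maintained plan keyed by shoulder and a plain list of DOI strings as input; the shoulders, the proposed repository IDs, and the function names are all hypothetical and not part of EZID.

```python
from collections import Counter

# Hypothetical plan: shoulder -> (current datacenter, proposed repository).
PLAN = {
    "10.5555/A1": ("CDL.UCB", "CDL.UCB-LAB"),
    "10.5555/A1X": ("CDL.UCB", "CDL.UCB-MUSEUM"),
}


def match_shoulder(doi, plan):
    """Return the longest shoulder in the plan that the DOI starts with."""
    candidates = [s for s in plan if doi.upper().startswith(s.upper())]
    return max(candidates, key=len) if candidates else None


def move_report(dois, plan):
    """Count DOIs per (current, proposed) pair and collect unmapped DOIs."""
    moves = Counter()
    unmapped = []
    for doi in dois:
        shoulder = match_shoulder(doi, plan)
        if shoulder is None:
            unmapped.append(doi)
        else:
            moves[plan[shoulder]] += 1
    return moves, unmapped


sample = ["10.5555/A1234", "10.5555/A1X99", "10.9999/Z1"]
moves, unmapped = move_report(sample, PLAN)
```

Longest-match is used because one shoulder can be a string prefix of another; the unmapped list shows which DOIs still need a decision before any transfer is requested from DataCite.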

Potential questions to investigate

How will existing DOIs be affected?
What happens if a datacenter name changes? (what is DataCite’s policy/practice in this regard?)
What happens if a prefix needs to change to a new datacenter?
How can we leverage DataCite Commons to show outputs by repository as well as campus?

Notes from October 2022 conversation with DataCite

CDLUC3 / ezid-service

Restructure DataCite datacenters (repositories) #256