Registries of Good Practice project

Introduction
- What is a registry?
Goals
Communication and Collaboration
- Updates (reverse chronological order)
Principles
User Needs
Licensing & Copyright
Project Plan & Timeline
Prototypes
- The Format Aggregator
- The Digital Preservation Publications Index v1.0

Introduction

The "Registries of Good Practice" Project is a collaboration between the Digital Preservation Coalition (DPC) and Yale University Library (YUL). The formal announcement can be found here. The project will explore and develop different approaches to analyze, collate, present and, most importantly, make discoverable the many existing registries and collections of digital preservation good practice.

The high-level project plan and progress are openly accessible as a GitHub project.
The official DPC project page is here.

What is a registry?

The idea of a 'technical registry' is a bit of jargon with a long history in the field of digital preservation. The general idea was already well established at iPRES 2004:

Cooperation has always been essential in the digital preservation community regarding knowledge exchange and collaboration in research activities. As initiatives increasingly turn to implementation, cooperation also gains practical significance. Initiatives embark on collaboratively building services that are required by various preservation systems. This presentation addresses file format registries. The preservation community jointly calls for a register that identifies and documents file formats, to come to terms with the myriad of different file formats. Activities towards building a file format registry are emerging, as already some preservation initiatives rely on such a future service in their current approaches.

From File format registries - a global infrastructure for local persistence

An earlier 2003 publication suggests this terminology is inherited from the IANA Media Type Registry:

"The current MIME Media Types registry does not provide sufficient granularity of format typing or sufficient standardized representation information about formats."

From Towards a global digital format registry

But the idea of registries of reference information goes back much further, as indicated in this GOV.UK publication about what it takes to be a register and by the venerable registries referenced from there (like the UK Land Registry).

But for the purposes of this project, we're not interpreting this too strictly. Perhaps in the past we have focussed too much on the registry we need to build and rushed too far ahead. We want to focus on the community of digital preservation practitioners, and understand their needs and capabilities. We want to learn from the registries (and other information sources) that have been well-used and have stood the test of time. Anything that people can use to improve how they do digital preservation is in scope.

Goals

To explore the following issues:

What is the current ‘landscape’ of active and openly-accessible registries of things like format, software, tools and workflows, and how are they being used in practice? How do vendors integrate and use them? How does that compare with what practitioners do with them, and how registry maintainers build them? What works well? What’s not so good? How could things be improved?
How do practitioners work towards ‘good practice’, and could ‘registries of practice’ help? What methods do people use to improve their practices? How do we build on the work of others, and avoid reinventing the wheel? What are the good sources of information about practical digital preservation? How useful are the iPRES proceedings in this regard? How can we improve the discoverability of these kinds of resources?
What are the common practices used to build and maintain technical registries themselves? What are the constraints we’re operating under? How does that affect the kind of approaches we can take? What kinds of contributions are welcome, and through which channels?

The intended outcomes are:

An established Preservation Registries Special Interest Group (PR-SIG) where people can come together to talk about digital preservation registries.
A range of formal and informal publications documenting the current state of our registries.
New tools and services to help us improve our practices and maintain our registries, built with long-term maintenance and sustainability in mind.

Communication and Collaboration

If you'd like to talk to me about the project, or anything else related to digital preservation and web archiving, feel free to book a call with me.

If you want to get involved, you could:

Join the Preservation Registries Special Interest Group.
Look out for project updates via the DPC website and DPC social media channels (Mastodon, Twitter/X, LinkedIn).
Look out for blogs on technical details via Andy Jackson's personal blog and his social media channels (Mastodon, Twitter/X, LinkedIn).
Check the aforementioned GitHub project board.
Watch this GitHub repository.
Create and comment on GitHub issues, submit pull requests, etc.

Updates (reverse chronological order)

2024-02-23 - DPC Blog: Goals & Principles
2024-02-16 - Formal project announcement: New project launched to help practitioners discover digital preservation resources

Principles

While it’s not yet clear what the outputs of this project will be, we feel it’s important to establish some of the principles that we will use to guide our work. This isn’t an exhaustive list, and these rules may need to be refined over time. But we’ve seen a lot of registry projects come and go, and this is our attempt to articulate what we’ve learned from that, and how that will shape our work.

No New Registries

We’ve been involved in digital preservation for a long time, and have seen a lot of short-term project funding spent on grand visions of what preservation registries could be. While those research projects have been very valuable in helping us understand what might be possible, very few have successfully made the transition from prototype to production. We believe most of these projects underestimated the barriers to adoption, the effort required to fill these registries, and the difficulties involved in managing data over time (especially around maintaining parallel ‘forks’ of curated datasets).

This project is not about ‘owning’ or ‘solving’ the registry problem. Maintaining registries of technical information is difficult work and requires a commitment of time and skills that are in short supply across our sector. It is very important to us that whatever we do will respectfully highlight the long-term and ongoing efforts that people and organisations have put into the registries we all depend on.

Empower Others

This project will deliver one full-time person of effort over two short years, so how do we maximise the long-term returns from this brief pulse of funding? We believe that, rather than replicating the kind of work that is already being done, it will be better for everyone if we find ways of contributing that help empower the existing communities in and around preservation registries. Therefore, we will start by building a deeper understanding of ‘the registries problem’ from a practical point of view, and look for ways to help.

Can we build tools that will help registry maintainers work more quickly and confidently? Can we provide resources that will help grow the community of people who use and contribute to registries? Can we make it easier to learn from the past and avoid common pitfalls or reinvented wheels? How can we be a force-multiplier for good practice?

Be Realistic

While we want to be ambitious and try new things, we also have to be realistic about the limitations of this project: it’s one full-time equivalent person for two years. We believe we can make a significant contribution to the practice of digital preservation, but this will require clarity of purpose and focussed prioritisation.

Similarly, we understand there are many pressures and complicating factors that mean the teams running our registries are already overstretched. We will aim to support those teams, and to avoid encouraging unreasonable demands or expectations.

Be Useful

This might sound obvious, but over the years there have been a lot of “if you build it they will come” digital preservation projects that foundered because practitioners were unable to integrate those products into their work. Whatever we build must be driven by genuine user need and must work with or around the practical barriers that practitioners, registry maintainers and vendors face.

This is where wider community feedback is critically important. We can’t tell if we’re succeeding unless you let us know whether what we’re doing is actually useful!

Iterate Early, Often, and in the Open

It is not clear what the shared needs and common barriers are for the digital preservation community, and safe spaces for discussion are only part of the solution. To ensure we’re on the right track, any tools or resources we develop must be rapidly and openly iterated. Broadly speaking, rather than gathering formal requirements, we will focus on generating experimental prototypes to probe the issues and provoke discussion. The biggest benefit of time-limited project funding is the opportunity to fail, and to learn from the failures.

Build on the Work of Others

There is so much great work out there, and we want to learn from it, build on it, and shout about it from the rooftops. Not just the preservation registries, but also many different resources and tools from a range of institutions and individuals. Publicising the good work of others will be critical to our success.

Our biggest fear is accidentally leaving someone out! There’s a lot going on and the person doing most of the work on this project has been heavily focussed on web archiving in recent years. Please don’t be offended if we seem to miss you out, and please don’t assume anything is obvious to us! Please get in touch!

Make it Easy to Maintain

While we hope to be able to continue at least some of this work beyond this initial two years, there are no guarantees. Therefore, it is critical that the output of the project is something that the DPC and the wider digital preservation community can maintain.

At first, while exploring and experimenting, we can relax this constraint a little, but it will always be borne in mind that the final results cannot be something that requires a lot of complex infrastructure or frequent maintenance. Quite what this means is also unclear at this point, but it’s safe to say it’s more like minimal computing and less like ChatGPT.

User Needs

This section is a very rough early draft.

We want the outputs of the project to be useful, so we want to focus clearly on user needs. This is in part based on the GOV.UK advice on user research, with the possibility of future Wardley mapping of the technology landscape being kept in mind.

Based on our previous experience, we're avoiding leaping immediately to user requirements. We have found that attempts to enumerate requirements up front tend to get lost in the details and become overloaded with discordant expectations and unvoiced assumptions. We want to start by thinking about needs and capabilities in context, before getting into user stories as ways of framing requirements.

Roles

To make sure the needs are clear, we first need to identify the different user roles. Note that individual users may act in more than one of these roles.

Patrons
- The people who we are doing all this preservation for, now and in the future.
- In OAIS terms, the Consumers of the material that is in the Archive.
Creators
- The people who create what we preserve.
- This includes any people who are in some sense 'in' what we preserve. e.g. people's information, or stories, or data.
- In OAIS terms, this roughly corresponds to the Producers of the material that goes into the Archive, except that OAIS doesn't explicitly consider the people 'in' the Archive.
Custodians
- Whoever has overall responsibility for what is being preserved.
- Whoever has decision-making authority over how things are done.
- In OAIS terms, the Management, who define the policies the Archive operates under.
Practitioners
- Someone who does the work involved in digital preservation. Handling files, ingesting into repository systems, managing replicas, facilitating access, etc.
- They act according to the policies established by the Custodian.
- In OAIS terms, they are the Archive. Or at least, the human part of the Archive. But OAIS doesn't draw that distinction.
Registry Contributors
- People who find relevant information and do the analysis required to prepare it for inclusion in a registry.
Registry Maintainers
- The people who run the long-term registry infrastructure we depend on.
- They work with internal and external Registry Contributors to add and update the contents of the registry.
Tool Maintainers
- People who maintain the tools that Practitioners and Platform Providers depend on.
- Often open source projects.
Platform Providers
- People who provide services and systems to help do the work of digital preservation.
- Often involves re-packaging the registries from the Registry Maintainers and the tools from the Tool Maintainers.
- May be open source or proprietary commercial vendors, in-house teams, or a mixture of both.
Researchers
- The people researching new theories and practices of digital preservation.
Funders
- Organisations that fund digital preservation work.
- Should perhaps distinguish between ongoing versus time-limited project funding?

Journeys and Needs

We are currently working on understanding the user journeys and needs relating to practice and registries in digital preservation. At this stage, this work focuses on how things work at Yale University Library. We will share what we find, and look to work with other Practitioners to refine and grow the results of this work.

In parallel with that work, we are building prototypes exploring ways of answering some of the questions we think are likely to come up:

What can we find out about File Format X?
Who else has worked with File Format X, and what did they do?
Who else has worked on DigiPres Problem Y and how did it go?
What tools can I use to do DigiPres Action Z?

The work on user journeys should help establish the relevance of these questions, and where the Practitioners are when they start asking these questions...

Asking in search engines?
Starting with a file?
Starting from a repository system?
Starting with a script?
...

User Stories

As a Practitioner, I want to know what software is needed to access the items in my collection.
- Therefore, I want to know what formats are in the collection, and to be able to use that to find what software is needed.
- Therefore, I want to know which format registries and tools can help me analyse the formats in my collections.
  - As running all the tools is expensive/difficult, I want to start the process based on the collection profile of file extensions.
As a Custodian, I want to understand my collection, e.g. what makes it distinctive, and find others with similar collections so we can share the burden of maintaining access to rarer formats.
- Therefore, I want to compare a collection profile against one or more other collection profiles, to understand how the formats and composition vary across institutions.
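
The two user stories above both rest on a "collection profile" of file extensions. As a rough illustration only, here is a minimal Python sketch of what building and comparing such profiles might look like. The function names and the sample file listings are hypothetical and are not part of any existing project tool. Extension counts are also only a cheap first approximation of the formats actually present.

```python
from collections import Counter
from pathlib import PurePosixPath


def extension_profile(paths):
    """Count file extensions (lower-cased, without the dot) across a list of paths."""
    profile = Counter()
    for path in paths:
        suffix = PurePosixPath(path).suffix.lower().lstrip(".")
        profile[suffix or "(none)"] += 1
    return profile


def compare_profiles(mine, theirs):
    """Map each extension to its share of each collection, as a pair of fractions."""
    total_mine = sum(mine.values()) or 1
    total_theirs = sum(theirs.values()) or 1
    rows = {}
    for ext in sorted(set(mine) | set(theirs)):
        # Counter returns 0 for extensions it has never seen.
        rows[ext] = (mine[ext] / total_mine, theirs[ext] / total_theirs)
    return rows


def distinctive_extensions(mine, theirs):
    """Extensions that appear in `mine` but not at all in `theirs`."""
    return sorted(ext for ext in mine if ext not in theirs)


# Hypothetical file listings, standing in for two institutions' holdings.
collection_a = ["report.PDF", "scan.tif", "scan2.tif", "notes.wpd", "README"]
collection_b = ["paper.pdf", "image.tif", "image.jpg", "data.csv"]

profile_a = extension_profile(collection_a)
profile_b = extension_profile(collection_b)

for ext, (share_a, share_b) in compare_profiles(profile_a, profile_b).items():
    print(f"{ext:8} {share_a:6.1%} {share_b:6.1%}")

print("Only in A:", distinctive_extensions(profile_a, profile_b))
```

Working from extensions alone keeps the first pass cheap, which matches the Practitioner story. The rarer extensions it surfaces are then candidates for closer analysis with format identification tools.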

Licensing & Copyright

This is a draft policy and may be revised.

Source Data & Aggregate Data

We gather data from multiple sources for indexing, but the original data remains under the creators' terms of publication. Databases or other consumptive data sets remain bound by the terms of the original sources in each data set. Index data that is considered purely factual and non-consumptive may be made available under CC0 terms.

Note that these datasets are intended for research and analysis. They are not intended for re-use as part of an automated process (e.g. format identification), and are not suitable for use in that way. The aim of this project is to surface gaps, differences and conflicts between registries, so that interested parties can understand and resolve those tensions. Embedding these aggregated datasets in any automated process is likely to lead to inconsistent and unpredictable outcomes.
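
To illustrate what "surfacing gaps and conflicts" could mean in practice, here is a small Python sketch. It assumes each source has been reduced to simple (registry, extension, media type) records. The registry names, the records and the function names are all made up for illustration and do not describe the project's actual data model.

```python
from collections import defaultdict

# Illustrative records only: (registry name, file extension, media type).
# The registry names and values are made up for this sketch.
records = [
    ("registry_a", "wav", "audio/x-wav"),
    ("registry_b", "wav", "audio/wav"),
    ("registry_a", "pdf", "application/pdf"),
    ("registry_b", "pdf", "application/pdf"),
    ("registry_a", "wpd", "application/vnd.wordperfect"),
]


def index_by_extension(records):
    """Group claims by extension: {extension: {registry: media_type}}."""
    index = defaultdict(dict)
    for registry, extension, media_type in records:
        index[extension][registry] = media_type
    return index


def find_conflicts(index):
    """Extensions where registries assert different media types."""
    return {
        ext: claims
        for ext, claims in index.items()
        if len(set(claims.values())) > 1
    }


def find_gaps(index, registries):
    """Extensions that at least one of the named registries does not cover."""
    gaps = {}
    for ext, claims in index.items():
        missing = sorted(set(registries) - set(claims))
        if missing:
            gaps[ext] = missing
    return gaps


index = index_by_extension(records)
print("Conflicts:", find_conflicts(index))
print("Gaps:", find_gaps(index, ["registry_a", "registry_b"]))
```

The sketch reports disagreements rather than choosing a winner. That is why an aggregate like this suits analysis but not automated identification.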

Source Code

The project source code is neither intended nor suitable for embedding in closed or proprietary systems. As such, the default license for source code will be the AGPL-3.0.

If the project ends up creating tools or libraries that would be suitable for re-use, they will be distilled into separate repositories and made available under the terms of the MIT license. But in preference to that, wherever possible, the project will contribute to existing open source projects. Contributions to any third-party tools or libraries will be made under terms appropriate to that tool or library.

Documentation & Publications

Documentation published by the project will be made available under the terms of the AGPL-3.0 or CC-BY depending on context. Formal publications will be made available under CC-BY terms.

Project Plan & Timeline

The overall project progress and backlog are here. This table provides a high-level summary.

Quarter	Focus	Outputs
2023.Q1	- Project start-up, initial planning & comms. - 1st release of the DigiPres Publications Index.	blog
2023.Q2	- Begin updating the Format Aggregator to become the "Format Registry Index 2.0", including e.g. working with YUL to add basic software information from WikiData. - Create tools for comparing format profiles from registries with each other and with institutional holdings, starting with YUL. - Compare registry holdings and document initial results as "DigiPres Workbench 1.0".
2023.Q3	- Refine "DigiPres Workbench 1.0" based on internal feedback (DPC, Yale, PR-SIG). - Improved back-end for the "Format Registry Index 2.0", add more sources, add SQLite DB output. - Stakeholder engagement ahead of iPRES. - iPRES 2024: Present "DigiPres Publications Index 1.0" and launch "DigiPres Workbench 1.0" in this Workshop, gather feedback for future planning.
2023.Q4	- Update "DigiPres Workbench 1.x" based on feedback. - Collect and document the Hidden Gems, hopefully sharing some on WDPD. - Complete "Format Registry Index 2.0", working with YUL to link from the Format Index to EaaSI.
2024.Q1	- DigiPres Publication Index v2.0 - Work with Open Preservation Foundation to come up with a plan for the COPTR Tool Registry. - Start planning for the end of the project, ongoing funding for ANJ, etc.
2024.Q2	- ...TBC...
2024.Q3	- ...TBC...
2024.Q4	- ...TBC...

n.b. This is updated in Obsidian as Markdown tables are a bit of a pain.

Prototypes

A crucial part of the project is generating prototypes to help us explore what might be possible. Please let us know any feedback you have, good or bad!

The Format Aggregator

The format aggregator is at https://digipres.org/formats/

It predates this project, but its history, purpose and architecture are closely related to the current work. It will be used as the basis for an improved prototype.

It is hosted on GitHub Pages, as part of the https://digipres.org/ website. See https://github.com/digipres/digipres.github.io
The site is built from the source files using Jekyll, with a simple custom theme based on Bootstrap.
The aggregator code is in https://github.com/digipres/sentinel which uses daily GitHub Actions to gather the source information and update the digipres.github.io submodule.

This arrangement has been running for around ten years, with the aggregator only needing occasional updates and fixes (perhaps a few hours a year on average). The user interface is composed entirely of static resources hosted on free services, and so requires almost no maintenance.

The Digital Preservation Publications Index v1.0

The first new output of this project is at https://digipres.org/publications/

As described there, it pools records of digital preservation practice into a browseable and searchable form. It is initially focussed on surfacing the individual publications from the iPRES conference proceedings.

It is hosted on GitHub Pages, out of a separate repository: https://github.com/digipres/publications/ - the GitHub Pages service automatically deploys it as a sub-section of the parent site.
The site is built from the source files using Jekyll, using the Just The Docs theme.
The code that collects the iPRES publications and generates Markdown versions is in https://github.com/digipres/digipres-practice-index - it is not set up to run automatically as that is not appropriate in this case.
That code also generates an SQLite version of the data, which can be used in many tools. The publications site provides a suitable example using Datasette Lite.
The site is also set up to deploy on Netlify as https://digipres-org-publications.netlify.app/:
- This enables us to provide an alternative editor interface using DecapCMS, available at: https://digipres-org-publications.netlify.app/admin/
- It allows GitHub users to update some of the pages of the site, e.g. the pages for each conference, with all the authentication and content management handled via GitHub accounts and pull-requests.
- The Netlify account is at https://app.netlify.com/teams/digipres/overview and is only accessible by Andrew Jackson at present.
- Using DecapCMS is optional, as content can always be managed directly in GitHub or using other tools. There are also other ways of deploying DecapCMS than on Netlify, which can be explored if necessary.
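
The SQLite version of the data mentioned above can be explored with any SQLite client. As a minimal sketch, the Python below uses only the standard library to list the tables in a database and count their rows. The schema of the published file is not documented here, so the sketch inspects whatever tables exist rather than assuming names. The in-memory demo database, its `publications` table and its columns are hypothetical stand-ins.

```python
import sqlite3


def summarise_tables(connection):
    """Return {table_name: row_count} for every user table in the database."""
    cursor = connection.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name"
    )
    summary = {}
    for (name,) in cursor.fetchall():
        # Table names come from the database itself; quote them defensively.
        quoted = '"' + name.replace('"', '""') + '"'
        summary[name] = connection.execute(
            f"SELECT COUNT(*) FROM {quoted}"
        ).fetchone()[0]
    return summary


# Stand-in database so the sketch runs anywhere. The table and column names
# here are hypothetical, not the real schema of the published file.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE publications (title TEXT, year INTEGER)")
demo.executemany(
    "INSERT INTO publications VALUES (?, ?)",
    [("An example paper", 2004), ("Another example paper", 2019)],
)
print(summarise_tables(demo))
```

To try this against a downloaded copy of the data, pass the local file path to `sqlite3.connect()` in place of `":memory:"`.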

References to the DigiPres Publications Index:

Mentioned in this blog on making data available in SQLite and similar formats.
The iPRES 2024 site will link keywords to index search pages.
