The "Registries of Good Practice" Project
MIT License

Introduction

The "Registries of Good Practice" Project is a collaboration between the Digital Preservation Coalition (DPC) and Yale University Library (YUL). The formal announcement can be found here. The project will explore and develop different approaches to analyze, collate, present and, most importantly, make discoverable the many existing registries and collections of digital preservation good practice.

What is a registry?

The idea of a 'technical registry' is a bit of jargon with a long history in the field of digital preservation. The general idea was already well established at iPRES 2004:

Cooperation has always been essential in the digital preservation community regarding knowledge exchange and collaboration in research activities. As initiatives increasingly turn to implementation, cooperation also gains practical significance. Initiatives embark on collaboratively building services that are required by various preservation systems. This presentation addresses file format registries. The preservation community jointly calls for a register that identifies and documents file formats, to come to terms with the myriad of different file formats. Activities towards building a file format registry are emerging, as already some preservation initiatives rely on such a future service in their current approaches.

From File format registries - a global infrastructure for local persistence

An earlier 2003 publication suggests this terminology is inherited from the IANA Media Type Registry:

"The current MIME Media Types registry does not provide sufficient granularity of format typing or sufficient standardized representation information about formats."

From Towards a global digital format registry

But the idea of registries of reference information goes back much further, as indicated in this GOV.UK publication about what it takes to be a register and by the venerable registries referenced from there (like the UK Land Registry).

But for the purposes of this project, we're not interpreting this too strictly. Perhaps in the past we have focussed too much on the registry we need to build and rushed too far ahead. We want to focus on the community of digital preservation practitioners, and understand their needs and capabilities. We want to learn from the registries (and other information sources) that have been well-used and have stood the test of time. Anything that people can use to improve how they do digital preservation is in scope.

Goals

To explore the following issues:

The intended outcomes are:

Communication and Collaboration

If you'd like to talk to me about the project, or anything else related to digital preservation and web archiving, feel free to book a call with me.

If you want to get involved, you could:

Updates (reverse chronological order)

Principles

While it’s not yet clear what the outputs of this project will be, we feel it’s important to establish some of the principles that we will use to guide our work. This isn’t an exhaustive list, and these rules may need to be refined over time. But we’ve seen a lot of registry projects come and go, and this is our attempt to articulate what we’ve learned from that, and how that will shape our work.

No New Registries

We’ve been involved in digital preservation for a long time, and have seen a lot of short-term project funding spent on grand visions of what preservation registries could be. While those research projects have been very valuable in helping us understand what might be possible, very few have successfully made the transition from prototype to production. We believe most of these projects underestimated the barriers to adoption, the effort required to fill these registries, and the difficulties involved in managing data over time (especially around maintaining parallel ‘forks’ of curated datasets).

This project is not about ‘owning’ or ‘solving’ the registry problem. Maintaining registries of technical information is difficult work and requires a commitment of time and skills that are in short supply across our sector. It is very important to us that whatever we do will respectfully highlight the long-term and ongoing efforts that people and organisations have put into the registries we all depend on.

Empower Others

This project will deliver one full-time equivalent of effort over two short years, so how do we maximise the long-term returns from this brief pulse of funding? We believe that, rather than replicating the kind of work that is already being done, it will be better for everyone if we find ways of contributing that help empower the existing communities in and around preservation registries. Therefore, we will start by building a deeper understanding of ‘the registries problem’ from a practical point of view, and look for ways to help.

Can we build tools that will help registry maintainers work more quickly and confidently? Can we provide resources that will help grow the community of people who use and contribute to registries? Can we make it easier to learn from the past and avoid common pitfalls or reinvented wheels? How can we be a force-multiplier for good practice?

Be Realistic

While we want to be ambitious and try new things, we also have to be realistic about the limitations of this project: it’s one full-time equivalent person for two years. We believe we can make a significant contribution to the practice of digital preservation, but this will require clarity of purpose and focussed prioritisation.

Similarly, we understand there are many pressures and complicating factors that mean the teams running our registries are already overstretched. We will aim to support those teams, and to avoid encouraging unreasonable demands or expectations.

Be Useful

This might sound obvious, but over the years there have been a lot of “if you build it they will come” digital preservation projects that foundered because practitioners were unable to integrate those products into their work. Whatever we build must be driven by genuine user need and must work with or around the practical barriers that practitioners, registry maintainers and vendors face.

This is where wider community feedback is critically important. We can’t tell if we’re succeeding unless you let us know whether what we’re doing is actually useful!

Iterate Early, Often, and in the Open

It is not clear what the shared needs and common barriers are for the digital preservation community, and safe spaces for discussion are only part of the solution. To ensure we’re on the right track, any tools or resources we develop must be rapidly and openly iterated. Broadly speaking, rather than gathering formal requirements, we will focus on generating experimental prototypes to probe the issues and provoke discussion. The biggest benefit of time-limited project funding is the opportunity to fail, and to learn from the failures. 

Build on the Work of Others

There is so much great work out there, and we want to learn from it, build on it, and shout about it from the rooftops. Not just the preservation registries, but also many different resources and tools from a range of institutions and individuals. Publicising the good work of others will be critical to our success.

Our biggest fear is accidentally leaving someone out! There’s a lot going on and the person doing most of the work on this project has been heavily focussed on web archiving in recent years.  Please don’t be offended if we seem to miss you out, and please don’t assume anything is obvious to us! Please get in touch!

Make it Easy to Maintain

While we hope to be able to continue at least some of this work beyond this initial two years, there are no guarantees. Therefore, it is critical that the output of the project is something that the DPC and the wider digital preservation community can maintain.

At first, while exploring and experimenting, we can relax this constraint a little, but it will always be borne in mind that the final results cannot be something that requires a lot of complex infrastructure or frequent maintenance. Quite what this means is also unclear at this point, but it’s safe to say it’s more like minimal computing and less like ChatGPT.

User Needs

This section is a very rough early draft.

We want the outputs of the project to be useful, so we want to focus clearly on user needs. This is in part based on the GOV.UK advice on user research, with the possibility of future Wardley mapping of the technology landscape being kept in mind.

Based on our previous experience, we're avoiding leaping immediately to user requirements. We have found that enumerating requirements up front tends to get lost in the details and become overloaded with discordant expectations and unvoiced assumptions. We want to start by thinking about needs and capabilities in context, before getting into user stories as ways of framing requirements.

Roles

To make sure the needs are clear, it's necessary to make sure we clearly identify the different user roles. Note that individual users may act in more than one of these roles.

Journeys and Needs

We are currently working on understanding the user journeys and needs relating to practice and registries in digital preservation. At this stage, the work focuses on how things work at Yale University Library. We will share what we find, and look to work with other Practitioners to refine and grow the results of this work.

In parallel with that work, we are building prototypes exploring ways of answering some of the questions we think are likely to come up:

The work on user journeys should help establish the relevance of these questions, and where the Practitioners are when they start asking these questions...

User Stories

Licensing & Copyright

This is a draft policy and may be revised.

Source Data & Aggregate Data

We gather data from multiple sources for indexing, but the original data remains under the creators' terms of publication. Databases or other consumptive data sets remain bound by the terms of the original sources in each data set. Index data that is considered purely factual and non-consumptive may be made available under CC0 terms.

Note that these datasets are intended for research and analysis. They are not intended for re-use as part of an automated process (e.g. format identification), and are not suitable for use in that way. The aim of this project is to surface gaps, differences and conflicts between registries, so that interested parties can understand and resolve those tensions. Embedding these aggregated datasets in any automated process is likely to lead to inconsistent and unpredictable outcomes.
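To make "gaps, differences and conflicts" a little more concrete, here is a minimal sketch of the kind of analysis we have in mind. It is not the project's actual code: the registry names, the sample records and the function names are all hypothetical, and it assumes each registry can be reduced to simple (registry, extension, media type) rows, which real registries only partly allow.

```python
from collections import defaultdict

# Hypothetical sample rows: (registry, file extension, media type).
# These are illustrative values, not taken from any real registry.
RECORDS = [
    ("registry_a", "tif", "image/tiff"),
    ("registry_b", "tif", "image/tiff"),
    ("registry_a", "wav", "audio/x-wav"),
    ("registry_b", "wav", "audio/wav"),
    ("registry_a", "warc", "application/warc"),
]


def index_by_extension(records):
    """Group media types by extension, then by registry."""
    index = defaultdict(lambda: defaultdict(set))
    for registry, extension, media_type in records:
        index[extension.lower()][registry].add(media_type)
    return index


def find_conflicts(index):
    """Return extensions where registries list different sets of media types."""
    conflicts = {}
    for extension, by_registry in index.items():
        distinct = {frozenset(types) for types in by_registry.values()}
        if len(distinct) > 1:
            conflicts[extension] = {
                registry: sorted(types) for registry, types in by_registry.items()
            }
    return conflicts


def find_gaps(index, registries):
    """Return extensions that at least one of the given registries does not cover."""
    gaps = {}
    for extension, by_registry in index.items():
        missing = sorted(set(registries) - set(by_registry))
        if missing:
            gaps[extension] = missing
    return gaps


if __name__ == "__main__":
    idx = index_by_extension(RECORDS)
    print("Conflicts:", find_conflicts(idx))
    print("Gaps:", find_gaps(idx, ["registry_a", "registry_b"]))
```

The output is a report for a person to read and follow up, not a merged "answer". That is deliberate: deciding which registry is right is a job for the maintainers, which is also why these datasets are unsuitable for automated use.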

Source Code

The project source code is neither intended nor suitable for embedding in closed or proprietary systems. As such, the default license for source code will be the AGPL-3.0.

If the project ends up creating tools or libraries that would be suitable for re-use, they will be distilled into separate repositories and made available under the terms of the MIT license. But in preference to that, wherever possible, the project will contribute to existing open source projects. Contributions to any third-party tools or libraries will be made under terms appropriate to that tool or library.

Documentation & Publications

Documentation published by the project will be made available under the terms of the AGPL-3.0 or CC-BY depending on context. Formal publications will be made available under CC-BY terms.

Project Plan & Timeline

The overall project progress and backlog is here. This table provides a high-level summary.

| Quarter | Focus | Outputs |
|---|---|---|
| 2023.Q1 | - Project start-up, initial planning & comms.<br>- 1st release of the DigiPres Publications Index. | blog |
| 2023.Q2 | - Begin updating the Format Aggregator to become the "Format Registry Index 2.0", including e.g. working with YUL to add basic software information from WikiData.<br>- Create tools for comparing format profiles from registries with each other and with institutional holdings, starting with YUL.<br>- Compare registry holdings and document initial results as "DigiPres Workbench 1.0". | |
| 2023.Q3 | - Refine "DigiPres Workbench 1.0" based on internal feedback (DPC, Yale, PR-SIG).<br>- Improved back-end for the "Format Registry Index 2.0", add more sources, add SQLite DB output.<br>- Stakeholder engagement ahead of iPRES.<br>- iPRES 2023: Present "DigiPres Publications Index 1.0" and launch "DigiPres Workbench 1.0" in this Workshop, gather feedback for future planning. | |
| 2023.Q4 | - Update "DigiPres Workbench 1.x" based on feedback.<br>- Collect and document the Hidden Gems, hopefully sharing some on WDPD.<br>- Complete "Format Registry Index 2.0", working with YUL to link from the Format Index to EaaSI. | |
| 2024.Q1 | - DigiPres Publications Index v2.0<br>- Work with Open Preservation Foundation to come up with a plan for the COPTR Tool Registry.<br>- Start planning for the end of the project, ongoing funding for ANJ, etc. | |
| 2024.Q2 | ...TBC... | |
| 2024.Q3 | ...TBC... | |
| 2024.Q4 | ...TBC... | |

n.b. This is updated in Obsidian as Markdown tables are a bit of a pain.
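As a rough illustration of two of the items in the plan (comparing registry coverage with institutional holdings, and SQLite output), the following sketch loads both into an in-memory SQLite database and asks which held formats a given registry does not cover. The table names, column names, registry names and format identifiers are placeholders invented for this example. It also assumes that holdings and registries share one identifier scheme, which in practice they often do not.

```python
import sqlite3

# Placeholder data: registry entries and a holdings profile (format id, file count).
REGISTRY_ROWS = [
    ("registry_a", "format-1"),
    ("registry_a", "format-2"),
    ("registry_b", "format-1"),
]
HOLDINGS_ROWS = [
    ("format-1", 1200),
    ("format-2", 300),
    ("format-3", 45),
]


def build_db(registry_rows, holdings_rows):
    """Load registry entries and a holdings profile into an in-memory SQLite DB."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE registry_formats (registry TEXT, format_id TEXT)")
    conn.execute("CREATE TABLE holdings (format_id TEXT, file_count INTEGER)")
    conn.executemany("INSERT INTO registry_formats VALUES (?, ?)", registry_rows)
    conn.executemany("INSERT INTO holdings VALUES (?, ?)", holdings_rows)
    conn.commit()
    return conn


def uncovered_holdings(conn, registry):
    """Return held formats absent from the named registry, largest first."""
    query = """
        SELECT h.format_id, h.file_count
        FROM holdings AS h
        LEFT JOIN registry_formats AS r
          ON r.format_id = h.format_id AND r.registry = ?
        WHERE r.format_id IS NULL
        ORDER BY h.file_count DESC
    """
    return conn.execute(query, (registry,)).fetchall()


if __name__ == "__main__":
    db = build_db(REGISTRY_ROWS, HOLDINGS_ROWS)
    for name in ("registry_a", "registry_b"):
        print(name, uncovered_holdings(db, name))
```

Swapping `":memory:"` for a file path would produce a single database file that can be published alongside static pages, which is one reason SQLite appeals for a low-maintenance project like this one.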

Prototypes

A crucial part of the project is generating prototypes to help us explore what might be possible. Please let us know any feedback you have, good or bad!

The Format Aggregator

The format aggregator is at https://digipres.org/formats/

It predates this project, but its history, purpose and architecture are closely related to the current work. It will be used as the basis for an improved prototype.

This arrangement has been running for around ten years, with the aggregator only needing occasional updates and fixes (perhaps a few hours a year on average). The user interface is composed entirely of static resources hosted on free services, and so requires almost no maintenance.

The Digital Preservation Publications Index v1.0

The first new output of this project is at https://digipres.org/publications/

As described there, it pools records of digital preservation practice into browseable and searchable form. It is initially focussed on surfacing the individual publications from the iPRES conference proceedings.
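For anyone curious what "searchable" can mean with very little infrastructure, here is a minimal sketch of a keyword index over publication titles. It is not how the Publications Index is actually implemented: the record structure and function names are assumptions made for illustration, and the two sample records simply reuse the titles and years quoted earlier in this README.

```python
import re
from collections import defaultdict

# Sample records, reusing the two publications quoted earlier in this README.
PUBLICATIONS = [
    {"year": 2004, "title": "File format registries - a global infrastructure for local persistence"},
    {"year": 2003, "title": "Towards a global digital format registry"},
]


def tokenize(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(records):
    """Map each title token to the positions of the records that contain it."""
    index = defaultdict(set)
    for position, record in enumerate(records):
        for token in tokenize(record["title"]):
            index[token].add(position)
    return index


def search(index, records, query):
    """Return records whose titles contain every token in the query."""
    tokens = tokenize(query)
    if not tokens:
        return []
    matches = set.intersection(*(index.get(token, set()) for token in tokens))
    return [records[i] for i in sorted(matches)]


if __name__ == "__main__":
    idx = build_index(PUBLICATIONS)
    for hit in search(idx, PUBLICATIONS, "global format"):
        print(hit["year"], hit["title"])
```

Note that this exact-token matching treats "registry" and "registries" as different words, so a real index would need stemming or a proper search library. An index of this kind can be built once and served as a static file, in keeping with the minimal-maintenance principle above.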

References to the DigiPres Publications Index: