Awesome Digital Preservation
Carefully curated list of awesome digital preservation resources.
This Awesome List is one of a suite of community-owned resources for digital preservation. See digipres.org or the digipres.org discussion forum for more information.
Contributions are welcome. Please add links through pull requests, or create an issue to start a discussion. Please refer to CONTRIBUTING.md for detailed guidance. And if obsolescence claims something awesome, there's always the Archive.
The text of an annual reminder email about these resources is also held here, in reminder.md. This will be sent around various mailing lists once per year, ahead of World Digital Preservation Day.
Contents
Get Started
Save Digital Stuff Right Now
Spotted digital data at risk, but don't know who can save it?
Learn About Digital Preservation
Find Formats
We need to understand the file formats of the resources we care for, and the software they depend on.
If you have good examples of digital resources and their risks, please consider adding them to a test corpus.
Experiment with Tools
There are a lot of tools out there (see the tools section below), but some tools are particularly great for early experimentation. These tools can be used right in your web browser, so you can get started without installing software locally.
Remote Services
These tools are accessed using your browser, and work by sending a copy of your files to a remote server.
In-Browser Tools
These tools run entirely in your web browser, so no data is sent anywhere.
- Siegfried JS - This runs the Siegfried format identification tool on your files in your browser.
- CyberChef - The Cyber Swiss Army Knife. Capable of running lots of basic data operations on text or files, including computing things like MD5 or SHA hashes.
- warc-analyser - Proof-of-concept that analyses WARC files in your browser. See https://github.com/edsu/warc-analyzer for more information.
Engage Stakeholders
- Visual examples of digital preservation challenges, such as graphic corruption, can be incredibly useful in communicating the digital preservation message. That's why we built the Atlas of Digital Damages Gallery and website. Please add your own images of a digital preservation challenge, failed rendering, encoding damage, corrupt data, or other visual evidence of damage to the Atlas of Digital Damages.
- Use the POWRR One Pagers to educate stakeholders about the issues.
- Working with your IT department (some responses arising from this question on Twitter):
Become Part Of The Digital Preservation Community
Advance digital preservation by pooling our experience, sharing our stories and finding the answers to the big questions.
- Q&A:
- Forums
- Discussion forums and active blogs provide the opportunity to share informal advice and war stories, get recommendations and discuss the finer points of digital preservation. By sharing both your intentions for digital preservation work and your results, you can ensure your work benefits from a wealth of community experience.
- Discuss preservation issues on the Digital Curation forum
- Share war stories on OPF blogs
- Mastodon - Join these federations with a digital preservation or general GLAM focus:
- Twitter - Use these lists to find people to follow:
- r/DataHoarder - "We are digital librarians."
- r/Archiveteam - "Archive Team is a loose collective of rogue archivists, programmers, writers and loudmouths dedicated to saving our digital heritage."
- Join the Digital POWRR Slack
- Face-to-Face communities/support groups:
- Collaborations (inc. groups that build things together):
- Models, Standards & Certification:
- Conferences:
- Membership organizations:
Store Digital Content
Create Preservation Metadata
Find Test Files
To improve our digital preservation tools, we need to be able to test them and evaluate their performance. Publicly available sample files make this much easier. Tool developers can use them to test their work, discover bugs, and hone their tools ready for others to use. A test corpus can contain real digital objects from a collection, or be created specifically to exhibit certain characteristics for testing purposes. Real data, especially examples of broken, badly formed or corrupted files, can be particularly useful.
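When real damaged files are not available, synthetic ones can be derived from a known-good sample. The sketch below shows two simple kinds of deliberate damage, truncation and a single flipped byte. The helper names are hypothetical, and this is an illustration of the idea rather than how any of the corpora listed here were built.

```python
def truncate(data: bytes, keep_fraction: float = 0.5) -> bytes:
    """Simulate an interrupted transfer by keeping only the start of the data."""
    if not 0.0 <= keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be between 0 and 1")
    return data[: int(len(data) * keep_fraction)]


def flip_byte(data: bytes, offset: int) -> bytes:
    """Simulate bit rot by inverting every bit of one byte at the given offset."""
    damaged = bytearray(data)
    damaged[offset] ^= 0xFF
    return bytes(damaged)
```

Writing the results to new files next to the untouched original gives a tool a matched pair to compare: one file it should accept and one it should flag.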
Multi-format Corpora
- The OPF Format Corpus
- The iPres System Showcase Test Suite - Hosted by the UK Web Archive. Note that UKWA is offline at present.
- The Encyclopedia of Graphics File Formats Companion CD-ROM contains lots of test files for image formats:
- EDRM Data Set Files (archived version)
- digitalcorpora.org's corpora - including govdocs1.
- Open Preservation Foundation had a corpora page (archived version).
- digicam corpus - Contains a corpus of Digital Camera files collected by Tyler Thorsted.
- The Skeleton Test Suite - Builds test files from PRONOM binary and container signatures. These can be used to test DROID and other (compatible) identification tools.
- Fine Free File Test Suite - Set up for Fedora testing.
- JHOVE's test files
- JHOVE2's test files
- The disktype test files
- The Metadata Working Group specifications (archived version) and embedded image metadata test corpus (archived version)
- Apache Tika issue about setting up a nightly test corpus - See also tika-parsers/src/test/resources/test-documents
- The Chemical MIME Home Page
- Online-convert.com example files (use this link to browse the folder structure)
- RDSS Archivematica Test Data Corpus - A collection of research dataset files used for testing Archivematica integration and functionality in the JISC Research Data Shared Service (RDSS).
- Archivematica Sample Data - Includes OPF format corpus, as well as other test material.
- ExifTool test files
- PREFORMA Ground Truth Classes - Instructions on how to reproduce validation-failing files for Matroska, FFV1, LPCM, TIFF, and PDF formats.
- "Small" - Collection of "the smallest possible syntactically valid files in different programming/scripting/markup languages."
- MediaArea-RegressionTestingFiles - Public regression testing files for MediaArea. Contains AVI, FLV, MPEG Audio, MOV, MPEG-4, MPEG-PS, and Matroska files.
- TechSlides sample files for web development (archived version) - Sample files for various image formats, video files, data structures, fonts, and web development files.
- Internet File Formats - Companion CD-ROM to Internet File Formats, contains Sample Files and some File Format Specifications for a variety of common file formats circa 1995.
- Apache Tika's regression corpus - Millions of files collected largely from govdocs1 and Common Crawl with oversampling on binary formats.
- Apache Tika's Bugtracker corpora - Dense set of problematic files -- attachments from bug trackers for open source parsers.
Format-specific Corpora
PDF
ePub
TIFF
JPEG2000
Web Archives
Databases
Building Corpora
If the existing corpora aren't cutting it, perhaps you can contribute to the OPF Format Corpus hosted on GitHub. There's a guide here on how to contribute (archived version) or you can contact OPF for help on how to get involved.
Sourcing test files from web archives
Web archives can provide a useful source of files of particular formats. For example, search via the UKWA interface. Note that UKWA is offline at present.
Find More Tools
Software tools give us the means to interrogate, manipulate, understand and ultimately preserve our digital data. The Community Owned digital Preservation Tool Registry (COPTR) has unified five isolated tool registries. It provides an easy-to-edit wiki interface where we can share our knowledge about, and experiences with, tools used for digital preservation purposes.
Build Workflows
Resources to help build up preservation workflows, e.g. templates for how to use command-line tools, and how to chain things together.
Improve The Tools
Contributing to the development and improvement of tools is easy, even if you're not technical. Check out this guide to making small documentation edits, or raising issues on GitHub.
Improving Identification
Identifying file formats is the bread and butter of digital preservation characterisation and assessment. Identification tool coverage and accuracy could be much better, and this primarily comes down to the signatures, or file format "magic", used to identify each format. You can help contribute and make our identification tools more effective here:
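To make the idea of a signature concrete, here is a toy sketch of magic-byte identification in Python. The table holds a handful of widely documented leading byte sequences and is an assumption made for illustration; it is not taken from PRONOM or from any real identification tool, which use far richer signatures (offsets, wildcards, container inspection).

```python
# Toy signature table: (leading bytes, label). Illustrative only.
SIGNATURES = [
    (b"%PDF-", "PDF"),
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"GIF87a", "GIF"),
    (b"GIF89a", "GIF"),
    (b"\xff\xd8\xff", "JPEG"),
    (b"PK\x03\x04", "ZIP container"),
]


def identify(data: bytes) -> str:
    """Return the label of the first signature that matches the start of data."""
    for magic, label in SIGNATURES:
        if data.startswith(magic):
            return label
    return "unknown"
```

Note that the ZIP signature also matches formats packaged inside ZIP containers, such as EPUB and DOCX, which is one reason real tools need container-aware signatures and why contributed signatures matter.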
Improving Characterisation/Metadata Extraction
Deep file characterisation enables validation, identification of preservation risks and extraction of metadata. In developing a new characterisation capability, begin with thorough research to identify existing code to re-use or build on, develop a focused command-line tool, then consider turning it into a JHOVE module.
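As a rough illustration of what the core of a focused characterisation tool might look like, the sketch below reads the basic image properties from a PNG header. It relies on the PNG file layout (an 8-byte signature followed by the IHDR chunk); the function name and the shape of the returned dictionary are assumptions made for this example and are unrelated to JHOVE's own module interface.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def characterise_png(data: bytes) -> dict:
    """Extract width, height, bit depth and colour type from a PNG header.

    Raises ValueError if the data does not start like a well-formed PNG.
    """
    if not data.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG: signature missing")
    # Bytes 8-11 hold the chunk length, bytes 12-15 the chunk type.
    if len(data) < 26 or data[12:16] != b"IHDR":
        raise ValueError("malformed PNG: IHDR chunk not found")
    # IHDR payload starts at byte 16: two big-endian 32-bit integers,
    # then single bytes for bit depth and colour type.
    width, height, bit_depth, colour_type = struct.unpack(">IIBB", data[16:26])
    return {
        "width": width,
        "height": height,
        "bit_depth": bit_depth,
        "colour_type": colour_type,
    }
```

A command-line wrapper would only need to read the first 26 bytes of each file and print the result. Keeping the parsing in a small pure function like this makes it easy to test against corpus files.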