cwrc / leaf-isle-bagger

Create Archival Information Packages (AIP) from the LEAF environment
GNU General Public License v3.0

LEAF Bagger

Supports a preservation workflow through the creation of Archival Information Packages (AIP) from a LEAF environment.

:warning: These command-line scripts are only compatible with CWRC Repository v2.0, which is based on the Linked Academic Editing Framework (LEAF). They replace the earlier preservation workflow, cwrc_preservation, used with CWRC Repository v1.0, which reached end-of-life on Jan 5th, 2025.

Overview

The LEAF Bagger preservation toolkit contains scripts supporting a preservation workflow for a LEAF environment. The primary objective is to manage the flow of content from the CWRC repository into an OpenStack Swift repository for preservation (the destination may be extended for partner projects). The toolkit also provides an application that audits the source objects against the preserved objects. The scripts are deployable within an OCI container to align with the deployment of CWRC Repository v2.0 and other LEAF installations.

Overview: Preservation workflow

What is CWRC

CWRC (Canadian Writing Research Collaboratory) is an “online infrastructure for literary research in and about Canada designed to meet the challenges and embrace the opportunities of the digital turn.” In practice, CWRC is a living repository: its content may be updated, for example, as facts and assertions about a person, place, or event are discovered or changed. As of August 2023, CWRC contains ~410,000 objects accounting for 1TB+ in storage. The content comes from multiple research projects created by researchers located in many areas of Canada.

CWRC infrastructure is hosted with the Digital Research Alliance of Canada on the Arbutus Cloud at the University of Victoria, with data backups in London ON and preservation with UofA Library (via OLRC).

Requirements

The Dockerfile in this repository describes the requirements and setup. An overview of requirements includes:

Workflow: Archival Package Creation

The preservation workflow follows a polling model: a script runs at regular intervals and asks the repository for a list of items that are new or changed within a given window of time. An Archival Information Package (AIP) is generated for each new or changed item and added to the preservation endpoint.

leaf-bagger.py

Result: a report of items added to the preservation endpoint.
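
The polling window described above can be illustrated with a minimal sketch. It is not the code in leaf-bagger.py: the function names, the item dictionary, and the sample timestamps are hypothetical, and the real script asks the repository for the list rather than filtering in memory. The sketch only shows how a window expressed in seconds (such as the 86400 default of LEAF_BAGGER_CROND_DATE_WINDOW) translates into a cutoff time.

```python
from datetime import datetime, timedelta, timezone


def window_start(now: datetime, window_seconds: int) -> datetime:
    """Return the earliest change time that still falls inside the polling window."""
    return now - timedelta(seconds=window_seconds)


def select_changed(items: dict[str, datetime], now: datetime, window_seconds: int) -> list[str]:
    """Return the ids of items changed within the last `window_seconds` seconds."""
    cutoff = window_start(now, window_seconds)
    return sorted(item_id for item_id, changed in items.items() if changed >= cutoff)


# Hypothetical sample data: item id -> last-changed timestamp (UTC).
now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
items = {
    "node-1": datetime(2024, 1, 2, 9, 0, tzinfo=timezone.utc),    # changed 3 hours ago
    "node-2": datetime(2023, 12, 30, 9, 0, tzinfo=timezone.utc),  # changed days ago
}

# Only items inside the 24-hour window would get an AIP in this run.
to_preserve = select_changed(items, now, window_seconds=86400)
print(to_preserve)
```

If the script runs less often than the window is long, changes can fall between runs, so the window is normally at least as long as the polling interval.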

How to recover from isolated failures

ToDo: what if a small percentage of items in a preservation run fail?

Workflow: Archival Package Audit

The preservation workflow includes an audit step that performs a basic check that what exists in the repository is also preserved in the preservation endpoint. This step assumes the AIP creation script will fail in unexpected ways and acts as a second set of eyes to identify and report failures.

leaf-bagger-audit.py

Result: an audit report, written as a CSV file, listing every node in the repository and its preservation status.
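
As a rough illustration of the audit idea, the sketch below compares a source listing with a preserved listing and produces CSV output. It is not the logic of leaf-bagger-audit.py: the column names, the status labels (ok, stale, missing), and the sample data are assumptions made for the example. Timestamps are assumed to be ISO 8601 strings in the same timezone and format, so plain string comparison orders them correctly.

```python
import csv
import io

FIELDS = ["id", "source_changed", "preserved_changed", "status"]


def audit(source: dict[str, str], preserved: dict[str, str]) -> list[dict[str, str]]:
    """Build one audit row per source node by comparing change timestamps."""
    rows = []
    for node_id in sorted(source):
        if node_id not in preserved:
            status = "missing"  # never reached the preservation endpoint
        elif preserved[node_id] < source[node_id]:
            status = "stale"    # preserved copy is older than the source
        else:
            status = "ok"
        rows.append({
            "id": node_id,
            "source_changed": source[node_id],
            "preserved_changed": preserved.get(node_id, ""),
            "status": status,
        })
    return rows


def to_csv(rows: list[dict[str, str]]) -> str:
    """Serialize audit rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical sample data: node id -> last-changed timestamp.
source = {
    "node-1": "2024-01-02T09:00:00Z",
    "node-2": "2024-01-02T10:00:00Z",
    "node-3": "2024-01-02T11:00:00Z",
}
preserved = {
    "node-1": "2024-01-02T09:00:00Z",
    "node-2": "2024-01-01T10:00:00Z",
}

rows = audit(source, preserved)
report = to_csv(rows)
print(report)
```

Reporting every node, not only the failures, keeps the report usable as a full inventory of preservation status.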

Tests & linting

The Nox Python automation tooling helps automate testing and linting. The tool is integrated as part of the CI/CD. The noxfile.py contains the configuration.

Install as per your OS, e.g., apt install nox

To run tests and linting:

nox

To run only tests:

nox -s test

To run only linting:

nox -s lint

To run tests outside nox:

python3 -m venv ./rootfs/leaf-isle-bagger/venv
./rootfs/leaf-isle-bagger/venv/bin/python3 -m pip install -r rootfs/leaf-isle-bagger/requirements.txt
./rootfs/leaf-isle-bagger/venv/bin/python3 -m pip install -r rootfs/leaf-isle-bagger/requirements_test.txt
./rootfs/leaf-isle-bagger/venv/bin/pytest rootfs/leaf-isle-bagger/tests/

CI/CD

Deployment

The scripts are meant to be executed within a containerized environment. For alternate approaches, review the Dockerfile layers for installation and docker-compose.yml for environment variable settings.

How to run from within a container

Dependencies

The OCI container image is based on the isle-bagger image and isle-buildkit. Access to a Drupal site is also required; the container can run within a leaf-base-i8 container deployment or independently (i.e., in a separate deployment).

Settings

Local settings: see isle-bagger and its parent containers for more settings (e.g., islandora-bagger tool settings). .env.sample contains a sample .env for docker-compose.

| Environment Variable | Default | Description |
| --- | --- | --- |
| LEAF_BAGGER_APP_DIR | /var/www/leaf-isle-bagger/ | The installed directory of islandora-bagger |
| LEAF_BAGGER_OUTPUT_DIR | /data/log/ | Report location describing AIP creation & upload |
| LEAF_BAGGER_AUDIT_OUTPUT_DIR | /data/log/ | Audit report location |
| LEAF_BAGGER_CROND_DATE_WINDOW | 86400 | Time window; return new/changed items in the last "x" seconds |
| OS_CONTAINER | | OpenStack container name |
| OS_AUTH_URL | | OpenStack auth URL |
| OS_PROJECT_ID | | OpenStack project ID |
| OS_PROJECT_NAME | | OpenStack project name |
| OS_USER_DOMAIN_NAME | | OpenStack user domain name |
| OS_PROJECT_DOMAIN_ID | | OpenStack project domain ID |
| OS_USERNAME | | OpenStack user name |
| OS_REGION_NAME | | OpenStack region name |
| OS_INTERFACE | | OpenStack interface |
| OS_IDENTITY_API_VERSION | | OpenStack identity API version |
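
For anyone scripting against these settings, the sketch below shows one way the LEAF_BAGGER_* variables and their documented defaults could be read in Python. The BaggerSettings class and load_settings function are hypothetical names for illustration and are not taken from the repository. Only the variable names and default values come from the table above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class BaggerSettings:
    """Hypothetical container for the LEAF_BAGGER_* settings."""
    app_dir: str
    output_dir: str
    audit_output_dir: str
    date_window_seconds: int


def load_settings(env: Optional[Mapping[str, str]] = None) -> BaggerSettings:
    """Read settings from `env` (os.environ by default), using the documented defaults."""
    env = os.environ if env is None else env
    return BaggerSettings(
        app_dir=env.get("LEAF_BAGGER_APP_DIR", "/var/www/leaf-isle-bagger/"),
        output_dir=env.get("LEAF_BAGGER_OUTPUT_DIR", "/data/log/"),
        audit_output_dir=env.get("LEAF_BAGGER_AUDIT_OUTPUT_DIR", "/data/log/"),
        # The window is documented in seconds, so it is converted to an integer.
        date_window_seconds=int(env.get("LEAF_BAGGER_CROND_DATE_WINDOW", "86400")),
    )


# Passing an explicit mapping keeps the example independent of the real environment.
defaults = load_settings({})
hourly = load_settings({"LEAF_BAGGER_CROND_DATE_WINDOW": "3600"})
print(defaults)
print(hourly.date_window_seconds)
```

Accepting the environment as a parameter makes the loader easy to exercise in tests without modifying os.environ.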

Two docker-compose secrets are also used:

| Secret | Description |
| --- | --- |
| BAGGER_DRUPAL_DEFAULT_ACCOUNT_PASSWORD | Drupal site password |
| OS_PASSWORD | OpenStack user password |
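
Docker Compose mounts each secret as a file under /run/secrets/ inside the container by default. The sketch below shows how such a file could be read from Python. The read_secret helper is hypothetical, and whether the LEAF Bagger scripts read the secrets this way or receive them through another mechanism in the parent images is an assumption not confirmed by this README.

```python
from pathlib import Path


def read_secret(name: str, secrets_dir: str = "/run/secrets") -> str:
    """Return the contents of a mounted secret file, without trailing newlines."""
    secret_path = Path(secrets_dir) / name
    return secret_path.read_text(encoding="utf-8").rstrip("\n")


if __name__ == "__main__":
    # Only meaningful inside a container where the secret is actually mounted.
    password = read_secret("OS_PASSWORD")
    print("OS_PASSWORD loaded, length:", len(password))
```

Reading the value at the point of use, rather than exporting it as an environment variable, keeps the password out of process listings and container inspection output.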

Docker-compose env vars

| Environment Variable | Default | Description |
| --- | --- | --- |
| LOCAL_AIP_DIR | | Set when using a bind mount; otherwise a Docker volume |
| LEAF_BAGGER_REPOSITORY | ghcr.io/cwrc | OCI image repository for the LEAF Bagger image; defaults to a local build |
| LEAF_BAGGER_TAG | latest | OCI image tag name for the LEAF Bagger image; defaults to latest for a local build |
| BAGGER_REPOSITORY | ghcr.io/cwrc | OCI image repository for the Isle Bagger image; defaults to a local build |
| BAGGER_TAG | latest | OCI image tag name for the Isle Bagger image; defaults to latest for a local build |

Updates

How to update the base?

For an isle-buildkit update (gist, follow dependencies in the Dockerfile layers)

Note: if you want to test leaf-isle-bagger and isle-bagger locally

See the following as an alternative to specifying an OCI image registry and tag in the Dockerfile: https://docs.docker.com/build/bake/reference/. As an example, see isle-buildkit docker-bake.hcl.