AIPscan was developed to provide a more in-depth reporting solution for Archivematica users. It crawls METS files from AIPs in the Archivematica Storage Service to generate tabular and visual reports about repository holdings. It is designed to run as a stand-alone add-on to Archivematica. It only needs a valid Storage Service API key to fetch source data.
Apache License Version 2.0
Copyright Artefactual Systems Inc (2021)
AIPscan is a web-based application that is built using the Python Flask micro-framework. Below are the developer quickstart instructions. See INSTALL for production deployment instructions. See CONTRIBUTING for guidelines on how to contribute to the project, including how to create a new AIPscan report.
git clone https://github.com/artefactual-labs/AIPscan && cd AIPscan
virtualenv -p python3 venv
source venv/bin/activate
pip install -r requirements/base.txt
make build
export FLASK_CONFIG=dev
python run.py
Go to localhost:5000 in your browser. You should see a blank AIPscan page.

AIPscan can optionally be run using Typesense as a report data source, potentially reducing the time needed to generate reports. If Typesense is installed and enabled, AIPscan data will be automatically indexed after each fetch job and report queries will pull data from Typesense rather than the application's database.
Typesense can be installed in a variety of ways, as detailed on the Typesense website.
Typesense configuration is done using the following environment variables:
TYPESENSE_API_KEY
TYPESENSE_HOST (default "localhost")
TYPESENSE_PORT (default "8108")
TYPESENSE_PROTOCOL (default "http")
TYPESENSE_TIMEOUT_SECONDS (default "30")
TYPESENSE_COLLECTION_PREFIX (default "aipscan_")

Typesense support is enabled by setting TYPESENSE_API_KEY.
Here's an example:
TYPESENSE_API_KEY="xOxOxOxO" python run.py
Two CLI tools exist: one to manually index AIPscan's database and one to display a summary of the Typesense index.
tools/index-refresh
tools/index-summary
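For example, both tools could be run from the AIPscan base directory like this (a sketch, assuming a standard virtualenv install and that TYPESENSE_API_KEY is passed the same way as for the main application; the key value is a placeholder):

```shell
# Activate the project's virtual environment, then refresh the
# Typesense index and print a summary of its contents.
source venv/bin/activate
TYPESENSE_API_KEY="xOxOxOxO" ./tools/index-refresh
TYPESENSE_API_KEY="xOxOxOxO" ./tools/index-summary
```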
Crawling and parsing many Archivematica AIP METS xml files at a time is resource intensive. Therefore, AIPscan uses the RabbitMQ message broker and the Celery task manager to coordinate this activity as background worker tasks. Both RabbitMQ and Celery must be running properly before attempting a METS fetch job.
You can download and install RabbitMQ server directly on your local or cloud machine, or you can run it from a Docker container.
docker run --rm \
-it \
--hostname my-rabbit \
-p 15672:15672 \
-p 5672:5672 rabbitmq:3-management
In another terminal window, start the RabbitMQ server:
export PATH=$PATH:/usr/local/sbin
sudo rabbitmq-server
The RabbitMQ management interface is available at http://localhost:15672/ (username: guest, password: guest). The broker itself listens on port 5672.

Celery is installed as a Python module dependency during the initial AIPscan requirements install command: pip install -r requirements.txt
To start up Celery workers that are ready to receive tasks from RabbitMQ:
source venv/bin/activate
celery -A AIPscan.worker.celery worker --loglevel=info
Requires Docker CE and Docker Compose.
Clone the repository and go to its directory:
git clone https://github.com/artefactual-labs/AIPscan
cd AIPscan
Build images, initialize services, etc.:
docker-compose up -d
Optional: attach AIPscan to the Docker Archivematica container network directly:
docker-compose -f docker-compose.yml -f docker-compose.am-network.yml up -d
In this case, the AIPscan Storage Service record's URL field can be set with the Storage Service container name:
http://archivematica-storage-service:8000
Access the logs:
docker-compose logs -f aipscan rabbitmq celery-worker
Shut down the AIPscan Docker containers:
docker-compose down
Shut down the AIPscan Docker containers and remove the rabbitmq volumes:
docker-compose down --volumes
For production deployments, it's recommended to use MySQL instead of SQLite. This can be achieved by exporting an environment variable named SQLALCHEMY_DATABASE_URI for the Celery and AIPscan services that points to MySQL using the format mysql+pymysql://user:pass@host/db.
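For example, the variable could be exported before starting the services. The sketch below simply assembles the URI from placeholder credentials (substitute real values for your MySQL server):

```shell
# Placeholder credentials for illustration only.
DB_USER="aipscan"
DB_PASS="secret"
DB_HOST="localhost"
DB_NAME="aipscan"

# Build the SQLAlchemy connection URI in the mysql+pymysql format.
export SQLALCHEMY_DATABASE_URI="mysql+pymysql://${DB_USER}:${DB_PASS}@${DB_HOST}/${DB_NAME}"
echo "$SQLALCHEMY_DATABASE_URI"  # → mysql+pymysql://aipscan:secret@localhost/aipscan
```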
SQLite databases can be migrated using sqlite3mysql:

/usr/share/archivematica/virtualenvs/AIPscan/bin/pip install sqlite3-to-mysql
/usr/share/archivematica/virtualenvs/AIPscan/bin/sqlite3mysql -f aipscan.db -d <mysql database name> -u <mysql database user> --mysql-password <mysql database password>
The tools directory contains scripts that can be run by developers and system administrators.
The test data generator tool, tools/generate-test-data, populates AIPscan's database with randomly generated example data.
The AIP fetch tool, tools/fetch_aips
, allows all, or a subset, of a storage
service's packages to be fetched by AIPscan. Any AIPs not yet fetched by
AIPscan will be added but no duplicates will be added if an AIP has already
been fetched. Any AIPs that have been newly marked as deleted will be removed
from AIPscan.
When using the script, the storage service's list of packages can optionally be grouped into "pages", with each "page" containing a number of packages specified by a command-line argument. For example, a storage service holding 150 packages could be fetched as three pages of 50 packages. Likewise, a storage service holding anywhere from 101 to 149 packages could also be fetched as three pages of 50 packages, with the final page only partially full.
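The page arithmetic above is just ceiling division; a quick shell sketch (the package counts are hypothetical):

```shell
# Ceiling division gives the number of "pages" (fetch runs) needed
# to cover every package on the storage service.
packages_per_page=50

total_packages=150
pages=$(( (total_packages + packages_per_page - 1) / packages_per_page ))
echo "$pages"  # → 3

# Anywhere from 101 to 149 packages also needs three pages;
# the last page is only partly full.
total_packages=101
pages=$(( (total_packages + packages_per_page - 1) / packages_per_page ))
echo "$pages"  # → 3
```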
If using cron, or some other scheduler, to automatically fetch AIPs using this tool, consider using the --lockfile option to prevent overlapping executions of the tool.
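A hypothetical crontab entry illustrating this (the schedule, all paths, and everything except the --lockfile option itself are placeholders):

```shell
# Run the AIP fetch tool nightly at 02:00; --lockfile makes the tool
# skip the run if a previous invocation is still in progress.
0 2 * * * cd /path/to/AIPscan && ./tools/fetch_aips --lockfile /var/lock/aipscan-fetch.lock
```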
A storage service's list of packages is downloaded by the script and is cached so paging, if used, will remain consistent between script runs. The cache of a particular cached list of packages is identified by a "session descriptor". A session descriptor is specified by whoever runs the script and can be any alphanumeric identifier without spaces or special characters. It's used to name the directory in which fetch-related files are created.
Below is what the directory structure would end up looking like if the session descriptor "somedescriptor" was used, showing where the packages.json file, containing the list of a storage service's packages, would be put.
AIPscan/Aggregator/downloads/somedescriptor
├── mets
│ └── batch
└── packages
└── packages.json
NOTE: Each run of the script will generate a new fetch job database entry. These individual fetch jobs shouldn't be deleted, via the AIPscan web UI, until all fetch jobs (for each "page") have run. Otherwise the cached list of packages will be deleted and the package list will have to be downloaded again.
These should be run using the same system user and virtual environment that AIPscan is running under.
Here's how you would run the generate-test-data tool, for example:
cd <path to AIPscan base directory>
sudo -u <AIPscan system user> /bin/bash
source <path to AIPscan virtual environment>/bin/activate
./tools/generate-test-data
In order to display a tool's CLI arguments and options, enter <path to tool> --help.
To generate database documentation, using Schemaspy run via Docker, enter the following:
sudo make schema-docs
Database documentation will be written to the output directory and can be viewed in a web browser by opening index.html.