NHMDenmark / DaSSCo-Tranche-1-work

DaSSCo Tranche 1 work

Asset Registry System (ARS): Development Phase 1 #29

Closed PipBrewer closed 4 months ago

PipBrewer commented 4 months ago

From finalised SKI agreement (Customer Mission Statement and Activity Plan)

1.1. Title, background and purpose of the assignment

The Natural History Museum of Denmark (part of the University of Copenhagen) is leading a multi-institutional effort to digitise all natural history specimens in Denmark, called DaSSCo (Danish System of Scientific Collections). DaSSCo has received initial funding from the Ministry of Higher Education and Science to put the infrastructure in place to facilitate this ambitious plan. The overall purpose of DaSSCo is to make it easier for stakeholders such as researchers and the public to get access both to digital representations of the physical assets (primarily preserved animals, plants and fungi) and to related information about the assets, thus facilitating more efficient research processes and engagement with the natural world. To do this, they will create digital surrogates of specimens and labels (primarily images, but also including audio files, point cloud data, and 3D imaging data such as Computed Tomography (CT) scans) and transcribe data from labels so that it is findable.

Danish natural history institutions together hold over 20 million physical objects, with data (such as the scientific names of the animals, plants and fungi held in their collections) about these physical assets (and their digital twins), stored in a database called Specify. Most DaSSCo institutions have their own (slightly differing) implementation of Specify. The different implementations of Specify are managed by the Natural History Museum of Denmark, where the data is also physically stored. Media (digital assets) associated with the specimens are held on a web asset server which communicates with Specify. Unfortunately, the current storage solution is slow, has limited functionality and is not scalable for DaSSCo (which anticipates adding and updating digital assets at a rate of up to 5 petabytes every 5 years once the infrastructure is fully functional).

Access to digital assets will mostly be via Specify, with assets pushed to a data portal (to be developed at a later stage in the project) and to other publishers on the web. However, a subset of this data will not be available via Specify (e.g., CT scans or full-resolution images). In these cases, access to the assets will need to be made directly to their storage location. In all cases, access to the digital assets must be managed with an access control layer (this is not possible in the current solution).

To support the digitisation effort, project metadata captured during the digitisation process, together with selected data harvested from Specify, needs to be captured in a registry to monitor progress, do quality assurance checks, and mediate access. Data should be able to flow both ways between the registry and Specify. This registry should capture the relations between the digital assets and metadata and must allow for queries into this data structure. During the life cycle of digital assets, each asset will undergo a series of transformations (essentially an Extract, Transform, Load (ETL) process) which may lead to new metadata and new digital assets (e.g. a processed image may be created from an original raw image). The registry must support this ETL process workflow.
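To make the ETL lineage requirement concrete, here is a minimal sketch of a registry record that links a derived asset back to the asset it was created from. The field names and structure are assumptions for illustration only, not part of the brief:

```python
from dataclasses import dataclass, field
from typing import Optional
import uuid


@dataclass
class AssetRecord:
    """One digital asset and its registry metadata (illustrative fields only)."""
    guid: str = field(default_factory=lambda: str(uuid.uuid4()))
    institution: str = ""
    collection: str = ""
    specimen_barcode: Optional[str] = None
    original_filename: Optional[str] = None  # kept for audit purposes
    parent_guid: Optional[str] = None        # set when derived from another asset
    pipeline_step: Optional[str] = None      # ETL transformation that produced it


def derive(parent: AssetRecord, step: str) -> AssetRecord:
    """Register a new asset produced from an existing one, e.g. a processed
    image created from an original raw image, preserving the lineage."""
    return AssetRecord(
        institution=parent.institution,
        collection=parent.collection,
        specimen_barcode=parent.specimen_barcode,
        parent_guid=parent.guid,
        pipeline_step=step,
    )


raw = AssetRecord(institution="NHMD", collection="Entomology",
                  specimen_barcode="NHMD-0012345", original_filename="IMG_0001.CR2")
processed = derive(raw, "colour-corrected JPEG")
```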

Digitised assets already exist and will need to be transferred. Larger scale digitisation of physical assets is planned to start in March 2023 and will gradually increase in scale. Data pipelines need to connect to the storage and registry and to Specify by April-May 2023, although some iterative development of the registry and connecting it with Specify can be delayed slightly beyond this date if necessary.

The tasks which form the basis of this document are a key part of the IT infrastructure which needs to be put in place as part of DaSSCo. They are indicated in red in the process diagram below, which shows the overall data pipelines being developed.

1.2. Description of the task

Requirements for delivery

All elements come under SKI Service area 2.5.1, Subarea: Establishing an IT architecture, and include the implementation and testing of the solutions.

The architecture should be complete and the storage/registry solution in place by April-May 2023. Iterative development of the registry interface and the synchronization with Specify may continue beyond April 2023 if necessary, but should ideally be complete before this date. Consultation and documentation must be in English. Documentation should be clear and concise, and available in formats and language that can be used and interpreted by other IT professionals and key non-technical stakeholders.

Open-source solutions should be prioritized, with long-term maintenance costs minimized as much as possible. It is expected that maintenance, development and bug fixing beyond this consultancy will be delivered in-house.

Requirements for architecture (20% of task)

Deliver a solution architecture for the storage and registry of digital assets associated with Museum specimens, which works alongside and interfaces with the physical specimen Collection Management System (Specify) and the University of Copenhagen IT architecture and systems. The task involves consultation and delivery of an agreed overview of data models and data flows, including full documentation. This task will involve some research of similar solutions within the sector (the customer can assist with this). The architecture should incorporate relevant specifics detailed below, such as access control.

Requirements for storage (20% of task)

The physical storage location for all institutions involved in DaSSCo is based at the University of Copenhagen. Storage is compartmentalized based on the speed at which it can be accessed. DaSSCo media and associated metadata will be stored in at least 2 compartments. Currently, media available via Specify are accessed via a web asset server. However, that server is slow and limited in functionality. There are also concerns regarding the scalability of this current storage solution. When DaSSCo is fully operational, data could be added at a rate of up to 1 petabyte per year. Uploading, finding, and downloading this data must be fast and efficient. Other requirements for the storage solution are controlled access, the ability to link assets within it to external publishers, and the ability to harvest data from and share data with the physical specimen Collection Management System (Specify).

Specifications for controlled access are:
• Prevent mass download in one go, particularly of large files.
• Controlled access is needed to non-image assets (documents).
• Embargoes or restrictions on individual data (this information is held in the metadata file and derives from Specify). Assets with these restrictions should be shareable on a case-by-case basis.
• There should be roles with different access levels (full administrator access, expert user and read-only user access) for internal access to storage and metadata.
• Editing a digital asset will result in a new asset and associated metadata record being created. Editing of the associated metadata record should be controlled through roles, but should be possible.
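As an illustration only, the three internal roles and the case-by-case sharing of embargoed assets might translate into access checks along these lines (the role names come from the list above; everything else is assumed):

```python
from enum import Enum


class Role(Enum):
    ADMIN = "full administrator"
    EXPERT = "expert user"
    READ_ONLY = "read only user"


def may_edit_metadata(role: Role) -> bool:
    # Metadata records can be edited, but only by sufficiently privileged roles.
    return role in (Role.ADMIN, Role.EXPERT)


def may_share_externally(embargoed: bool, case_by_case_approved: bool = False) -> bool:
    # Embargoed/restricted assets are shareable only on a case-by-case basis.
    return (not embargoed) or case_by_case_approved
```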

The media stored will largely be images of physical specimens and their labels (the physical specimens are uniquely identified by the combination of their institution name, collection name and barcode number). Following the transfer of data from the existing web asset server, new media will be derived from two main sources: harvesting from Specify in response to an upload event in Specify, and upload following data processing as part of a mass digitisation pipeline. Both are expected to be accompanied by EXIF data embedded in the asset and a metadata file (e.g., as JSON).

Locations within the storage should be permanently resolvable to a unique URL, which can be shared externally. It is expected that the unique path will be a GUID which will also form the file name or handle of the asset. Each digital asset, regardless of whether it is a derivative (such as a down-sampled version of another image), will have a unique file name (or GUID). A future requirement (not part of the current consultancy) may include the use of other types of persistent identifiers such as DOIs.
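A minimal sketch of how a GUID might be assigned as the file name/handle and turned into a permanently resolvable URL is shown below; the base URL and the handling of file extensions are assumptions, not requirements from the brief:

```python
import uuid
from pathlib import Path

# Hypothetical base for the permanently resolvable URLs; the real host is not
# specified in the brief.
BASE_URL = "https://assets.dassco.example.org"


def assign_guid(original_path: Path) -> tuple[str, str, str]:
    """Return (guid, new_filename, persistent_url) for an incoming asset.
    The original file name is kept separately in the registry for audit."""
    guid = str(uuid.uuid4())
    new_filename = f"{guid}{original_path.suffix.lower()}"
    persistent_url = f"{BASE_URL}/assets/{guid}"
    return guid, new_filename, persistent_url


guid, filename, url = assign_guid(Path("IMG_0001.tif"))
```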

The proposed architecture and solutions should be documented and discussed with the customer, who will approve them prior to the next stages of work commencing.

Requirements for registry (30% of task)

Each asset in the storage will have accompanying metadata which needs to be stored, searched, retrieved, viewed and edited (individually or as a bulk operation). It must be possible to construct searches using multiple search parameters across multiple metadata fields, using Boolean operators and including date ranges. The results of a search should be available as summaries for reporting or as sets of metadata records which can be scrutinized. It should be possible to navigate through metadata records in a GUI and access the related digital asset described by the metadata via a simple click in the GUI. The metadata fields will be provided, but the system must be flexible enough to enable changing or adding fields.

Access to the metadata is controlled through user roles (full administrator access, expert user and read-only user access). Values in the metadata fields will dictate external user access to the digital assets (e.g., whether that asset is under embargo). A unique identifier will be assigned to each digital asset and this will also be assigned to the associated metadata record. All changes to metadata records should be recorded, and this audit history should be easily viewable in perpetuity.
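For illustration, a search combining multiple metadata fields with Boolean operators and a date range could look like the in-memory sketch below; the field names and query form are assumptions rather than a specified API:

```python
from datetime import date

records = [
    {"guid": "a1", "institution": "NHMD", "collection": "Botany",
     "date_created": date(2023, 2, 1), "embargoed": False},
    {"guid": "b2", "institution": "NHMD", "collection": "Entomology",
     "date_created": date(2022, 11, 5), "embargoed": True},
]

# (institution == "NHMD") AND (NOT embargoed) AND (date_created within range)
start, end = date(2023, 1, 1), date(2023, 6, 30)
hits = [r for r in records
        if r["institution"] == "NHMD"
        and not r["embargoed"]
        and start <= r["date_created"] <= end]
```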

Metadata associated with new digital assets added to the storage areas will be derived from two sources: a data processing pipeline (essentially an ETL process) and Specify. For the data processing pipeline, the metadata is currently deposited in the storage areas with the digital assets in the form of individual JSON files. For Specify, the format of that data needs to be decided, but one solution would be to send the digital assets and associated metadata through the above data processing pipeline, so that the assets, their derivatives and the JSON files are deposited in the various storage compartments as well. However, we can imagine a scenario where the metadata is created directly in the registry and then undergoes a series of transformations (e.g., enrichment). As part of the process of adding new digital assets (and creating the associated metadata records), the original file names should be recorded in a field (for audit purposes); the assets and their metadata records should then be renamed with unique GUIDs and given persistent URLs incorporating the GUID (unique to each digital asset).
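For concreteness, the per-asset JSON metadata file deposited by the pipeline might look roughly like the sketch below; the field names are illustrative only, since the actual field list will be provided by the customer:

```python
import json

metadata_record = {
    "guid": "0f8fad5b-d9cb-469f-a165-70867728950e",
    "persistent_url": "https://assets.dassco.example.org/assets/"
                      "0f8fad5b-d9cb-469f-a165-70867728950e",
    "original_filename": "IMG_0001.CR2",  # recorded for audit before renaming
    "institution": "NHMD",
    "collection": "Entomology",
    "specimen_barcode": "NHMD-0012345",
    "date_created": "2023-03-15T10:42:00Z",
    "derived_from": None,                 # GUID of the parent asset, if any
    "embargo": None,
}

# Write the metadata file alongside the renamed asset in the storage area.
with open(f"{metadata_record['guid']}.json", "w", encoding="utf-8") as fh:
    json.dump(metadata_record, fh, indent=2)
```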

Whilst digital assets may be linked directly to external users or publishers, or via Specify, the metadata associated with them is for internal use and project reporting, as well as for facilitating synchronisation with Specify. It will also form part of the data pipeline for populating Specify with information about the assets.

A system should be put in place to back up the registry’s metadata records on a regular basis.

Requirements for synchronization with Specify (10% of task)

All digital assets added to the storage areas used by DaSSCo need to be synchronized with the individual DaSSCo institutions’ implementations of their specimen collections management system (Specify).

For digital assets which are derived from the mass digitisation data processing pipeline, a selection of the metadata (GUID, URL, date digital asset created, institution, collection, specimen barcode, title of image derived from a couple of metadata fields and image type) along with EXIF data embedded in the digital assets will be used to populate records in the correct installation of Specify (identified via the Institution name metadata field) during the ingest process. The ingest process will create attachment records in the attachments table and where the specimen barcode field is populated, the attachment record and media item will be linked to the specimen record (matched via the barcode and collection name). Where a specimen record with that barcode number does not exist, one will be created and linked to the attachment record and media item. Harvesting of digital assets and associated metadata should be scheduled.
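A minimal sketch of this linking logic is given below, using simple in-memory stand-ins for the Specify attachments and specimen tables; the names used here are illustrative, not Specify's actual schema:

```python
specimens = {}    # (collection, barcode) -> specimen record
attachments = []  # attachment records created during ingest


def ingest_asset(meta: dict) -> None:
    """Create an attachment record and link it to the matching specimen record
    (matched via barcode and collection), creating the specimen if no record
    with that barcode exists yet."""
    attachment = {"guid": meta["guid"], "url": meta["persistent_url"],
                  "title": meta.get("title")}
    attachments.append(attachment)

    barcode = meta.get("specimen_barcode")
    if not barcode:
        return  # no barcode: the attachment record is not linked to a specimen

    key = (meta["collection"], barcode)
    specimen = specimens.setdefault(
        key, {"collection": meta["collection"], "barcode": barcode, "attachments": []})
    specimen["attachments"].append(attachment["guid"])


ingest_asset({"guid": "a1",
              "persistent_url": "https://assets.dassco.example.org/assets/a1",
              "collection": "Entomology", "specimen_barcode": "NHMD-0012345"})
```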

For digital assets which are a result of a user uploading them directly to Specify, the digital asset(s) will be added to one of the storage areas and selected metadata (GUID, URL, date digital asset created, institution, collection, specimen barcode, image type, copyright owner, license, embargo/exception type, original file name and internal record number) will be harvested from Specify and used to create associated metadata records. After the digital asset has been renamed using a unique GUID and assigned a URL, this additional data should be back-populated into Specify so that the original filename assigned by Specify is overwritten (it is preserved in the metadata record in the registry for audit purposes). This whole process should be initiated when a user creates a new attachment record.

When a digital asset is downloaded by a user via Specify, Specify will call the asset server and will need to be redirected to the asset's storage location. This redirection process needs to be put in place.
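A minimal sketch of such a redirect, using Flask purely as an illustration since the brief does not specify the server technology:

```python
from flask import Flask, abort, redirect

app = Flask(__name__)

# Hypothetical lookup from asset GUID to its current storage location.
storage_locations = {
    "0f8fad5b-d9cb-469f-a165-70867728950e":
        "https://storage.dassco.example.org/fast/"
        "0f8fad5b-d9cb-469f-a165-70867728950e.jpg",
}


@app.route("/assets/<guid>")
def resolve_asset(guid: str):
    """Called when Specify (or a user) requests an asset; redirect to storage."""
    target = storage_locations.get(guid)
    if target is None:
        abort(404)
    return redirect(target)
```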

Finally, on establishment of the storage solution and registry, transfer of existing data needs to occur from the current web asset server.

Requirements for testing (10% of task)

Testing should be incorporated into the above tasks, working in an Agile manner; however, a period of testing the entire implementation, bug fixing and review is also anticipated. The customer will be involved in this prior to sign off.

Requirements for documentation and handover (10% of task)

Diagrammatic overviews should be provided for IT professionals to incorporate into larger architecture and systems, and should also be understandable by the Steering Group; they should include data models and data flows. Documentation (written and diagrammatic) should be sufficiently detailed for ongoing maintenance and modification. Major decisions agreed should be documented, along with any associated risks. Any program code and configuration files developed for the different tasks should also be delivered with a concise description of the different parts. The final handover should talk through the system implemented and suggest future development.

[Process diagram: overall DaSSCo data pipelines being developed, with the tasks covered by this document indicated in red.]