DILCISBoard / E-ARK-SIP

E-ARK SIP specification
https://earksip.dilcis.eu/
Creative Commons Attribution 4.0 International

Support for Shallow IPs #110

Open jmaferreira opened 2 years ago

jmaferreira commented 2 years ago

Feature request:

Add the possibility of building a SIP (or any other IP for that matter) that is able to transport the information about the content files, but not the files themselves, i.e. it will only carry pointers to the files.

In theory this is already possible without breaking the current implementation, but it would be nice to add a section to the spec explaining how to accomplish this.

Comment from @karinbredenberg: This information should live on the guidelines and not on the spec (see https://guides.dilcis.eu/guideline/Guideline_IP_and_more_v1_0_0.pdf)

jmaferreira commented 1 year ago

According to @karinbredenberg this request violates one of the principles of the CSIP; however, there are several use cases where this approach is quite useful.

@shsdev and @luis100 are able to provide some real world use cases.

I would like to bring this discussion to the next DILCIS Board meeting for approval. If this feature is approved, KEEP SOLUTIONS can describe its implementation approach.

luis100 commented 1 year ago

One of the use cases is about I/O logistics and performance: the unorganized data and the archive are in the same infrastructure, and the pre-ingest must organize the data into SIPs and submit them to the archive. The process involves copying the data into an E-ARK SIP, submitting it to the archive (which creates E-ARK AIPs), ensuring the ingest was successful, removing the E-ARK SIP and then removing the unorganized data, while also coping with ingest failures and SIP re-generation. All of this creates issues in terms of the storage space needed (up to 3x the original data amount, counting the unorganized, SIP and AIP copies) and the I/O operations needed to move data around. Shallow SIPs can greatly simplify the process: the SIPs only select and organize the data, and the ingest process copies it from the original location directly into the E-ARK AIP. The same situation can arise when mass-exporting data from storage systems like Amazon S3 or OpenStack Swift.
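To make the idea concrete, here is a rough Python sketch of how a pre-ingest tool might describe files in place instead of copying them. It is only an illustration under stated assumptions, not the KEEP SOLUTIONS implementation: the function names, the `file-N` IDs and the `USE` value are hypothetical, and a real CSIP file entry needs further attributes (for example `MIMETYPE` and `CREATED`) that are left out here.

```python
import hashlib
import xml.etree.ElementTree as ET
from pathlib import Path

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"
ET.register_namespace("mets", METS_NS)
ET.register_namespace("xlink", XLINK_NS)


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files are never loaded into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def shallow_file_group(source_dir: Path) -> ET.Element:
    """Build a METS fileGrp whose FLocat entries point at the original files.

    Nothing is copied: each file is read once to compute its checksum and
    is then referenced by an absolute file:// URL.
    """
    # Illustrative USE value (assumption), not taken from this issue.
    file_grp = ET.Element(f"{{{METS_NS}}}fileGrp", {"USE": "Representations/rep1/data"})
    files = sorted(p for p in source_dir.rglob("*") if p.is_file())
    for index, path in enumerate(files):
        file_el = ET.SubElement(
            file_grp,
            f"{{{METS_NS}}}file",
            {
                "ID": f"file-{index}",  # hypothetical ID scheme
                "SIZE": str(path.stat().st_size),
                "CHECKSUM": sha256_of(path),
                "CHECKSUMTYPE": "SHA-256",
            },
        )
        ET.SubElement(
            file_el,
            f"{{{METS_NS}}}FLocat",
            {
                "LOCTYPE": "URL",
                f"{{{XLINK_NS}}}type": "simple",
                f"{{{XLINK_NS}}}href": path.resolve().as_uri(),
            },
        )
    return file_grp
```

Calling `ET.tostring(shallow_file_group(Path("/data/unorganized")), encoding="unicode")` (the path is a placeholder) would give the file section fragment to embed in the package METS. The only I/O on the content is one read pass for the checksum, which is the saving the use case is after.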

Another use case is when Shallow SIPs are combined with Shallow AIPs. Using a similar technique, an AIP can also refer to remote files. Provided that the OAIS repository is able to retrieve a file whenever it needs it, for access or preservation operations, the ability to refer to external content can greatly reduce the amount of local resources needed to manage an OAIS system, or can allow modern storage systems like Amazon S3 or OpenStack Swift to be used as a backend for storing E-ARK AIPs. It can also reduce the overall amount of storage an institution needs when the content is both in the production system and in the OAIS archive. For example, a TV broadcast station has an archive that requires a large amount of storage space, with backups and remote replicas already part of the system, and would like to incorporate an OAIS archive. Duplicating storage is not an option, nor is transferring the content into the OAIS archive and retrofitting the production system to use it. The option would be for the OAIS archive to refer to the content in the production system: shallow SIPs are created and submitted to the OAIS archive, which needs to be able to access the content and run ingest workflow validations on the remote data. Preservation actions like fixity checking, file format identification, file format validation and file format conversion could be done using the remote data as input, and every outcome would become a local file (including the outcome of file format conversion actions). Here we recommend keeping all metadata (descriptive, preservation, other) local to the AIPs, but allowing representation data to be remote.

In terms of changes, we suggest a change in the CSIP (so it would affect SIP, AIP and DIP) to allow representation data to be a non-local URL. Support for specific URL protocols might need to be added to E-ARK IP validators to ensure they can calculate and verify checksums.
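As a rough idea of what the validator side could look like, the Python sketch below streams the referenced file and compares its digest with the value recorded in the METS. It is a minimal sketch under assumptions, not the behaviour of any existing E-ARK validator: the function names are hypothetical, and it relies on the standard library `urllib.request.urlopen`, which handles `http`, `https` and `file` URLs. Schemes such as `s3://` would need a dedicated client.

```python
import hashlib
from urllib.request import urlopen


def checksum_of_url(url: str, algorithm: str = "sha256", chunk_size: int = 1 << 20) -> str:
    """Stream the resource behind `url` and return its hex digest."""
    digest = hashlib.new(algorithm)
    with urlopen(url) as response:
        while True:
            chunk = response.read(chunk_size)
            if not chunk:
                break
            digest.update(chunk)
    return digest.hexdigest()


def verify_remote_file(url: str, expected_checksum: str, checksum_type: str = "SHA-256") -> bool:
    """Compare a remote file against the CHECKSUM / CHECKSUMTYPE pair from METS.

    METS spells algorithms like "SHA-256"; hashlib expects "sha256",
    so the hyphen is dropped and the name lowercased.
    """
    algorithm = checksum_type.replace("-", "").lower()
    return checksum_of_url(url, algorithm) == expected_checksum.lower()
```

A validator would call `verify_remote_file(href, checksum, checksum_type)` for each `FLocat` whose `xlink:href` is not a package-relative path. Streaming in chunks keeps memory use flat, but the full content still has to cross the network once per fixity check.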

jmaferreira commented 1 year ago

During the DILCIS Board meeting (2022-12-15) it was identified that this approach violates the following CSIP principles:

Principle 3.2: The Information Package SHOULD ensure that data and metadata are physically separated from one another.

In addition to the logical separation of components, it is beneficial to have data and metadata physically separated (i.e. formatted as individual computer files or clearly separated bitstreams). This allows digital preservation tools and systems to update respective data or metadata portions of an Information Package without endangering the integrity of the whole package.

Decision:

jmaferreira commented 8 months ago

@shsdev, @luis100 Are there any more examples of the user community endorsing this feature request?

luis100 commented 8 months ago

This strategy is currently in use by the Portuguese National Archives for the Distributed Digital Preservation service.

When this was presented, there were comments from some E-ARK partners that this strategy could help with large migration projects, especially when we need to wrap files in an E-ARK SIP to be able to send them to an E-ARK compatible archive. Requiring files to be within the SIP temporarily duplicates the amount of storage and requires a lot of I/O, which may be unnecessary.

A similar strategy is also used in SIARD to segment SIARD files, for transfer but also for archiving, for example by the Danish National Archives. Although this does not solve the issue completely for SIARD, as a lot of data may not be in LOBs, it may solve the issue for SIPs, as we do not expect much data except for representation data files.

shsdev commented 8 months ago

I know about a case where a community member needs to keep large amounts of image data, and they are using S3 to store them. Packaging the data would create a lot of redundancy, therefore they would be interested in having this feature for specific use cases.

jmaferreira commented 8 months ago

Thanks for the examples @shsdev and @luis100.

I'm now re-reading Principle 3.2 and it says: "The Information Package SHOULD ensure that data and metadata are physically separated from one another."

Well, the suggested approach for the shallow IP not only does not violate this principle, it actually strengthens it, as the metadata will definitely be physically separated from the data.

@karinbredenberg I believe we should add this discussion to the next DILCIS Board again.

karinbredenberg commented 8 months ago

The agenda for the next meeting is already full, so it needs to be on the one after that, @jmaferreira

karinbredenberg commented 5 months ago

Following the decision at the DILCIS Board meeting on the 6th of February, a working group has been created. See: https://github.com/DILCISBoard/GroupDocumentation/blob/master/MeetingNotes/2024/20240206%20DILCIS%20Board%20and%20EARK%20CSP%20CORE11%20and%20others.md