Closed geo-mac closed 5 years ago
The plugin configuration file will include a list of metadata elements whose change would trigger a DP export. For example:
    # Each entry lists metadata fields whose change triggers a DP export.
    $c->{DPExport} =
    {
        trigger_fields => [
            { meta_fields => [ "title" ] },
            { meta_fields => [ "creators_name" ] },
            { meta_fields => [ "abstract" ] },
            { meta_fields => [ "date" ] },
        ],
    };
If no trigger fields are defined, no DPExport will be triggered.
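As a rough illustration of how such a check might behave, here is a minimal standalone sketch. The helper name `needs_dp_export` and the plain hashes standing in for the old and new metadata are assumptions made for this example, not part of the plugin, and field values are treated as simple strings (a real multi-valued field such as creators_name would need a deeper comparison).

```perl
use strict;
use warnings;

# Same shape as the configuration example above.
my $c = {
    DPExport => {
        trigger_fields => [
            { meta_fields => ["title"] },
            { meta_fields => ["creators_name"] },
            { meta_fields => ["abstract"] },
            { meta_fields => ["date"] },
        ],
    },
};

# Hypothetical helper: returns 1 if any configured trigger field differs
# between the old and new metadata hashes, 0 otherwise.
# Assumes every field value is a plain string.
sub needs_dp_export {
    my ( $conf, $old, $new ) = @_;
    my $triggers = $conf->{DPExport}{trigger_fields} || [];
    for my $trigger (@$triggers) {
        for my $field ( @{ $trigger->{meta_fields} || [] } ) {
            my $before = defined $old->{$field} ? $old->{$field} : "";
            my $after  = defined $new->{$field} ? $new->{$field} : "";
            return 1 if $before ne $after;
        }
    }
    return 0;    # no trigger fields defined, or none of them changed
}

# "note" is not a trigger field, so only the title change matters here.
my %old = ( title => "Draft title", abstract => "Text", note => "a" );
my %new = ( title => "Final title", abstract => "Text", note => "b" );
print needs_dp_export( $c, \%old, \%new ) ? "export\n" : "skip\n";
```

With an empty or missing trigger_fields list the helper always returns 0, which matches the rule that no export is triggered when no trigger fields are defined.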
In addition to this, there should be a command-line bin script that exports either the entire "live" archive dataset or a given list of eprint IDs.
This looks good, @photomedia!
Perhaps we should have a similar set of trigger fields for the possible actions that could happen with files (e.g., a new file being uploaded or an existing file being deleted)? Per @geo-mac's comment above, it seems like it's not safe to assume that these should always trigger an export, at least outside of the Concordia use case. Making this a bit more configurable could also help with setting up alerts for deletions, so that we can do our due diligence to make sure any already-stored AIPs are also deleted when appropriate.
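One way this could look in the configuration file is sketched below. The `trigger_file_events` key and its event names are purely hypothetical, invented here to illustrate the idea; they are not existing plugin options.

```perl
# Hypothetical extension of the DPExport configuration. The
# trigger_file_events key and its event names are illustrative assumptions.
$c->{DPExport} =
{
    trigger_fields => [
        { meta_fields => [ "title" ] },
    ],
    trigger_file_events => {
        file_added    => 1,    # export when a new file is uploaded
        file_modified => 1,    # export when an existing file changes
        file_deleted  => 0,    # no export; flag for a manual AIP review instead
    },
};
```

Keeping each file event as a simple 0/1 toggle would let each institution decide for itself which file actions warrant an export.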
Maybe a useful question: what actions (if any) should always trigger an export across our different institutional use cases?
For files, we have events that are handled by the Indexer, for example Event::FilesModified, triggered whenever a document is changed.
I think that the universal case is when an item is moved from buffer to live archive. The
With your preservation triggers, are you expecting items to be archived individually, or are you also thinking of time bounding?
I ask because archiving individual 2 MB PDFs would be positively wasteful, whereas if we can archive a day's worth of changes at once, the archives become far more manageable.
Being able to do time-based archives (everything changed since X) would be useful, for both metadata-only and full exports, especially when paired with other triggers.
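The "everything changed since X" selection could be sketched roughly as follows. The record list, the field names, and the helper `changed_since` are assumptions made for this example; a real implementation would query the repository rather than an in-memory list.

```perl
use strict;
use warnings;

# Hypothetical records: an eprint ID plus a last-modified timestamp in
# "YYYY-MM-DD HH:MM:SS" form, which sorts correctly as a plain string.
my @eprints = (
    { eprintid => 101, lastmod => "2020-03-01 09:15:00" },
    { eprintid => 102, lastmod => "2020-03-02 14:30:00" },
    { eprintid => 103, lastmod => "2020-03-03 08:00:00" },
);

# Hypothetical helper: IDs of all records modified at or after $since.
sub changed_since {
    my ( $since, @records ) = @_;
    return map { $_->{eprintid} } grep { $_->{lastmod} ge $since } @records;
}

# Everything changed since the start of 2 March goes into one batch.
my @batch = changed_since( "2020-03-02 00:00:00", @eprints );
print join( ",", @batch ), "\n";    # prints "102,103"
```

The string comparison only works because all timestamps share the same fixed-width format; mixed formats would need proper date parsing.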
So far we have been planning for each eprint to be archived independently. We discussed batch archives as well, but ultimately were convinced that for us, tracking, retrieval, deletions, and other management activities would be much easier if each eprint corresponded to 1 (or more, if versioned) discrete AIP.
It's also worth stating that we're anticipating a large batch ingest to Archivematica at the beginning, and then anticipate that the level of activity in our EPrints repo will not be so high as to make these individual preservation exports onerous. I could certainly see how in a higher-traffic environment alternative approaches might be better.
Interesting. We have looked at using Archivematica for individual items and felt the overhead was too great, but, as you note, differing workloads lead to differing conclusions.
I think that these are two separate design decisions:
(1) the mapping between an AIP in Archivematica and an eprint. (2) the question of timing, i.e., what triggers the archiving operations.
As for (1), we decided on a one-to-one mapping: each eprint becomes one AIP.
However, this open issue is about triggers, so it is about (2). For this, I think we should have a robust configuration file that allows for many different options and takes into account the many different repository workflows. This configuration file should include the earlier mentioned trigger_fields, whose change flags an eprint as being in need of preservation, but in addition to that, there should be a set of options for configuring when the preservation actions take place. These options should allow an administrator to decide if they want the preservation actions to take place:
I added a section on preservation triggers to the readme in https://github.com/eprintsug/EPrintsArchivematica/commit/428b1345c2d553e08234555bdc539ddeef71da04 and https://github.com/eprintsug/EPrintsArchivematica/commit/4364615ad50ad0e33569f0b5887e60045f08da37
The nature of the trigger for the "Digital Preservation Export" is something requiring control, and some great ideas have been raised by @photomedia. This feature could be satisfied via:
1. Enabling configuration of the rules governing the trigger, as proposed in a separate forum by @photomedia & @timothyryanwalsh. This entails a DP Export only being triggered when certain thresholds are satisfied, such as when specified metadata elements are updated and (obviously) when file(s) are updated. Simply being able to toggle 0/1 on an agreed set of metadata elements would probably suffice.
2. Enabling configuration of the user permissions governing the trigger, to control for different repository set-ups. I think this is probably necessary for repositories with less mediation in their deposit workflows (e.g. do you really want the majority of users triggering certain preservation actions, even if they have satisfied rules in #1, above? Maybe, but also maybe not...).
Similarly, those repositories supporting machine interactions with a CRIS (which is a growing use case) will have less control over when actions are performed on files and metadata; this includes us. Without control over this, DP Exports could be triggered unnecessarily and repeatedly by software agents, with negative consequences for both EPrints and Archivematica.
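To make the second point and the CRIS concern more concrete, a configuration along the following lines might be one option. Both keys are hypothetical and exist only for the sake of this sketch; the user type names follow the usual EPrints "editor" and "admin" types.

```perl
# Hypothetical options; neither key exists in the plugin as discussed here.

# Only changes made by these user types may trigger a DP Export.
$c->{DPExport}->{trigger_user_types} = [ "editor", "admin" ];

# Minimum number of seconds between two exports of the same eprint, so that
# repeated updates from a software agent do not cause repeated exports.
$c->{DPExport}->{min_export_interval} = 3600;
```

A minimum interval of this kind would be one way to limit repeated triggering by software agents, though a scheduled batch run could serve the same purpose.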