Open Natkeeran opened 5 years ago
If a command-line tool external to Drupal is sufficient, try https://github.com/mjordan/islandora_bagger.
What we do at UT-Austin in I-7 is use the bagging feature so our users can request bags for preservation purposes and offsite vaulting. Bags under 2 GB are created via the interface; bags over 2 GB are queued for drush processing and bagged overnight.
We provide the ability to bag ALL datastreams and metadata of the object, and for paged content it will also bag the "pages" and their datastreams and metadata.
Our users have also requested the ability to bag selected datastreams.
UTSC has a similar use case and workflow as noted by @rangel35. A MODS flag indicates which objects can be bagged. A report is generated with PIDs. The objects are exported via the command line using drush.
(We have considered adding a PREMIS event on bag creation. As it seemed to complicate the workflow, we did not implement that.)
In Islandora 7.x we bag the full atom zip (including versions) with archive context. In one of the storage locations, we aim to validate the bags as well.
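For the validation step mentioned above, here is a minimal sketch of a fixity check, assuming a bag laid out per the BagIt convention (a `manifest-sha256.txt` whose lines are `<checksum> <relative path>`). The function name `verify_manifest` and the tiny demo bag are hypothetical. This is not a full BagIt validator: it ignores tag manifests, `bagit.txt`, and payload files missing from the manifest.

```python
import hashlib
import os
import tempfile


def verify_manifest(bag_dir, algorithm="sha256"):
    """Return the payload paths that are missing or whose checksum differs."""
    failures = []
    manifest = os.path.join(bag_dir, f"manifest-{algorithm}.txt")
    with open(manifest, encoding="utf-8") as fh:
        for line in fh:
            if not line.strip():
                continue
            # Each manifest line is "<checksum> <path relative to the bag root>".
            expected, rel_path = line.strip().split(maxsplit=1)
            file_path = os.path.join(bag_dir, rel_path)
            if not os.path.isfile(file_path):
                failures.append(rel_path)
                continue
            digest = hashlib.new(algorithm)
            with open(file_path, "rb") as payload:
                # Read in chunks so large payload files do not fill memory.
                for chunk in iter(lambda: payload.read(65536), b""):
                    digest.update(chunk)
            if digest.hexdigest() != expected.lower():
                failures.append(rel_path)
    return failures


# Build a throwaway one-file bag (hypothetical content) to exercise the check.
bag = tempfile.mkdtemp()
os.makedirs(os.path.join(bag, "data"))
payload_path = os.path.join(bag, "data", "obj.txt")
with open(payload_path, "wb") as fh:
    fh.write(b"hello")
with open(os.path.join(bag, "manifest-sha256.txt"), "w", encoding="utf-8") as fh:
    fh.write(hashlib.sha256(b"hello").hexdigest() + "  data/obj.txt\n")

clean = verify_manifest(bag)      # nothing to report for an intact bag

with open(payload_path, "wb") as fh:
    fh.write(b"tampered")
damaged = verify_manifest(bag)    # the altered file is reported
```

In a storage location, a check like this could run on a schedule and report any non-empty result.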
In 7.x, we run into problems exporting large objects or collections consistently, thus command line seems to work best.
Having the option to bag from UI and Islandora API is nice to have as we don't have a way to download the whole object right now.
Also, it would be ideal to have an option to ingest from a bag or another export format.
Some preliminary thoughts on a Bagging microservice:
Having a microservice separate from Drupal do the bagging would allow the jobs to run as long as they needed to, eliminating the risk of timing out in front of the user because the bagging is done asynchronously. We'd need to figure out how to allow for different Bag options, but those could possibly be sent as the REST POST request's body or something.
@Natkeeran with regard to ingest from a Bag, that is something that users have been asking for for a while. But, with Islandora 8's nice REST interfaces, we can probably figure out how to map the contents of a Bagged object back to the originating components of the node+media fairly easily and push it into Islandora using something like https://github.com/mjordan/claw_rest_ingester. I think using URIs to define what taxonomy terms should be assigned to the reingested object would be useful here as well.
@mjordan @Natkeeran I would love to see bags (or zipped bags, really), be the new zip importer format. I don't know how possible that is given how widely bags can vary, but it makes sense to move away from a bespoke format to a more widely adopted one.
@mjordan @dannylamb
The feature set for the microservice looks good. We can extend it later on the Drupal side to have a flag and a queue/cron mechanism.
Ingest would be a neat addition, with use cases such as restore from backup, migration and batch ingest from zip. Having ingest from zip can theoretically be seen as bootstrapping Drupal from Fedora as well.
Some points to consider:
Exporting and importing the full graph is the major challenge. For example, a person has relationships to other people. I don't know enough graph theory to determine how to find the full graph, and how to avoid circular loops!
The second related challenge is persistent identifiers. What is our PID? If the Drupal nid or taxonomy ID is the PID, then does Drupal allow us to set a PID? Do we want to support the use case where people install Islandora 8 in an existing instance of Drupal? Do we have a persistent ID in Fedora?
Does the Fedora or Drupal representation provide a logical representation of the full repository object, similar to FOXML in 7.x? Maybe via the Portland Common Data Model? Though this adds a level of complexity, do we need such a representation (e.g., METS) for preservation (OAIS AIP compliance) purposes?
We should be clear about how we are handling conceptual entities (e.g., books, compound objects).
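On the circular-loop concern in the first point above: the usual technique is a graph traversal that remembers which entities it has already seen. The sketch below is a generic breadth-first traversal, not anything taken from Islandora; the `collect_related` function, the `get_related` callback, and the sample identifiers are all hypothetical stand-ins for however related nodes would actually be looked up.

```python
from collections import deque


def collect_related(start, get_related):
    """Return every entity reachable from `start`, visiting each exactly once."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        current = queue.popleft()
        order.append(current)
        for neighbour in get_related(current):
            # Skipping entities we have already seen is what breaks cycles.
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return order


# Hypothetical relationships, including two cycles.
relations = {
    "person:1": ["person:2", "book:7"],
    "person:2": ["person:1"],   # points back at person:1
    "book:7": ["page:7-1"],
    "page:7-1": ["book:7"],     # points back at its book
}

export_set = collect_related("person:1", lambda entity: relations.get(entity, []))
```

The traversal terminates even though the data contains loops. The harder question of where to stop (which relationships belong to "the object") remains a modeling decision.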
@Natkeeran yes, those are all significant issues, but I see them as out of scope for the BagIt functionality. They are more data modeling issues, aren't they?
@dannylamb couldn't agree more. Even if an institution hasn't adopted BagIt widely, the tooling is decent and it is always easier to convert from a standard format than from a bespoke one, especially from a long-term preservation perspective (e.g. the platform tied to the bespoke format hasn't been in use in 20 years....).
I-8 creates a UUID; couldn't we use that as the PID? Or are you thinking more along the lines of the standard namespace-type PID?
In order for the creation of Bags to be truly decoupled from the Drupal module POSTing the request, we either need to issue the request using an asynchronous Guzzle call or an asynchronous JavaScript request, or do something on the microservice side that collects node IDs in a file and then runs as a batched cron job.
One advantage of the batch approach is that since the bagger would be running in a CLI environment, it wouldn't time out like it would if the bags were generated within an HTTP response.
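A minimal sketch of the batch approach described above, assuming a plain text file with one node ID per line serves as the queue. The names `enqueue` and `process_queue`, the queue file name, and the `make_bag` callback are hypothetical; this is not how Islandora Bagger itself is implemented.

```python
import os
import tempfile


def enqueue(queue_path, node_id):
    """Append one node ID to the queue file (would be called per incoming request)."""
    with open(queue_path, "a", encoding="utf-8") as fh:
        fh.write(f"{node_id}\n")


def process_queue(queue_path, make_bag):
    """Bag every queued node in FIFO order, then empty the queue (would be called from cron)."""
    if not os.path.exists(queue_path):
        return []
    with open(queue_path, encoding="utf-8") as fh:
        node_ids = [line.strip() for line in fh if line.strip()]
    for node_id in node_ids:
        make_bag(node_id)
    # Truncate the file once the batch is done.
    open(queue_path, "w", encoding="utf-8").close()
    return node_ids


# Demo with a temporary queue file and a stand-in for the real bagging step.
queue_file = os.path.join(tempfile.mkdtemp(), "bagger.queue")
enqueue(queue_file, 4)
enqueue(queue_file, 112)

bagged = []
processed = process_queue(queue_file, bagged.append)
```

A real implementation would need file locking or an atomic rename, since a request arriving between the read and the truncate would be lost in this sketch.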
Did some work on Islandora Bagger over the weekend. It now has a REST API that lets you add a node ID and settings file to a queue. It also has a simple FIFO queue manager, and a console command to process the queue. The original CLI create_bag command still works as it used to.
The README explains how it works: as POST requests like this come in:
curl -v -X POST -H "Islandora-Node-ID: 4" --data-binary "@sample_config.yml" http://127.0.0.1:8001/api/createbag
each request's node ID is added to the queue, along with the path to the settings YAML file (which is the body of the request). In a cronjob, you would run the following to process the queue:
./bin/console app:islandora_bagger:process_queue --queue var/islandora_bagger.queue
which loops through the queue and runs the create_bag CLI command (it does this using internal Symfony methods):
./bin/console app:islandora_bagger:create_bag --settings=sample_config.yml --node=112
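For a client that is not curl, the same request can be built from any HTTP library. The sketch below uses Python's standard `urllib` and mirrors the curl example above: the `Islandora-Node-ID` header and the endpoint URL come from that example, while the helper name `build_createbag_request` and the placeholder settings string are assumptions (a real request body would be the contents of a settings file such as `sample_config.yml`).

```python
import urllib.request


def build_createbag_request(node_id, settings_yaml,
                            endpoint="http://127.0.0.1:8001/api/createbag"):
    """Build (but do not send) a POST request that queues one node for bagging."""
    return urllib.request.Request(
        endpoint,
        data=settings_yaml.encode("utf-8"),          # request body: the settings YAML
        headers={"Islandora-Node-ID": str(node_id)},  # node to bag, as in the curl example
        method="POST",
    )


# Placeholder body only; it is not a valid Islandora Bagger settings file.
request = build_createbag_request(4, "# settings YAML goes here\n")

# To actually send it against a running service:
# with urllib.request.urlopen(request) as response:
#     print(response.status)
```

Building the request separately from sending it keeps the example runnable without a live service.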
The Robertson Library's RDM project uses Mark Jordan's Islandora Bagger and integration module.
We have a BagIt ansible role which installs our fork of islandora_bagger and of islandora_bagger_integration.
We need to be able to export a digital repository object fully for various use cases, including migration and preservation (AIP/Bags).
Additional Info:
We probably need a method to ingest the exported object as well.