Islandora / documentation

Contains Islandora's documentation and main issue queue.

Export a repository object (node, media, files) #1096

Status: Open. Natkeeran opened this issue 5 years ago.

Natkeeran commented 5 years ago

We need to be able to export a digital repository object fully for various use cases, including migration and preservation (AIP/Bags).

Additional Info:

We probably need a method to ingest the exported object as well.

mjordan commented 5 years ago

If a command-line tool external to Drupal is sufficient, try https://github.com/mjordan/islandora_bagger.
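
For a quick illustration, a single node can be bagged from the command line roughly like this (the settings file and node ID below are placeholders; the same invocation appears in more detail later in this thread):

./bin/console app:islandora_bagger:create_bag --settings=sample_config.yml --node=112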

rangel35 commented 5 years ago

What we do at UT-Austin in I-7 is use the bagging feature so our users can request bags for preservation purposes and offsite vaulting. Bags under 2 GB are created via the interface; bags over 2 GB are queued for drush processing and bagged overnight.

We provide the ability to bag ALL datastreams and metadata of an object, and for paged content it will also bag the "pages" and their datastreams and metadata.

Our users have also requested the ability to bag selected datastreams.

Natkeeran commented 5 years ago

UTSC has a similar use case and workflow as noted by @rangel35. A MODS flag indicates which objects can be bagged. A report is generated with PIDs. The objects are exported via the command line using drush.

(We have considered adding a PREMIS event on bag creation, but as it seemed to complicate the workflow, we did not implement it.)

In Islandora 7.x we bag the full ATOM zip export (including versions) generated with the archive context. In one of the storage locations, we aim to validate the bags as well.

In 7.x, we run into problems exporting large objects or collections consistently, so the command line seems to work best.

Having the option to bag from the UI and the Islandora API would be nice, as we don't have a way to download a whole object right now.

Also, it would be ideal to have an option to ingest from a bag or another export format.

mjordan commented 5 years ago

Some preliminary thoughts on a Bagging microservice:

  1. Take something like Islandora Bagger and put a REST interface on top of it.
  2. From within Islandora 8, a user chooses to Bag an object via the GUI, which POSTs a message to the microservice containing the node ID; the microservice then creates the Bag as Islandora Bagger does now, by fetching the various files, metadata, etc. from Islandora via Islandora's REST interface. The module running in Drupal doesn't do any bagging; it just sends the request to create a Bag (and maybe exposes the results of the Bagging process back to the user, see next point).
  3. On successful creation of the Bag, the microservice sends an email to the user containing the URL of the Bag to download (or some indication of where the bag can be found); or alternatively, the new Bag's URL is provided via the microservice's REST interface so it can show up in a Drupal View, etc.
  4. The microservice would retain its command-line UI so it can be incorporated into automation scripts, etc.

Having a microservice separate from Drupal do the bagging would allow the jobs to run as long as they needed to, eliminating the risk of timing out in front of the user because the bagging is done asynchronously. We'd need to figure out how to allow for different Bag options, but those could possibly be sent in the body of the REST POST request, as in the sketch below.
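
As a rough sketch of point 2 above, and assuming an interface along the lines of what Islandora Bagger later exposes (see the curl example further down this thread), the request the Drupal module sends might look something like this, with the Bag options carried in the request body (the endpoint, header name, options file, and node ID here are illustrative only):

curl -X POST -H "Islandora-Node-ID: 42" --data-binary "@bag_options.yml" http://bagger.example.com:8001/api/createbag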

@Natkeeran with regard to ingest from a Bag, that is something that users have been asking for for a while. But, with Islandora 8's nice REST interfaces, we can probably figure out how to map the contents of a Bagged object back to the originating components of the node+media fairly easily and push it into Islandora using something like https://github.com/mjordan/claw_rest_ingester. I think using URIs to define what taxonomy terms should be assigned to the reingested object would be useful here as well.
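
As a very rough sketch of the reingest side, and not the actual claw_rest_ingester interface: assuming Drupal's core JSON:API and Basic Auth modules are enabled and the content type's machine name is islandora_object (all assumptions), recreating the node could start with something like the following, with media, files, and taxonomy term URIs attached in subsequent requests:

curl -u admin:password -X POST -H "Content-Type: application/vnd.api+json" -H "Accept: application/vnd.api+json" --data '{"data": {"type": "node--islandora_object", "attributes": {"title": "Reingested object"}}}' http://localhost:8000/jsonapi/node/islandora_object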

dannylamb commented 5 years ago

@mjordan @Natkeeran I would love to see bags (or zipped bags, really) become the new zip importer format. I don't know how feasible that is given how widely bags can vary, but it makes sense to move away from a bespoke format to a more widely adopted one.

Natkeeran commented 5 years ago

@mjordan @dannylamb

The feature set for the microservice looks good. We can extend it later on the Drupal side to add a flag and a queue/cron mechanism.

Ingest would be a neat addition, with use cases such as restore from backup, migration, and batch ingest from zip. Ingest from a zip could, in theory, also be seen as bootstrapping Drupal from Fedora.

Some points to consider:

mjordan commented 5 years ago

@Natkeeran yes, those are all significant issues, but I see them as out of scope for the BagIt functionality. They are more data modeling issues, aren't they?

@dannylamb couldn't agree more. Even if an institution hasn't adopted BagIt widely, the tooling is decent, and it is always easier to convert from a standard format than from a bespoke one, especially from a long-term preservation perspective (e.g. when the platform tied to the bespoke format hasn't been in use for 20 years...).

rangel35 commented 5 years ago

I-8 creates a UUID; couldn't we use that as the PID? Or are you thinking more along the lines of a standard namespace-style PID?

mjordan commented 5 years ago

In order for the creation of Bags to be truly decoupled from the Drupal module POSTing the request, we either need to issue the request using an asynchronous Guzzle call or an asynchronous JavaScript request, or do something on the microservice side that collects node IDs in a file and then processes them as a batched cron job.

One advantage of the batch approach is that since the bagger would be running in a CLI environment, it wouldn't time out like it would if the bags were generated within an HTTP response.

mjordan commented 5 years ago

Did some work on Islandora Bagger over the weekend. It now has a REST API that lets you add a node ID and settings file to a queue. It also has a simple FIFO queue manager and a console command to process the queue. The original CLI create_bag command still works as it used to.

The README explains how it works: as POST requests like this come in:

curl -v -X POST -H "Islandora-Node-ID: 4" --data-binary "@sample_config.yml" http://127.0.0.1:8001/api/createbag

each request's node ID is added to the queue, along with the path to the settings YAML file (whose content is the body of the request). In a cron job, you would run the following to process the queue:

./bin/console app:islandora_bagger:process_queue --queue var/islandora_bagger.queue

which loops through the queue and runs the create_bag CLI command (it does this using internal Symfony methods):

./bin/console app:islandora_bagger:create_bag --settings=sample_config.yml --node=112
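
So, assuming a checkout at /opt/islandora_bagger, the nightly crontab entry might look something like this (the path and schedule are placeholders):

0 2 * * * cd /opt/islandora_bagger && ./bin/console app:islandora_bagger:process_queue --queue var/islandora_bagger.queue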

rosiel commented 4 years ago

The Robertson Library's RDM project uses Mark Jordan's Islandora Bagger and integration module.

We have a BagIt Ansible role that installs our forks of islandora_bagger and islandora_bagger_integration.