Islandora / documentation

Contains islandora's documentation and main issue queue.

Bulk Ingest #244

Closed ruebot closed 8 years ago

ruebot commented 8 years ago

Issue by daniel-dgi Tuesday Feb 03, 2015 at 15:31 GMT Originally opened as https://github.com/islandora-interest-groups/Islandora-Fedora4-Interest-Group/issues/13


Reformatting this to use the Use Case template.

Title (Goal) Bulk Ingest Objects into Fedora
Primary Actor Repository architect, implementer
Scope Architecture
Level Low
Story As a repository architect, I want to be able to ingest large numbers of files into Fedora using as little programming as possible

Remarks:

ruebot commented 8 years ago

Comment by ruebot Tuesday Feb 03, 2015 at 15:33 GMT


I really like the idea of using a manifest, and I think it would be great if we stuck with a directory convention like BagIt. That would take care of the manifest, and a user could verify checksums as files are ingested.
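To make the checksum idea concrete, here is a minimal sketch of verifying a bag's payload against its manifest. It assumes the usual BagIt layout (a `manifest-<algorithm>.txt` file at the bag root, one `checksum path` pair per line, paths relative to the bag root). The function name `verify_bag` is made up for illustration and is not part of any Islandora or Fedora tool.

```python
import hashlib
from pathlib import Path


def verify_bag(bag_dir, algorithm="sha256"):
    """Return the payload paths that are missing or fail their checksum."""
    bag = Path(bag_dir)
    manifest = bag / f"manifest-{algorithm}.txt"
    failures = []
    for line in manifest.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue  # skip blank lines
        # Each manifest line is "<checksum> <path relative to bag root>".
        expected, rel_path = line.split(None, 1)
        rel_path = rel_path.strip()
        target = bag / rel_path
        if not target.is_file():
            failures.append(rel_path)
            continue
        actual = hashlib.new(algorithm, target.read_bytes()).hexdigest()
        if actual != expected.lower():
            failures.append(rel_path)
    return failures
```

An ingest step could call `verify_bag(path)` first and refuse to continue if the returned list is not empty.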

ruebot commented 8 years ago

Comment by mjordan Tuesday Feb 03, 2015 at 16:20 GMT


Of course I'd vote for BagIt, for the reasons @ruebot mentions. But, I'd be cautious about requiring it since not all sites will have or want to convert their stuff to Bags. Then again, if we're going to require a manifest, requiring BagIt is not all that different.

ruebot commented 8 years ago

Comment by daniel-dgi Tuesday Feb 03, 2015 at 18:04 GMT


I'm not terribly familiar with BagIt. It's not something that I've dealt with in my work for clients. But at first glance it seems pretty appropriate.

METS is another option, I guess. Or we could just use a simple JSON or YAML manifest, but something tells me an actual metadata standard would make people feel better about things.

Other than BagIt (which I'm assuming contains all the data in one package), we could probably get away with just dropping the manifest in the watch folder, so long as it details the location of files and the user running the camel process has access to those locations.
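As a rough sketch of that last idea: the process watching the folder could read the manifest and confirm it can actually read every file listed before starting an ingest. The JSON layout here (a top-level `files` list whose entries have a `path` key) and the name `unreadable_files` are assumptions for illustration only; nothing in this thread settles on a manifest schema.

```python
import json
import os


def unreadable_files(manifest_path):
    """Return the listed paths that the current user cannot read."""
    with open(manifest_path, encoding="utf-8") as fh:
        manifest = json.load(fh)
    # Hypothetical schema: {"files": [{"path": "/some/location/file.tif"}, ...]}
    return [
        entry["path"]
        for entry in manifest["files"]
        if not os.access(entry["path"], os.R_OK)
    ]
```

An empty list would mean the user running the process can reach everything the manifest points to.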

ruebot commented 8 years ago

Comment by awoods Tuesday Feb 03, 2015 at 18:50 GMT


@daniel-dgi, "holey" bags are also an option if not all of the data is available in the package, with the optional fetch.txt file. See: http://tools.ietf.org/html/draft-kunze-bagit-06#section-2.2.3
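For anyone unfamiliar with it, each line of fetch.txt gives a URL, a length in octets (or `-` if unknown), and the path the file should have inside the bag, separated by whitespace. A small sketch of reading it follows; the function name `parse_fetch_txt` is made up for this example.

```python
def parse_fetch_txt(text):
    """Parse fetch.txt content into (url, length_or_None, path) tuples."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        # Split into at most three fields so the path keeps any inner spaces.
        url, length, path = line.split(None, 2)
        size = None if length == "-" else int(length)
        entries.append((url, size, path.strip()))
    return entries
```

An ingest tool could download each URL to the given path and then validate the completed bag as usual.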

ruebot commented 8 years ago

Comment by ruebot Thursday Feb 19, 2015 at 18:48 GMT


Adding fcrepo and upgration tags since this could also inform the proposed upgration migration tool discussed on today's Fedora Tech call.

ruebot commented 8 years ago

Comment by dmoses Saturday Oct 17, 2015 at 23:16 GMT


I think one of the most common patterns in the Drupal community for batch ingesting is using Feeds. It has a number of support modules for importing XML as well. @mjordan wrote a module a while back. BagIt would be a good choice too and may add predictability to the ingest process.

ruebot commented 8 years ago

Comment by daniel-dgi Tuesday Oct 20, 2015 at 14:21 GMT


Thanks for being awesome, @dmoses. Feeds seem attractive from a Drupal front end point of view. Could maybe parse rdfxml? Would like to hear what @mjordan has to say about pros/cons of using feeds and nodes. His module means he's probably got the most experience in that realm of Drupal land.

Not the first time bags have come up, either. I'm interested in seeing if we can zip them and use them to replace our hand-rolled format for zip importer. Are bags of bags possible, as well? It would be amazing if we could mimic what we're doing in 1.x batch but with a well defined standard.

ruebot commented 8 years ago

Comment by ruebot Tuesday Oct 20, 2015 at 15:52 GMT


Serialized bags are totally a thing. Are you thinking of the book and newspaper batch ingest w/r/t the bags in bags idea?

ruebot commented 8 years ago

Comment by mjordan Tuesday Oct 20, 2015 at 16:18 GMT


@daniel-dgi Bags are agnostic to the content in their 'data' directory and that content's organization, so as @ruebot says, it's legal to have a Bag of Bags. The child Bags would just be serialized into .zip or .tgz files.
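As a rough illustration of a Bag of Bags, the sketch below lists the serialized child Bags sitting in a parent Bag's `data` directory. It only looks at `.zip` files and treats any archive containing a `bagit.txt` entry as a Bag, which is a simplifying assumption. The function name `serialized_child_bags` is hypothetical.

```python
import zipfile
from pathlib import Path, PurePosixPath


def serialized_child_bags(parent_bag_dir):
    """Return names of .zip files in the parent's data/ that look like Bags."""
    payload = Path(parent_bag_dir) / "data"
    found = []
    for candidate in sorted(payload.glob("*.zip")):
        with zipfile.ZipFile(candidate) as zf:
            # A serialized Bag carries a bagit.txt declaration inside the archive.
            if any(PurePosixPath(name).name == "bagit.txt" for name in zf.namelist()):
                found.append(candidate.name)
    return found
```

Handling `.tgz` children would need the `tarfile` module in the same way.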

To answer your question about nodes in Islandora Feeds, I took that approach because 1) it was easy/I am lazy and 2) it uncouples the steps of importing data and committing that data to the Fedora repo as objects. For example, you can perform various types of QA on the nodes before using Views Batch Operations to create the Islandora objects, add other datastreams, etc.

I wrote that module about two years ago; in fact, I started it at OR2013, with @dmoses, @ruebot and some of the usual suspects sitting right beside me in the back few rows of seats. Now that we have a clear path for Islandora 7.x-2.x, it makes even more sense to create nodes (for obvious reasons) than it did then.

A back-of-the-envelope diagram for using an existing tool like Feeds to manage the import and Bags to wrap file assets might look something like this: Feeds creates Drupal nodes that contain F4 object properties (maybe using a Feeds RDF parser?), with pointers to Bags on the Drupal filesystem. Each Bag contains the file assets for an Islandora object. The organization of the content within each Bag would likely be specific to each content model (basic image, newspaper issue, book, etc.). It is also legal to include a (non-Bag) manifest that represents the content model in some way (e.g., OAI-ORE, METS), so we might want to explore that option as well.

Using both Feeds and Bags like this is probably overkill, and preparing the Bags would put an additional burden on content handlers. But, there are a lot of other benefits to Bags that may justify that burden, like built-in checksum generation and packaging. Using holey Bags as @awoods points out would add even more flexibility.

ruebot commented 8 years ago

Comment by daniel-dgi Tuesday Oct 20, 2015 at 16:24 GMT


Maybe we're really talking about two things here? Just using feeds to import nodes, and then zipped bags as a zip importer replacement? Heck, we could even just accept zip files on our services endpoints and use that to consume entire objects as opposed to the multipart/form-data shenanigans I've got going on right now.

Would be nice to use bags in that way since it's a Drupal-agnostic way to move things around. Within Drupal, Feeds definitely seems like a great way to go. Maybe we should make a ticket for someone to dabble around?

This is getting interesting :)

ruebot commented 8 years ago

Comment by manez Tuesday Oct 20, 2015 at 16:28 GMT


My (probably not typical) use case would be vastly improved by a bulk export/ingest interface - some way to pull down a small bunch of objects and their metadata, then upload them back up to another Islandora site. Sounds like that's something in the Bags wheelhouse?

That said, +1 for Feeds being a nice GUI/Drupal-y way to import

ruebot commented 8 years ago

Comment by mjordan Tuesday Oct 20, 2015 at 16:36 GMT


My (recyclable envelope) diagram used both Feeds and Bags because AFAIK Feeds doesn't deal with file assets in any standardized way and I was assuming that the nodes created by Feeds would have some binary files hanging off them. But, the two could be completely separate. Will jump back into the discussion later, must attend all the meetings now :disappointed:

ruebot commented 8 years ago

Comment by daniel-dgi Tuesday Oct 20, 2015 at 19:10 GMT


@mjordan ah, i see. wasn't thinking about feeds not being able to handle files.

ruebot commented 8 years ago

Comment by dmoses Tuesday Oct 20, 2015 at 19:33 GMT


I've got the 7.x-2.x VM downloaded ... you can do files with Feeds. I will investigate and try a proof of concept. Potentially?? it could be another migration tool by parsing the FOXML ... which includes paths to the binaries. Not sure. Will report back.

ruebot commented 8 years ago

Since we've discussed BagIt bags here a fair bit, it might be worth making sure the planned Fedora Import/Export sprint is on their radar.

dannylamb commented 8 years ago

Closing old use cases until after MVP doc is released.