We use the hybag gem and the bagit library to handle bags. Neither gem has
seen updates in a while, and both (in my opinion) overcomplicate what could be
a very simple local process.

The BagIt spec is very simple. I propose that we drop both libraries and
handle bags within our app's code. If our code turns out to make sense for
the community, maybe we push it somewhere, but looking at hybag I'm not seeing
much benefit from the extra work that went into making it generic.
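
To give a sense of how little code this takes, here is a rough sketch of payload verification using only Ruby's standard library. The helper names `read_manifest` and `verify_bag` are made up for illustration and are not part of hybag or bagit. It assumes the usual bag layout (a `data/` payload directory plus a `manifest-<algorithm>.txt` of "checksum path" lines) and ignores tag manifests and fetch.txt. An empty result means the payload matches the manifest.

```ruby
require "digest"

# Hypothetical helper: parses a payload manifest ("<checksum> <path>" per line)
# into a hash of relative path => expected checksum.
def read_manifest(bag_dir, algorithm = "md5")
  manifest = File.join(bag_dir, "manifest-#{algorithm}.txt")
  File.readlines(manifest, chomp: true).reject(&:empty?).to_h do |line|
    checksum, path = line.split(/\s+/, 2)
    [path, checksum.downcase]
  end
end

# Hypothetical helper: returns a list of problems found in the bag payload.
def verify_bag(bag_dir, algorithm = "md5")
  digest_class = { "md5" => Digest::MD5, "sha256" => Digest::SHA256 }.fetch(algorithm)
  expected = read_manifest(bag_dir, algorithm)
  errors = []

  # Every file the manifest lists must exist and hash to the listed checksum.
  expected.each do |path, checksum|
    full = File.join(bag_dir, path)
    if !File.file?(full)
      errors << "missing: #{path}"
    elsif digest_class.file(full).hexdigest != checksum
      errors << "checksum mismatch: #{path}"
    end
  end

  # Every file under data/ must be listed in the manifest.
  Dir.glob("data/**/*", base: bag_dir).each do |path|
    next unless File.file?(File.join(bag_dir, path))
    errors << "unlisted: #{path}" unless expected.key?(path)
  end

  errors
end
```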

One issue I see with the current approach is the "one bag per asset" mindset.
Putting only one asset in each bag adds a fair amount of overhead: a bulk
ingest of 5000 assets means reading 5000 separate bag manifests rather than
just one.

Additionally, verification of bags becomes nearly meaningless. The point of a
bag is to detect files that were missed or corrupted during transmission.
When each asset is its own bag, we can validate that an individual asset
reached its destination intact, but there is no automated mechanism to ensure
the whole batch of assets made it. If we're expecting 5500 assets and only get
5400 of them, nothing in the application will know.
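
With one bag per batch, the manifest itself becomes the record of what should have arrived, so the shortfall described above is a simple comparison. This is only a sketch: `batch_report` is a hypothetical helper name, and it assumes a single `manifest-<algorithm>.txt` listing every asset in the batch. It checks presence only, not checksums.

```ruby
# Hypothetical helper: compares the files a batch manifest lists against the
# payload files actually present in the bag directory.
def batch_report(bag_dir, algorithm = "md5")
  manifest = File.join(bag_dir, "manifest-#{algorithm}.txt")

  # Each manifest line is "<checksum> <path>"; only the path matters here.
  listed = File.readlines(manifest, chomp: true)
               .reject(&:empty?)
               .map { |line| line.split(/\s+/, 2).last }

  received = listed.select { |path| File.file?(File.join(bag_dir, path)) }

  { expected: listed.size, received: received.size, missing: listed - received }
end
```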

We already have too many places with room for human error. Bags are trivial
to handle in our own codebase. The current "one bag per asset" process isn't
worth the extra error potential, and the bagit and hybag libraries aren't
worth the extra debugging and maintenance.

As a possibly separate issue, in a perfect world, I'd like the bag ingest to
record workflow metadata so that when an asset comes from a bag, we can
identify which bag it came from and see whether the bag as a whole was
ingested properly. Items ingested from bags shouldn't be visible or even
reviewable unless the whole bag ingest succeeded. If a bag ingest fails, all
items from that bag should be purged from Fedora.
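
The all-or-nothing behavior could look roughly like the sketch below. Everything here is hypothetical: `BagIngest` is an invented class, and `repository` is a stand-in for whatever layer talks to Fedora, assumed to respond to `create(path, metadata)` (returning an id) and `delete(id)`. Each item is tagged with its source bag, and visibility would be derived from the bag's `status` rather than set per item.

```ruby
# Hypothetical ingest wrapper; not an existing class in our app or any gem.
class BagIngest
  attr_reader :bag_id, :status, :ingested_ids

  def initialize(bag_id, repository)
    @bag_id = bag_id
    @repository = repository
    @status = :pending
    @ingested_ids = []
  end

  # Ingests every path or none: any failure purges what was already created.
  # Returns the final status (:succeeded or :failed).
  def run(paths)
    paths.each do |path|
      # Record the source bag on each item so it can be traced back later.
      @ingested_ids << @repository.create(path, { source_bag: @bag_id })
    end
    @status = :succeeded
  rescue StandardError
    # Roll back: remove every item created before the failure.
    @ingested_ids.each { |id| @repository.delete(id) }
    @ingested_ids.clear
    @status = :failed
  end
end
```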