On detecting deleted files across versions

marcolarosa commented 3 years ago

Conversation moved from https://github.com/OCFL/spec/issues/525

In https://github.com/OCFL/spec/issues/522 I talked about a way we might use S3 as a backend. One point I made is that creating a new version would require pulling the entire object down from S3, which would be terrible for very large objects (TB-sized objects, but even a few GB would make updates slow).

In his reply @pwinckles stated that the most recent inventory is likely the only thing needed.

This ticket is about thinking through how to detect that a file has been deleted in the next version without needing the whole object (I can't see how that is possible, hence why I'm asking for help!).

Consider the following:

  v1                                     v2
  |- File A - hash X                     |- File A - hash X

No change; do not create new version.

  v1                                     v2
  |- File A - hash X                     |- File A - hash X
                                         |- File B - hash Y

New file; create new version referencing File A -> v1 and File B -> v2

  v1                                     v2
  |- File A - hash X                     |- File A - hash Z
  |- File B - hash Y                     |- File B - hash Y

File changed (File A); create new version referencing File B -> v1, File A -> v2
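The three cases above can be sketched roughly as follows. This is only an illustration, not the ocfl-js API: `nextInventory` and `canonical` are hypothetical helpers, the inventory is cut down to `head`, `manifest` and `versions[...].state` (a real OCFL inventory has more required fields), and version names are assumed to be unpadded (`v1`, `v2`, ...).

```javascript
// Hypothetical helpers, not part of ocfl-js.

// Stable string form of an OCFL state block (digest -> [logical paths]),
// so two states can be compared regardless of key order.
function canonical(state) {
  return JSON.stringify(
    Object.keys(state).sort().map((digest) => [digest, [...state[digest]].sort()])
  );
}

// Derive the next inventory from the previous one plus a FULL tree of
// { logicalPath: digest } pairs. Returns null when nothing changed.
function nextInventory(prev, tree) {
  const head = prev.head;                        // e.g. "v1"
  const next = `v${Number(head.slice(1)) + 1}`;  // e.g. "v2" (assumes no zero padding)

  // Build the new state block: digest -> [logical paths].
  const state = {};
  for (const [path, digest] of Object.entries(tree)) {
    if (!state[digest]) state[digest] = [];
    state[digest].push(path);
  }

  // Case 1: identical state, so do not create a new version.
  if (canonical(state) === canonical(prev.versions[head].state)) return null;

  // Cases 2 and 3: only digests not yet in the manifest get a content path
  // in the new version; known digests keep pointing at the older version.
  const manifest = { ...prev.manifest };
  for (const [digest, paths] of Object.entries(state)) {
    if (!manifest[digest]) manifest[digest] = [`${next}/content/${paths[0]}`];
  }

  return { head: next, manifest, versions: { ...prev.versions, [next]: { state } } };
}

const inventoryV1 = {
  head: "v1",
  manifest: { X: ["v1/content/File A"] },
  versions: { v1: { state: { X: ["File A"] } } },
};

const noChange = nextInventory(inventoryV1, { "File A": "X" });
const withFileB = nextInventory(inventoryV1, { "File A": "X", "File B": "Y" });
```

Here `noChange` is `null` (first case), while `withFileB.manifest` keeps hash X pointing at `v1/content/File A` and adds hash Y at `v2/content/File B` (second case).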

Our library works by digesting the path (walking the object tree and producing pairs of files + hashes) and then comparing the new tree to the existing tree. In all of the cases above we get the expected behaviour. However, if the new version doesn't contain the whole dataset, then changing a single file results in all of the other data being removed from the next version.

  v1                                     v2
  |- File A - hash X                     |- File A - hash Z
  |- File B - hash Y                     

File changed (File A); create new version referencing File A -> v2, but File B ends up removed from the new version.

So, if I've thought this through correctly, comparing against the latest inventory rather than a full digest means we will pick up file changes and file additions, but we can't tell a deleted file from one that simply wasn't supplied. To avoid dropping files by accident we would always need to include everything that is referenced in the latest version, which means we would never be able to remove a file from one version to the next.
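To make the ambiguity concrete, here is a minimal sketch of the tree comparison. It assumes both trees are plain `{ path: hash }` objects; `diffTrees` is a hypothetical name and not the library's actual function.

```javascript
// Hypothetical helper, not part of ocfl-js.
const has = (obj, key) => Object.prototype.hasOwnProperty.call(obj, key);

// Compare the previous version's tree with the incoming tree.
// Both arguments are plain objects of { path: hash }.
function diffTrees(previous, incoming) {
  const diff = { added: [], changed: [], removed: [], unchanged: [] };

  for (const [path, hash] of Object.entries(incoming)) {
    if (!has(previous, path)) diff.added.push(path);
    else if (previous[path] !== hash) diff.changed.push(path);
    else diff.unchanged.push(path);
  }

  // Anything in the previous tree that is absent from the incoming tree is
  // reported as removed. The comparison cannot know WHY it is absent.
  for (const path of Object.keys(previous)) {
    if (!has(incoming, path)) diff.removed.push(path);
  }

  return diff;
}

const previousTree = { "File A": "X", "File B": "Y" };

// Full tree supplied: File A changed, File B untouched.
const fullDiff = diffTrees(previousTree, { "File A": "Z", "File B": "Y" });

// Only the changed file supplied: File B is now reported as removed.
const partialDiff = diffTrees(previousTree, { "File A": "Z" });
```

With the full tree, `fullDiff.removed` is empty. With only the changed file, `partialDiff.removed` contains `File B` even though nobody asked for it to be deleted, which is exactly the problem described above.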

Is there another way to detect file deletions across versions without needing all of the object data up to that point?

marcolarosa commented 3 years ago

From discussion with @ptsefton.

In the current mode the library just takes the new version and adds it to an existing OCFL object. In the existing use cases this is fine, as a source (e.g. Omeka, a filesystem or something else) pushes data to an OCFL repo. That is, the current state is available somewhere else and we manipulate it there before updating the OCFL object.

However, in the PARADISEC world we're looking at using the OCFL repo as the primary source of truth. So, in that case, the library needs to know how to perform operations when updating an object so that we don't need to rehydrate a new version from the latest version each time.

For example, rather than getting a copy of the current state and then manipulating the data, what we want is to perform just the minimal operations on the new version and have the library handle the change sensibly:

- add a new file to an object - add to new version and merge with existing object, changing only inventory files
- change an existing file - add a new version of the file and merge with existing object, rewriting relevant inventories
- move a file - operate only on the inventories to document the move
- rename a file - operate only on the inventories to document the rename
- delete a file - operate only on the inventories

These operations would still create new versions of the object. It's just that the operations would be handled in a more sophisticated way so that we wouldn't need all of the current state (data) in order to work out the diff and decide to version or not.
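One way the operations listed above could look is sketched below. Everything here is an assumption for illustration: `applyOperations` and the shape of the operation objects are hypothetical, not an existing ocfl-js interface, and the state is flattened to `{ path: digest }` for readability (an OCFL inventory stores it as digest -> [paths]).

```javascript
// Hypothetical sketch, not part of ocfl-js.
const hasOwn = (obj, key) => Object.prototype.hasOwnProperty.call(obj, key);

// Apply explicit operations to the latest version's state and return the
// state for the next version. Only the inventory-level state is touched;
// no existing content needs to be downloaded.
function applyOperations(latestState, operations) {
  const next = { ...latestState };

  for (const op of operations) {
    switch (op.type) {
      case "add":
      case "change":
        // The caller supplies the digest of the new content; only this
        // file's bytes would need to be written to the new version.
        next[op.path] = op.digest;
        break;
      case "move":
      case "rename":
        // Inventory-only: the same digest is listed under a new path.
        if (!hasOwn(next, op.from)) throw new Error(`No such file: ${op.from}`);
        next[op.to] = next[op.from];
        delete next[op.from];
        break;
      case "delete":
        // Inventory-only: the path is dropped from the new version's state.
        if (!hasOwn(next, op.path)) throw new Error(`No such file: ${op.path}`);
        delete next[op.path];
        break;
      default:
        throw new Error(`Unknown operation: ${op.type}`);
    }
  }

  return next;
}

const latestState = { "File A": "X", "File B": "Y" };

const nextState = applyOperations(latestState, [
  { type: "change", path: "File A", digest: "Z" },
  { type: "delete", path: "File B" },
  { type: "add", path: "File C", digest: "W" },
  { type: "rename", from: "File C", to: "docs/File C" },
]);
```

In this sketch a deletion is an explicit instruction rather than something inferred from a missing file, so files that are simply not mentioned (none in this example) carry over unchanged from the latest inventory.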

marcolarosa commented 3 years ago

@ptsefton I've looked through the UTS OCFL library commits and branches and can't see anything like what you mentioned re: identifying operations like delete. Can you please link here what you were telling me about?
