OCFL / Use-Cases

A repository to help capture, track, and discuss use cases for OCFL. Issues-only, please.

Support physical file-level deletion #42

Open slabrams opened 1 year ago

slabrams commented 1 year ago

There are legitimate curatorial reasons for being able to physically remove individual files from an object. Right now, the only way to deal with this is through the Purge procedure outlined in the Implementation notes. This requires deleting the entire object and then re-creating it without the implicated files. It would be useful to work with the OCFL community to create an easier, more automated way to do this that would rewrite inventories and perhaps leave a tombstone somewhere, either in the directory structure or just as metadata.

zimeon commented 11 months ago

Thoughts from 2023-09-22 editors' meeting:

This would be a big change to OCFL, where up to v1.1 we consider versions to be immutable once written. We see possible uses with mutable filesystems where there is some compelling reason to delete a file, or with either mutable or immutable storage where a file is corrupted and irrecoverable. Unless the whole object is rewritten, in both of these cases the versions using the file will be broken and fail validation. This could be indicated in a new version (that does validate) that records what content from prior versions is no longer available or valid.

One way to do this would be to have something like a tombstone block that parallels the manifest block. So, imagine that the one file.txt in the spec's minimal object example is broken/deleted, then a v2 inventory might be:

{
  "digestAlgorithm": "sha512",
  "head": "v2",
  "id": "http://example.org/minimal",
  "manifest": { },
  "tombstone": {
    "7545b8...f67": [ "v1/content/file.txt" ]
  },
  "type": "https://ocfl.io/1.1/spec/#inventory",
  "versions": {
    "v1": {
      "created": "2018-10-02T12:00:00Z",
      "message": "One file",
      "state": {
        "7545b8...f67": [ "file.txt" ]
      },
      "user": {
        "address": "mailto:alice@example.org",
        "name": "Alice"
      }
    },
    "v2": {
      "created": "2018-10-02T12:00:00Z",
      "message": "The one file is gone",
      "state": { },
      "user": {
        "address": "mailto:alice@example.org",
        "name": "Alice"
      }
    }
  }
}

If something like this were supported, then we may wish to have a root-level flag to say that the root has mutable content.
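A minimal sketch of how a tool might derive such a v2 inventory from the v1 inventory. It assumes the `tombstone` block proposed in this comment (it is not part of OCFL 1.1) and unpadded version names such as `v1`. The helper name `add_tombstone_version` is hypothetical, and the digest is left elided exactly as in the example.

```python
import copy

# The v1 inventory from the example above (digest elided as in the source).
MINIMAL_V1 = {
    "digestAlgorithm": "sha512",
    "head": "v1",
    "id": "http://example.org/minimal",
    "manifest": {"7545b8...f67": ["v1/content/file.txt"]},
    "type": "https://ocfl.io/1.1/spec/#inventory",
    "versions": {
        "v1": {
            "created": "2018-10-02T12:00:00Z",
            "message": "One file",
            "state": {"7545b8...f67": ["file.txt"]},
            "user": {"address": "mailto:alice@example.org", "name": "Alice"},
        }
    },
}


def add_tombstone_version(inventory, purged_paths, created, message, user):
    """Return a copy of `inventory` with a new head version in which the
    given content paths are moved from `manifest` to a `tombstone` block.

    Assumption: `tombstone` mirrors the manifest (digest -> content paths).
    Assumption: version names are unpadded ("v1", "v2", ...).
    """
    inv = copy.deepcopy(inventory)
    purged = set(purged_paths)
    tombstone = inv.setdefault("tombstone", {})
    gone_digests = set()

    for digest, paths in list(inv["manifest"].items()):
        removed = [p for p in paths if p in purged]
        if not removed:
            continue
        tombstone.setdefault(digest, []).extend(removed)
        remaining = [p for p in paths if p not in purged]
        if remaining:
            inv["manifest"][digest] = remaining
        else:
            # No copy of this content is left anywhere in the object.
            del inv["manifest"][digest]
            gone_digests.add(digest)

    # The new version's state drops content that is no longer available;
    # earlier versions are left untouched and still reference it.
    old_head = inv["head"]
    new_head = f"v{int(old_head[1:]) + 1}"
    prev_state = inv["versions"][old_head]["state"]
    new_state = {d: list(p) for d, p in prev_state.items() if d not in gone_digests}
    inv["versions"][new_head] = {
        "created": created,
        "message": message,
        "state": new_state,
        "user": user,
    }
    inv["head"] = new_head
    return inv
```

A validator that understood the block could then treat a v1 state entry whose digest appears only under `tombstone` as known-missing rather than as an error.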

neilsjefferies commented 11 months ago

@tomwrobel

tomwrobel commented 11 months ago

I gave some thought to this some time ago, while coming up with the description for how to purge files in ORA if it were ever required.

There were a few things I thought important. What is presented here isn't a proposed solution for the community, but it's a list of considerations and what we thought important. I like the idea of a manifest section much more than I like our proposed internal solution (a JSON file)!

We would want to record why a file was purged

File purges can happen for incidental reasons, such as a file becoming corrupt, but when they are deliberate, it's often for a legal or other compliance reason. We might, therefore, want to be able to audit the object at a later stage. If we were to find two copies of the object, one with a purged file and one without, it would be useful to know if we could restore the file (if the file was purged from OCFL because it was corrupt) or if we should never restore the file.

We decided to store the date/time of purge, the user responsible for the purge, and a message stating the reason for the purge

We would want to know which file was purged

We would want to maintain a record of which file was purged. This would allow us to demonstrate that the file was previously present, but was no longer there. Again, this allows for accurate comparison with a copy of an object which contained the binary file, as well as providing a demonstration that the file that was purged was no longer on the system (i.e. it would be possible to demonstrate that no file with that checksum remained). We didn't want to preserve filenames, as a filename in itself might constitute purged information. We settled on storing a checksum of the purged binary file, alongside the digest algorithm used to generate that checksum.

We would want to know the state of the object at the time of purge

This was a way to be able to compare two copies of an object, one with the purged file and one without. The solution to this was to store the inventory digest for the current version of the record at the time of purge (updated thought: better would be the inventory digests for all versions of the object at time of the purge). This would allow comparison between two copies of the same object.

What we proposed internally (not necessarily a good idea)

Create a new version of the object containing a file named {sha512_of_binary_file}.purged.json. This JSON file would hold the following information:
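Going only by the considerations listed in this comment (purge date/time, responsible user, reason, checksum plus digest algorithm, and inventory digests at the time of purge), such a record could be assembled roughly as follows. This is a sketch: the field names and the helper `build_purge_record` are assumptions for illustration, not the names used in ORA.

```python
import hashlib
import json
from datetime import datetime, timezone


def build_purge_record(file_bytes, inventory_digests, user, reason, purged_at=None):
    """Return (filename, json_text) for a hypothetical purge record.

    file_bytes        -- content of the file being purged (only hashed, never stored)
    inventory_digests -- mapping of version name -> inventory digest at purge time
    user, reason      -- who purged the file and why
    purged_at         -- timezone-aware datetime; defaults to the current UTC time
    """
    digest = hashlib.sha512(file_bytes).hexdigest()
    when = purged_at or datetime.now(timezone.utc)
    record = {
        # All keys below are assumed names, chosen for this sketch.
        "digestAlgorithm": "sha512",
        "digest": digest,
        "purgedAt": when.isoformat(),
        "user": user,
        "reason": reason,
        "inventoryDigests": dict(inventory_digests),
    }
    # The original filename is deliberately not recorded, since a filename
    # may itself constitute purged information.
    filename = f"{digest}.purged.json"
    return filename, json.dumps(record, indent=2, sort_keys=True)
```

Naming the record after the checksum lets two copies of an object be compared without revealing the original filename.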

rosy1280 commented 10 months ago

Feedback on Use Cases

In advance of version 2 of the OCFL, we are soliciting feedback on use cases. Please feel free to add your thoughts on this use case via the comments.

Polling on Use Cases

In addition to reviewing comments, we are doing an informal poll for each use case that has been tagged as Proposed: In Scope for version 2. You can contribute to the poll for this use case by reacting to this comment. The following reactions are supported:

In favor of the use case: 👍🏼
Against the use case: 👎🏼
Neutral on the use case: 👀

The poll will remain open through the end of February 2024.

je4 commented 10 months ago

As long as inventory.json is not changed, the deletion of files should be supported. To prevent the validation from failing, the files could be replaced with a file with a defined checksum, which then generates a warning instead of an error during validation. Tracking can then take place in a new version.
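One way to read this suggestion, as a rough sketch: the manifest keeps the original digest, the purged file is overwritten with an agreed placeholder, and a validator downgrades that specific mismatch to a warning. The placeholder content, the function name `check_manifest`, and the message wording are all assumptions. No such placeholder is defined by OCFL.

```python
import hashlib

# Hypothetical agreed-upon placeholder; its digest is the "defined checksum".
PLACEHOLDER_CONTENT = b"OCFL-PURGED-PLACEHOLDER\n"
PLACEHOLDER_DIGEST = hashlib.sha512(PLACEHOLDER_CONTENT).hexdigest()


def check_manifest(manifest, read_bytes):
    """Check manifest fixity, tolerating purged files.

    manifest   -- mapping of sha512 hex digest -> list of content paths
    read_bytes -- callable(path) returning the file's bytes, or None if missing
    Returns (errors, warnings) as two lists of strings.
    """
    errors, warnings = [], []
    for expected, paths in manifest.items():
        for path in paths:
            data = read_bytes(path)
            if data is None:
                errors.append(f"missing: {path}")
                continue
            actual = hashlib.sha512(data).hexdigest()
            if actual == expected:
                continue
            if actual == PLACEHOLDER_DIGEST:
                # Known purge marker: report it, but do not fail validation.
                warnings.append(f"purged, placeholder present: {path}")
            else:
                errors.append(f"digest mismatch: {path}")
    return errors, warnings
```

Passing the reader as a callable keeps the sketch independent of the storage layer. For a local filesystem it could wrap `open(path, "rb").read()`.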

bdwheele commented 6 months ago

This seems useful, as we've had cases where we've had to delete files out of objects for legal reasons.
One question: does the content address for the removed file exist in both the manifest and the tombstone, or just in the tombstone?

rosy1280 commented 6 months ago

At the time of this comment the vote tallied to +6. Confirming this as in scope for version 2 of the specification.