OCFL / Use-Cases

A repository to help capture, track, and discuss use cases for OCFL. Issues-only, please.

Corruption Recovery #15

Closed ahankinson closed 1 year ago

ahankinson commented 6 years ago

A power outage occurred when a software component was in the middle of writing an OCFL object, leaving the object in an ambiguous state. There should be mechanisms for recovering from various failure modes.

zimeon commented 6 years ago

Perhaps the first question is how one can understand the state of the OCFL object. Then, what mechanisms might prevent the corruption from affecting the integrity of a version previous to the one being added? And how could one revert the partial update to get back to a clean state in order to re-run the update?

ahankinson commented 6 years ago

One of the necessary tools for OCFL will be a validator, and so the state of an OCFL object would ultimately be one that is valid according to the spec. Of course, when writing a bunch of files to disk the possible failure states can range from "Connection to the NFS / S3 store failed" (i.e., relatively high-level) to "A disk array lost power and no battery was available to let it finish writing" (i.e., low-level).

It may be that this is where the spec could specify a recommended order of operations for OCFL filesystems, e.g., take checksum, write file to disk, record checksum. This would let a validator know whether a) a file was written completely (matches recorded checksum); b) a checksum was recorded correctly (a file exists with a matching checksum recorded).
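The order of operations suggested above (take checksum, write file, record checksum) can be sketched roughly as follows. This is only an illustration, not anything the spec prescribes: the function names, the plain-text checksum log, and the choice of SHA-512 are all assumptions made for the example.

```python
import hashlib
import os


def sha512_hex(data: bytes) -> str:
    """Return the hex SHA-512 digest of the given bytes."""
    return hashlib.sha512(data).hexdigest()


def store_file(data: bytes, dest_path: str, checksum_log: str) -> str:
    """Hypothetical writer: checksum first, then write, then record."""
    digest = sha512_hex(data)                      # 1. take checksum
    with open(dest_path, "wb") as f:               # 2. write file to disk
        f.write(data)
        f.flush()
        os.fsync(f.fileno())
    with open(checksum_log, "a", encoding="utf-8") as log:
        log.write(f"{digest}  {dest_path}\n")      # 3. record checksum
        log.flush()
        os.fsync(log.fileno())
    return digest


def read_log(checksum_log: str) -> dict:
    """Parse the (assumed) 'digest  path' log into a path -> digest map."""
    recorded = {}
    if os.path.exists(checksum_log):
        with open(checksum_log, encoding="utf-8") as log:
            for line in log:
                digest, _, path = line.rstrip("\n").partition("  ")
                if path:
                    recorded[path] = digest
    return recorded


def check_file(dest_path: str, checksum_log: str) -> str:
    """Classify a file the way a validator might after an interrupted write."""
    recorded = read_log(checksum_log)
    exists = os.path.exists(dest_path)
    if dest_path not in recorded:
        # No record: the write may have been interrupted before step 3.
        return "unrecorded" if exists else "absent"
    if not exists:
        return "missing"
    with open(dest_path, "rb") as f:
        actual = sha512_hex(f.read())
    return "ok" if actual == recorded[dest_path] else "mismatch"
```

With this ordering, a file on disk that has no recorded checksum points to an interruption during or just after the write, while a recorded checksum that does not match the file points to later damage.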

I don't think we could enumerate all the possible failure states, but perhaps we could view the validation process as a bit like "fsck", where it could detect and alert the maintainer that something was wrong, giving them the ability to fix it.

zimeon commented 6 years ago

Yes. Given the critical place of the manifest/versions.jsonld file in being able to reconstruct the state of the object from the blobs, I think rules for update might include rules about writing the new manifest to some agreed temporary file and then switching the two over in a controlled way (which might differ between filesystems and cloud stores).
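On a local filesystem, the "write to a temporary file, then switch over" idea might look something like the sketch below. The `.tmp` suffix, the function name, and the JSON serialization are assumptions for illustration; nothing here is an agreed OCFL convention.

```python
import json
import os


def replace_manifest(manifest_path: str, new_manifest: dict) -> None:
    """Write the new manifest beside the old one, then swap it into place."""
    tmp_path = manifest_path + ".tmp"  # hypothetical temporary file name
    with open(tmp_path, "w", encoding="utf-8") as f:
        json.dump(new_manifest, f, indent=2)
        f.flush()
        os.fsync(f.fileno())  # make sure the bytes are on disk before the swap
    # os.replace is atomic on POSIX when both paths are on the same filesystem,
    # so readers see either the old manifest or the new one, never a partial file.
    os.replace(tmp_path, manifest_path)
```

If the process dies before `os.replace`, the old manifest is untouched and only a stray `.tmp` file needs cleaning up. Object stores generally lack an equivalent rename, so a different mechanism would be needed there.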

ahankinson commented 6 years ago

F2F 2018.09.05: An object with a version directory that has no record in the inventory is invalid; per #14, this is not permitted. Specific automated or manual interventions are not prescribed and are not in scope.
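A check for the invalid state described above (a version directory with no record in the inventory) could be sketched as follows. It assumes the inventory is a JSON file named `inventory.json` in the object root with a top-level `versions` object keyed by version directory name, and that version directories are named `v` followed by digits; the function name is hypothetical.

```python
import json
import os
import re

# Assumed naming pattern for version directories, e.g. v1, v2, v0003.
VERSION_DIR = re.compile(r"^v\d+$")


def unrecorded_version_dirs(object_root: str) -> list:
    """Return version directories on disk that the inventory does not list."""
    inventory_path = os.path.join(object_root, "inventory.json")
    with open(inventory_path, encoding="utf-8") as f:
        inventory = json.load(f)
    recorded = set(inventory.get("versions", {}))
    on_disk = {
        entry.name
        for entry in os.scandir(object_root)
        if entry.is_dir() and VERSION_DIR.match(entry.name)
    }
    return sorted(on_disk - recorded)
```

A non-empty result only flags the object as invalid; what to do about it is left to the application or the maintainer.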

neilsjefferies commented 5 years ago

A lot of this is now covered in the Implementation Notes on writing new versions.

zimeon commented 1 year ago

Editors' meeting 2023-09-22: OCFL, by its nature, cannot provide a strong notion of transaction. An application writing an OCFL object must manage that process and ensure that it completes creation of a valid object. On failure, there must be some cleanup, and some ideas are documented in the Implementation Notes - Clean up. Closing as out-of-scope.