OCFL / Use-Cases

A repository to help capture, track, and discuss use cases for OCFL. Issues-only, please.

Support file locking and multiple services working on same object #21

Closed by zimeon 6 years ago

zimeon commented 6 years ago

In an archival repository based on OCFL there are multiple processes handling ingest, update (new versions), format validation (with ever improving tools that cover more formats as time goes on), and fixity checking. All of these processes may write to some part of an OCFL object (either the data itself or the logs).

Is some OCFL-object locking mechanism required to avoid collisions or race conditions? If so, is that in or out of scope for the spec?

ahankinson commented 6 years ago

cf. LockIt

I would prefer it be out of scope of the spec, since locking is hard.

Also, if we design it correctly, processes should only ever be operating on consistent data. Operations like format validation should be read-only. Logging is already partially out of scope (we suggest that there should be logs, but not what those logs should contain or how they should be written).

So the only real crunch point I can see is when two operations want to create a new version simultaneously. If we do our maths correctly, we can minimize this chance by EITHER leaving the version number calculation to the very end (minimizing the amount of time when there might be two write operations for the same version number) OR writing the anticipated version number as soon as possible (thus serving as an implicit lock for that version and its contents), and then writing the contents to that directory (which may be a much slower operation if the files are large).

In the former option, we may need to implement shadow writes (write the entire version to a temporary space in the same filesystem, and then move it into place with a new version number, an operation that should be almost immediate.)
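A minimal sketch of the shadow-write idea, assuming a POSIX filesystem and version directories named `v1`, `v2`, ... (the naming and the function names `next_version_name` and `shadow_write_version` are hypothetical, not taken from the spec). The version is staged in a hidden directory inside the object root, so the final rename stays on one filesystem, and the version number is only computed right before the move:

```python
import os
import tempfile
from pathlib import Path
from typing import Dict


def next_version_name(object_root: Path) -> str:
    """Return the next unused version directory name, e.g. 'v3' (assumed naming)."""
    existing = [
        int(p.name[1:])
        for p in object_root.iterdir()
        if p.is_dir() and p.name.startswith("v") and p.name[1:].isdigit()
    ]
    return f"v{max(existing, default=0) + 1}"


def shadow_write_version(object_root: Path, files: Dict[str, bytes]) -> Path:
    """Stage a whole version in a temporary directory, then rename it into place."""
    # Stage inside the object root so the later rename never crosses filesystems.
    staging = Path(tempfile.mkdtemp(prefix=".staging-", dir=object_root))
    for relative_path, data in files.items():
        target = staging / relative_path
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(data)  # the slow part happens before a number is chosen

    # Pick the version number at the very end to keep the race window short.
    final = object_root / next_version_name(object_root)
    # On POSIX this raises OSError if another writer has already populated
    # `final`; an empty existing directory would be silently replaced.
    os.rename(staging, final)
    return final
```

This narrows the window rather than closing it: two writers can still compute the same number, and the loser would need to catch the `OSError` and retry or abort.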

In the case of HTTP object stores, though, all bets are off.

zimeon commented 6 years ago

I agree that implementing locking with a decentralized object store like S3 would be tricky (I note that S3 doesn't support If-Match and related headers for write operations). I also think that the way you would try to implement or imitate locking on S3 might be quite different from relying on atomic Unix filesystem operations.

I also agree that creation of a new version and the corresponding update of versions.jsonld is probably the key operation to worry about (given that the specifics of logging are out of scope). On a Unix filesystem, the use of atomic mkdir (with failure if the directory exists) provides an easy way to get an implicit lock. On S3 there aren't really any directories (just object names that look that way), so that wouldn't work. Maybe one way to address this is to think about the spec as primarily describing the "file/object layout at rest" (properties, validity constraints, etc.) and have a separate discussion of managing change.
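To illustrate the implicit lock via mkdir, here is a rough sketch for a local Unix filesystem. The `v1`, `v2`, ... directory naming and the function name `claim_version_dir` are assumptions for illustration only; nothing here applies to S3-style stores.

```python
import os
from pathlib import Path
from typing import Optional


def claim_version_dir(object_root: Path, max_attempts: int = 100) -> Optional[Path]:
    """Claim the first free version directory, or return None if none was claimed."""
    number = 1
    for _ in range(max_attempts):
        candidate = object_root / f"v{number}"
        try:
            # os.mkdir either creates the directory or raises FileExistsError,
            # so only one process can succeed for a given version number.
            os.mkdir(candidate)
            return candidate
        except FileExistsError:
            number += 1  # someone else owns this version; try the next one
    return None
```

Whoever gets the directory back owns that version and can write its contents at leisure. A process that crashes after claiming leaves an empty directory behind, so some cleanup or recovery step would still be needed.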

The discussion of version inventory / forward versioning (https://github.com/OCFL/spec/issues/3) has some impact on the problems of updating a top-level versions.jsonld and recovery options in case of a problem.

ahankinson commented 6 years ago

Yes, differentiating between 'object at rest' and 'managing change' is an excellent way of putting it. I had tried to frame it as 'spec' and 'client behaviours', but I like 'managing change' better.

ahankinson commented 6 years ago

F2F 2018.09.05: Out of scope; file locking and object consistency over multiple services is a local consideration and the OCFL spec will be silent on this.