OCFL / spec

The Oxford Common File Layout (OCFL) specifications
https://ocfl.io

using unique IDs instead of digest to reference files #632

Closed eroux closed 9 months ago

eroux commented 1 year ago

I'm starting to explore OCFL for our archive and I can't understand why only a digest can be used as the unique identifier for a content file. I understand that a sha512 digest is a convenient way to construct (quasi) unique identifiers, but why can't users provide their own identifiers and put the digests in the fixity object? We already have globally unique identifiers for each (de-duplicated) file in our archive, and we would like to use them here. But since we want to avoid collisions in our IDs at the archive level, we can't commit to using a digest (we are constructing our IDs as sha256 except in case of collision, where we assign one randomly).

My proposal is to allow archives to provide their own identifiers for content files, locally unique in an object, and only suggest a digest as a convenient way to do so. This may seem like a radical proposal but I do believe it simplifies things and keeps the original spirit of the spec.

neilsjefferies commented 1 year ago

The use of digests is not just a matter of convenience. It is an important part of OCFL robustness in that it greatly improves the reconstructability of damaged or partial OCFL storage roots. When filesystems break, files can generally be recovered relatively easily but directory structures are much more fragile. By having multiple inventory copies and being able to relate files to inventory entries based purely on their content, we reduce the need to rely on directory structures remaining intact.
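
To illustrate the reconstruction argument above, here is a minimal Python sketch, not part of the OCFL spec or of any existing tool. It assumes a surviving copy of an inventory's manifest block (digest mapped to a list of content paths) and a directory of recovered files whose names and folders have been lost. The function names `sha512_hex` and `match_recovered_files` are hypothetical.

```python
import hashlib
from pathlib import Path


def sha512_hex(path: Path) -> str:
    """Return the lowercase hex SHA-512 digest of a file, read in chunks."""
    digest = hashlib.sha512()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def match_recovered_files(manifest, recovered_dir):
    """Map each recovered file to the content paths its digest has in the manifest.

    `manifest` is an inventory manifest block: digest -> list of content paths.
    Keys of the result are paths relative to `recovered_dir`. Files whose
    digest does not appear in the manifest are left out.
    """
    recovered_dir = Path(recovered_dir)
    # Hex digests are compared case-insensitively, so normalise the keys.
    by_digest = {key.lower(): paths for key, paths in manifest.items()}
    matches = {}
    for path in sorted(recovered_dir.rglob("*")):
        if path.is_file():
            content_paths = by_digest.get(sha512_hex(path))
            if content_paths is not None:
                matches[path.relative_to(recovered_dir).as_posix()] = content_paths
    return matches
```

Because the lookup depends only on file contents, it works even if a recovery tool has renamed every file. With custom, non-digest keys the same match would have to go through a separate fixity record.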

eroux commented 1 year ago

Sure, I understand the importance of digests, I'm not critiquing that OCFL mandates the use of them for fixity (in the fixity section or in the inventory digest file). What I'm critiquing is mandating that the unique IDs for content files need to be digests. It would be possible to use things other than digests for unique IDs and still have digests mandated otherwise.

rosy1280 commented 1 year ago

What @neilsjefferies was emphasizing is that the digests are for content addressability -- the use of a hash ties the file's contents to its location. It also allows us to reduce the amount of information stored about a file in the inventory.json; keep in mind that the fixity section is something that MAY exist if you choose to record hashes other than SHA512. Finally, using hashes reduces the amount of software necessary to recreate and/or understand a storage root and/or an object. One of the original impetuses for the OCFL was to reduce pain when migrating a repository; not all repository migrations retain IDs between technologies (e.g. you migrate from Fedora to DSpace).

Regardless, this would be a breaking change and would be something we wouldn't discuss until version 2 (when we will implement breaking changes). In the meantime, you may want to consider storing IDs related to files via an extension; information on creating extensions can be found in the OCFL Extensions repository.

eroux commented 1 year ago

Thanks for your comment!

To be clear, I'm not proposing the ban of digests as unique identifiers, so projects that use it for convenience (for all the valid reasons you mention) wouldn't have to change their workflow.

I agree that it would make sense to make the fixity section mandatory if the file IDs are not digests.

I'll start with an extension for my use case, thanks!

zimeon commented 1 year ago

I think this is an interesting question that will be useful to discuss as we think about v2.

I certainly understand the possibility (albeit very improbable until one tries to store a cryptography paper in a few years that describes breaking sha512, with examples) that one could have two different files that are impossible to store unchanged according to the current use of sha512 for content addressing. We did consider other linking schemes a little while working on v1, see for example https://github.com/OCFL/spec/issues/275#issuecomment-437452101 .

The question this raises in my mind is how flexible OCFL should be -- increasing the number of implementation choices increases the complexity of the specification, increases the complexity of shared tooling, and reduces the specificity of what "implementing OCFL" means. This is an interesting set of trade-offs to consider when trying to encourage a common approach.

neilsjefferies commented 1 year ago

One possible way of doing this in an OCFL V1 compliant way without an extension would be to name the file on disk according to its ID. Its original name would still be recorded in the inventory as its logical name. Some non-OCFL repositories do this already.
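
A rough Python sketch of that layout, under stated assumptions: the IDs (`F0001`, ...) are hypothetical and assumed to be safe as file names, the object has a single version `v1`, and the inventory is trimmed to the blocks relevant here (a real inventory needs further fields, such as `type` and a `created` timestamp per version). The helper names are made up for illustration.

```python
import hashlib


def build_inventory_sketch(object_id, files):
    """Build a trimmed inventory whose content paths are named after custom IDs.

    `files` is a list of (custom_id, logical_path, data) tuples.
    """
    manifest = {}
    state = {}
    for custom_id, logical_path, data in files:
        digest = hashlib.sha512(data).hexdigest()
        # On disk, the file is named after the archive's own ID ...
        manifest.setdefault(digest, []).append(f"v1/content/{custom_id}")
        # ... while the original name survives as the logical path.
        state.setdefault(digest, []).append(logical_path)
    return {
        "id": object_id,
        "digestAlgorithm": "sha512",
        "head": "v1",
        "manifest": manifest,
        "versions": {"v1": {"state": state}},
    }


def logical_paths_for_id(inventory, custom_id, version="v1"):
    """Find the logical paths of the file stored under `custom_id`."""
    content_path = f"{version}/content/{custom_id}"
    for digest, content_paths in inventory["manifest"].items():
        if content_path in content_paths:
            return inventory["versions"][version]["state"].get(digest, [])
    return []
```

The manifest and state blocks are still keyed by digest, so the content-addressable link is kept. The custom ID only appears in the content path, and resolving it goes through the digest.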

awoods commented 1 year ago

@eroux : Thanks for your comments and suggestions on the OCFL specification. It is helpful.

I have not considered the JSON keys in the manifest or state blocks to be the file's "identifier", but rather a specified "key". The importance/value of using the file's content-addressable digest as that "key" is touched on in the 3.4 Digests section. One value that I would like to highlight is that by using a file's digest the OCFL creates an inherent connection between the inventory.json and the files in the persistence layer. For the long-term robustness of an OCFL repository, having a content-addressable connection between the files and the inventory.json is critical.

As mentioned, it could be possible to use the fixity block for this purpose... which would make the fixity block mandatory instead of optional in the case that the manifest and state blocks use a custom "key". This raises the same concern for me as expressed by @zimeon :

increasing the number of implementation choices increases the complexity of the specification, increases the complexity of shared tooling, and reduces the specificity of what "implementing OCFL" means.

I would be interested to better understand the use case for using a custom unique identifier in the manifest and state blocks instead of the digest. Based on the use case, there may be other approaches to addressing the need.

zimeon commented 10 months ago

@eroux - In order to help us think about whether consideration of a change like this should be part of v2, could you describe the use case motivating this solution? From your original post I'm not sure whether the issue is compatibility with a legacy system, a concern about possible digest collisions, or something else.

eroux commented 10 months ago

@zimeon the idea is that we're designing a new system with global identifiers (global as opposed to local to an OCFL object), and I thought it would be coherent to use these unique IDs as the unique IDs of the OCFL resources.

zimeon commented 10 months ago

global identifiers for files/bytestreams that are inside OCFL objects?

eroux commented 10 months ago

yes

zimeon commented 10 months ago

I have created https://github.com/OCFL/Use-Cases/issues/47 to consider this as a use case. Please correct my interpretation if I have missed it.

zimeon commented 9 months ago

Closing this spec issue in favor of the use case discussion.