OCFL / spec

The Oxford Common File Layout (OCFL) specifications
https://ocfl.io

Adding a file size key to the inventory #629

Closed tomwrobel closed 11 months ago

tomwrobel commented 1 year ago

Following on from, but not necessarily looking to revive: https://github.com/OCFL/spec/issues/474

It would be very useful for a repository manager to know how big an OCFL object and its component binary files are on disk. It affects a lot of decisions we're likely to make regarding how to handle the object and its component files.

Given the processing work already required to generate the checksum, this seems like an opportunity to record the size of the binary file that each checksum represents. A key akin to the 'fixity' key, containing a map of key-value pairs from digest to size in bytes, might allow this, e.g.

"size": {
    "4d27c8...b53": "1213131",
    "7dcc35...c31": "83488484",
    "cf83e1...a3e": "0",
    "ffccf6...62e": "85834853845384422"
}

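As a rough illustration of the idea above, the size can be counted in the same read pass that produces the digest. This is only a sketch under assumptions: `build_size_map` is a hypothetical helper, SHA-512 is assumed as the digest algorithm, and sizes are emitted as strings to match the example.

```python
import hashlib
from pathlib import Path


def build_size_map(content_dir: str) -> dict[str, str]:
    """Map each file's SHA-512 digest to its size in bytes (as a string).

    Hypothetical helper: the name and the string-valued sizes are
    assumptions that mirror the example in this comment.
    """
    sizes: dict[str, str] = {}
    for path in sorted(Path(content_dir).rglob("*")):
        if not path.is_file():
            continue
        digest = hashlib.sha512()
        size = 0
        with path.open("rb") as fh:
            # Read in 1 MiB chunks so the size is counted in the same
            # pass that feeds the digest.
            for chunk in iter(lambda: fh.read(1 << 20), b""):
                digest.update(chunk)
                size += len(chunk)
        # Files with identical content share a digest and therefore a size.
        sizes[digest.hexdigest()] = str(size)
    return sizes
```

An empty file ends up under the SHA-512 digest of the empty string (which begins `cf83e1`) with a size of "0", as in the example.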
tomwrobel commented 1 year ago

This could be considered as an additional kind of fixity check - the file should be x bytes in size - but I suspect I'm pushing the definition of the word 'fixity' here.

zimeon commented 1 year ago

2023-06-01 Editors' discussion -- This could be done within the current specification by creating an extension that defines (as mentioned in https://github.com/OCFL/spec/issues/629#issuecomment-1543788276) a new fixity type, perhaps called size, that is simply the file size.
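For illustration, a sketch of what treating size as a fixity type could look like, assuming the fixity block keeps its usual shape of algorithm name, then value, then a list of content paths. The function name `add_size_fixity` and the algorithm key `size` are assumptions here; the extension would define the actual name.

```python
def add_size_fixity(inventory: dict, sizes_by_path: dict[str, int]) -> dict:
    """Record file sizes in the inventory's fixity block under a 'size' key.

    Hypothetical helper: 'size' as an algorithm name is an assumption,
    not something the specification defines.
    """
    fixity = inventory.setdefault("fixity", {})
    size_block = fixity.setdefault("size", {})
    for content_path, size in sizes_by_path.items():
        # The "digest" value is the byte count rendered as a string;
        # several content paths may legitimately share one value.
        paths = size_block.setdefault(str(size), [])
        if content_path not in paths:
            paths.append(content_path)
    return inventory
```

Note that two unrelated files of the same length land under the same key, which is expected for this kind of value.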

tomwrobel commented 1 year ago

@zimeon should I make a pull request against https://github.com/OCFL/extensions/blob/main/docs/0001-digest-algorithms.md ?

zimeon commented 1 year ago

Yes, the process is outlined in https://github.com/OCFL/extensions/blob/main/docs/0001-digest-algorithms.md#maintenance -- because we are not versioning extensions the PR should create a new digest algorithms extension that obsoletes 0001

tomwrobel commented 1 year ago

Spun out to https://github.com/OCFL/extensions/issues/64

srerickson commented 1 year ago

The implication of size as a fixity digest algorithm is that collisions in fixity entries are not only possible, they should be expected. I'm wondering if this represents a significant enough change in how implementers should treat the fixity block to warrant further discussion.

zimeon commented 1 year ago

Interesting question @srerickson. My feeling is that it doesn't represent a major change in how fixity should be used but I'd love to hear other thoughts. I just created a new fixture suggestion of an object that has two different files with the same MD5 value: https://github.com/OCFL/fixtures/pull/107 . Implementations have to deal with this possibility even without extension digests that might be even weaker than currently specified digests.

srerickson commented 1 year ago

@zimeon that fixture is really helpful, thanks! This issue has helped me identify a problem in my own implementation, where fixity collisions are treated as an error condition instead of being handled gracefully.

I don't mean to belabor the point, but I wonder if the implementation notes could address collisions a bit better. From this discussion, a key difference between fixity and manifest digests is that manifest digests are assumed to be collision-free, whereas collisions in fixity digests should be expected and handled gracefully. This point doesn't come across very clearly in the current fixity section which, instead, focuses on content addressability and tampering.
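To make the distinction concrete, here is a minimal sketch of detecting fixity collisions without treating them as errors. It assumes both the manifest and a single algorithm's fixity entries map a value to a list of content paths; `find_fixity_collisions` is a hypothetical name, not part of any existing implementation.

```python
def find_fixity_collisions(
    manifest: dict[str, list[str]],
    fixity_entries: dict[str, list[str]],
) -> dict[str, list[str]]:
    """Return fixity values shared by content paths with different manifest digests.

    Hypothetical helper: a collision is reported, not raised, because a
    weak fixity algorithm can give one value for different content.
    """
    # Invert the manifest: content path -> manifest digest.
    digest_for_path = {
        path: digest for digest, paths in manifest.items() for path in paths
    }
    collisions: dict[str, list[str]] = {}
    for value, paths in fixity_entries.items():
        distinct = {digest_for_path[p] for p in paths if p in digest_for_path}
        # More than one manifest digest behind a single fixity value means
        # genuinely different content collided under this algorithm.
        if len(distinct) > 1:
            collisions[value] = sorted(paths)
    return collisions
```

Paths that share a fixity value and also share a manifest digest are ordinary duplicates of the same content, so they are not reported.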

zimeon commented 1 year ago

2023-07-06 Editors' discussion -- we agree that it would be helpful to add a note to the fixity section of the Implementation Notes pointing out that fixity algorithms may generate the same value for different file content.

rosy1280 commented 11 months ago

A PR for the algorithm extension has been submitted and is being reviewed.