OCFL / spec

The Oxford Common File Layout (OCFL) specifications and website
https://ocfl.io

clarify "same object state" of version block (E066) #571

Closed srerickson closed 2 years ago

srerickson commented 2 years ago

I have a question about this line from the spec:

E066: Each version block in each prior inventory file MUST represent the same object state as the corresponding version block in the current inventory file

As I understand it, the digest algorithm can change from one inventory to the next, which means the digests in the version blocks can change. If that's true, then isn't the sense of "sameness" in this statement somewhat ambiguous? I think it may help to explain what makes two version blocks the same even when the digests differ. Something like this: The same byte stream used for a given digest in the version block of the prior inventory must be used to generate the corresponding digest in the version block of the new inventory.
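To make the ambiguity concrete, here is a minimal Python sketch. The file content and variable names are made up for illustration; the only assumption taken from the spec is that an inventory may use either sha256 or sha512 as its digest algorithm. The two state blocks describe the same bytes under the same logical path, yet they are not equal as JSON:

```python
import hashlib

# Hypothetical content of file.txt; any byte stream would do.
content = b"hello OCFL\n"

# The v1 state as it might appear in a prior inventory (sha256) and in the
# current inventory (sha512). Both entries describe the same bytes.
prior_v1_state = {hashlib.sha256(content).hexdigest(): ["file.txt"]}
current_v1_state = {hashlib.sha512(content).hexdigest(): ["file.txt"]}

# Comparing the JSON structures directly reports a difference...
json_identical = prior_v1_state == current_v1_state


def logical_paths(state):
    """Return the sorted logical paths named in a state block."""
    return sorted(p for paths in state.values() for p in paths)


# ...even though the logical paths (and the underlying bytes) are unchanged.
same_paths = logical_paths(prior_v1_state) == logical_paths(current_v1_state)

print(json_identical, same_paths)  # False True
```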

pwinckles commented 2 years ago

Let's say your versions block for v1 of an object looks like the following:

  "versions": {
    "v1": {
      "created": "2018-10-02T12:00:00Z",
      "message": "version one",
      "state": {
        "7545b8...f67": [ "file.txt" ],
        "12b348...9ac": [ "file2.txt" ]
      },
      "user": {
        "address": "alice@example.org",
        "name": "Alice"
      }
    }
  }

I believe that section of the spec is there to ensure that later versions don't do something like the following:

  "versions": {
    "v1": {
      "created": "2018-10-02T12:00:00Z",
      "message": "version one",
      "state": {
        "7545b8...f67": [ "file2.txt" ],
        "12b348...9ac": [ "file.txt" ]
      },
      "user": {
        "address": "alice@example.org",
        "name": "Alice"
      }
    },
    "v2": {
      "created": "2018-10-02T12:00:00Z",
      "message": "version two",
      "state": {
        "7545b8...f67": [ "file2.txt" ],
        "12b348...9ac": [ "file.txt" ],
        "3b456a...111": [ "file3.txt" ]
      },
      "user": {
        "address": "alice@example.org",
        "name": "Alice"
      }
    }
  }

In this case, both the v1 and v2 inventories would validate in isolation. However, the v2 inventory is invalid by E066 because it changes the state of v1.

I think the text that you suggested is too focused on accounting for the case where the inventory digest algorithm changes, which is not necessary for there to be a violation of E066.
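As a rough illustration of the violation above, the following Python sketch inverts each state block into a map from logical path to digest and reports the paths whose digest differs. It assumes both inventories use the same digest algorithm, which holds for this example but not in general. The function name `paths_to_digest` is hypothetical, and the digests are the truncated placeholders from the JSON above.

```python
def paths_to_digest(state):
    """Invert a state block: logical path -> digest."""
    return {path: digest for digest, paths in state.items() for path in paths}


# v1 state as recorded in the v1 inventory.
v1_in_v1_inventory = {
    "7545b8...f67": ["file.txt"],
    "12b348...9ac": ["file2.txt"],
}

# v1 state as recorded in the v2 inventory (paths swapped).
v1_in_v2_inventory = {
    "7545b8...f67": ["file2.txt"],
    "12b348...9ac": ["file.txt"],
}

before = paths_to_digest(v1_in_v1_inventory)
after = paths_to_digest(v1_in_v2_inventory)

# Logical paths that point at different content in the two inventories.
changed = sorted(
    p for p in before.keys() | after.keys() if before.get(p) != after.get(p)
)
print(changed)  # ['file.txt', 'file2.txt']
```

Both state blocks contain the same set of digests and the same set of logical paths, so a check that only compares those sets would miss the swap.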

srerickson commented 2 years ago

Thanks for the response @pwinckles. Do you think the language of E066 could be improved by stating explicitly that a change in the digest algorithm is not a violation? I think it might because of the ambiguity I described.

The concern I have is that it's easy for validator authors to misinterpret this part of the spec as saying that the JSON for version states should be equivalent across inventories -- or to otherwise misinterpret "same object state" (that's based on personal experience 😀).

pwinckles commented 2 years ago

Yes, I agree that the intent of "same object state" could be more clear.

pwinckles commented 2 years ago

When/if this is addressed, perhaps the question of Unicode normalization could also be addressed? As noted in point three in https://github.com/OCFL/spec/issues/559:

The spec states "Each version block in each prior inventory file MUST represent the same object state as the corresponding version block in the current inventory file." In case of logical paths, is it up to the implementation to decide if this is a byte-for-byte comparison or a normalized comparison?
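To show why the choice matters, here is a small Python sketch using the standard `unicodedata` module. The file name and the helper names `same_bytes` and `same_normalized` are made up for illustration. The sketch does not say which comparison the spec intends, only that the two can disagree:

```python
import unicodedata

# The same visible file name in two Unicode forms (hypothetical example).
composed = "caf\u00e9.txt"     # 'é' as a single code point (NFC form)
decomposed = "cafe\u0301.txt"  # 'e' followed by a combining acute accent (NFD form)


def same_bytes(a, b):
    """Byte-for-byte comparison of the UTF-8 encodings."""
    return a.encode("utf-8") == b.encode("utf-8")


def same_normalized(a, b, form="NFC"):
    """Comparison after applying the same Unicode normalization form to both."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


print(same_bytes(composed, decomposed))       # False
print(same_normalized(composed, decomposed))  # True
```

A validator doing the byte-for-byte comparison would report an E066 violation if a later inventory re-encoded this logical path, while one comparing normalized forms would not.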

zimeon commented 2 years ago

I think this situation might be clearer if we changed the spec to say:

Each version block in each prior inventory file MUST represent the same ~object~ logical state as the corresponding version block in the current inventory file.

because we define "logical state" as logical paths tied to bitstreams, not dependent upon the digest algorithm, whereas "object state" is not formally defined.

I agree that changes in digest algorithm between inventories are fine, and do not create a problem meeting this condition. For example, the approach my validator code uses to check for E066 is to create maps from logical paths, in a particular version state, to content files (thus taking the digests entirely out of the check).

(I also have code that provides extra debugging info using digest values, in the case that the digest algorithms do match between versions, but that isn't necessary to detect an error.)
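A minimal Python sketch of that kind of check follows. It is not the actual validator code: the function names, the placeholder digests, and the rule "two entries match if they share at least one content file" are assumptions made for illustration. It resolves each logical path to content paths through the inventory's `manifest`, so the digest values never need to match across inventories.

```python
import copy


def logical_to_content(inventory, version):
    """Map each logical path in a version state to its set of content paths."""
    manifest = inventory["manifest"]
    state = inventory["versions"][version]["state"]
    return {
        logical: set(manifest[digest])
        for digest, logicals in state.items()
        for logical in logicals
    }


def versions_with_changed_state(prior, current):
    """Return versions from `prior` whose logical state differs in `current`."""
    changed = []
    for version in prior["versions"]:
        if version not in current["versions"]:
            changed.append(version)
            continue
        old = logical_to_content(prior, version)
        new = logical_to_content(current, version)
        # Same logical paths, and each one resolves to a shared content file.
        same = old.keys() == new.keys() and all(old[p] & new[p] for p in old)
        if not same:
            changed.append(version)
    return changed


# Hypothetical inventories; "a1", "A1", etc. stand in for real digests.
prior = {
    "digestAlgorithm": "sha256",
    "manifest": {"a1": ["v1/content/file.txt"], "b2": ["v1/content/file2.txt"]},
    "versions": {"v1": {"state": {"a1": ["file.txt"], "b2": ["file2.txt"]}}},
}
current_ok = {
    "digestAlgorithm": "sha512",
    "manifest": {
        "A1": ["v1/content/file.txt"],
        "B2": ["v1/content/file2.txt"],
        "C3": ["v2/content/file3.txt"],
    },
    "versions": {
        "v1": {"state": {"A1": ["file.txt"], "B2": ["file2.txt"]}},
        "v2": {
            "state": {"A1": ["file.txt"], "B2": ["file2.txt"], "C3": ["file3.txt"]}
        },
    },
}
# Same as current_ok, but with the v1 logical paths swapped.
current_bad = copy.deepcopy(current_ok)
current_bad["versions"]["v1"]["state"] = {"A1": ["file2.txt"], "B2": ["file.txt"]}

print(versions_with_changed_state(prior, current_ok))   # []
print(versions_with_changed_state(prior, current_bad))  # ['v1']
```

The change of digest algorithm from sha256 to sha512 does not trigger a report, while the swapped logical paths in v1 do.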