lnielsen opened this issue 5 years ago
Depending on the underlying storage system you have in place your disks may be doing this deduplication transparently.
How do you handle this case at the moment?
Our storage system doesn't handle it (it's http://eos.web.cern.ch with some 400PB of disk space). Essentially, if e.g. hard links were allowed, a system operating on the OCFL objects probably wouldn't even know that the content is deduplicated.
The problem lies either with the requirement not to use hard links:
Hard and soft (symbolic) links are not portable and must not be used within
OCFL Storage hierarchies. A common use case for links is storage deduplication.
OCFL inventories provide a portable method of achieving the same effect by using
digests to address content.
or with the assumed linear versioning.
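For reference, within a single object the digest-based approach does deduplicate cleanly; a minimal sketch (digests shortened, structure abridged, not a complete inventory):

# Minimal sketch of within-object deduplication via the inventory.
# Digests are shortened and paths illustrative.
inventory = {
    "manifest": {
        # the file is stored exactly once, under v1's content directory
        "aaa111": ["v1/content/data-01.zip"],
    },
    "versions": {
        "v1": {"state": {"aaa111": ["data-01.zip"]}},
        # v2 carries the file forward by digest without a second copy
        "v2": {"state": {"aaa111": ["data-01.zip"]}},
    },
}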
Note, I've been discussing this with @neilsjefferies IRL as well.
I think this is a special case of reference to a file/datastream external to an OCFL object. This is partly discussed in https://github.com/OCFL/Use-Cases/issues/27 but I created https://github.com/OCFL/Use-Cases/issues/35 to separate out the idea of an external file. IMO this is out-of-scope for v1 but we should revisit when considering scope of v2.
A reference to a file/datastream in another OCFL object could solve the issue. My general thinking here is that a reference to a file/datastream anywhere is not a good idea; instead, it should be constrained to the OCFL storage root.
I fully understand that you want to get v1 out the door. Just know that this is kind of a show stopper for using OCFL for us, so a quick v2 release afterwards would be much appreciated. We have 1.4 million OCFL objects and 300TB of data to write, so I'd prefer not having to rewrite them :-) Obviously, I'm happy to help out, in case there's anything I can do to accelerate it.
Thanks, @lnielsen. Taking a step back, for clarification, what is the rationale for your decision of:
The way we have thought about mapping Zenodo to OCFL is that each DOI is essentially an OCFL object.
It is conceivable that separate versions of a single OCFL Object could have their own DOIs.
@awoods It's related to the two levels of versioning, which I call versioning and revisions, and the fact that they can happen in different sequences (e.g. v1.0, v2.0, v1.1 or v1.0, v1.1, v2.0).
I'll try to see if I can give a clear example 😄 and of course don't hesitate to let me know if there's something obvious that I just haven't seen.
If I change my initial example to use a single OCFL object it would look like this (after the three actions):
[multi-doi-object]
├── 0=ocfl_object_1.0
├── inventory.json
├── inventory.json.sha512
├── v1 # 10.5281/zenodo.1234
│   ├── inventory.json
│   ├── inventory.json.sha512
│   └── content
│       ├── data-01.zip
│       └── mishap.zip
├── v2 # 10.5281/zenodo.4321
│   ├── inventory.json
│   ├── inventory.json.sha512
│   └── content
│       └── data-02.zip
└── v3 # 10.5281/zenodo.1234
    ├── inventory.json
    ├── inventory.json.sha512
    └── content
So far so good. I've managed to represent the changes in an OCFL object.
Now let's switch the order of actions from 1, 2, 3 to 1, 3, 2. My OCFL object would instead look like this:
[multi-doi-object]
├── 0=ocfl_object_1.0
├── inventory.json
├── inventory.json.sha512
├── v1 # 10.5281/zenodo.1234
│   ├── inventory.json
│   ├── inventory.json.sha512
│   └── content
│       ├── data-01.zip
│       └── mishap.zip
├── v2 # 10.5281/zenodo.1234
│   ├── inventory.json
│   ├── inventory.json.sha512
│   └── content
└── v3 # 10.5281/zenodo.4321
    ├── inventory.json
    ├── inventory.json.sha512
    └── content
        └── data-02.zip
So far so good as well. I've achieved deduplication of the big file.
The problem I see with this structure is that it's non-trivial/non-intuitive to find the latest state of a specific DOI, and it thus requires interpretation on top of OCFL in order to be understandable. The reason for using OCFL in the first place is to have a self-evident structure that requires no knowledge other than OCFL.
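To make the needed interpretation concrete, here is a rough sketch in Python. It assumes, purely for illustration, that each version records its DOI in the version message (as the # comments in the trees above suggest); nothing in OCFL itself defines this.

def latest_version_for_doi(inventory: dict, doi: str) -> str | None:
    """Return the newest OCFL version whose message mentions the DOI.
    Assumes (hypothetically) that each version's "message" records its DOI."""
    latest = None
    for name, version in inventory.get("versions", {}).items():
        if doi in version.get("message", ""):
            # version names must be compared numerically ("v10" > "v9")
            if latest is None or int(name[1:]) > int(latest[1:]):
                latest = name
    return latest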
Similarly, I could also imagine hacks to make things work, like writing a completely new OCFL object and deleting the old one. But then performance would be an issue.
Hi @lnielsen! We have this issue at Stanford ("In the Zenodo case, our need for deduplication is essentially between versions, because that's where a user may only add 1GB to a 100TB dataset.") and don't have a perfect solution, but we have approached it in two ways:
1. If a user inadvertently accessions personal info in an object, we have to purge the entire object from SDR and re-accession it with the same identifier and cleaned content. It's a pain to do (deletes are hard by design!) but it's the only way to truly purge sensitive data from an existing object.
2. For incremental additions to large datasets, we try to break the dataset into smaller logical pieces (still in zip files, but not one big zip for the entire dataset), as sketched below. This also requires some curatorial intervention, but we've found that it provides a slightly better user experience, especially for downloading the dataset. It also gives us a chance that future dataset changes impact only a handful of prior zips (or maybe even none at all!), allowing us to leverage the incremental diff feature of Moab (which OCFL also implements).
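A rough illustration of that chunking approach, assuming a split by top-level directory; the split rule, paths, and function name are illustrative, not SDR's actual tooling:

import zipfile
from collections import defaultdict
from pathlib import Path

def chunk_dataset(src: Path, dest: Path) -> None:
    """Write one zip per top-level entry of src, so a change within one
    directory only requires rewriting that directory's zip."""
    groups: dict[str, list[Path]] = defaultdict(list)
    for f in src.rglob("*"):
        if f.is_file():
            # group by first path component (loose files get their own zip)
            groups[f.relative_to(src).parts[0]].append(f)
    for name, files in groups.items():
        with zipfile.ZipFile(dest / f"{name}.zip", "w") as z:
            for f in files:
                z.write(f, f.relative_to(src))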
Copied from use-cases... general musing, so not completely thought out.
I can imagine a minor modification to the inventory that adds "inherits from ObjectID" type sections to the manifest. The digests that follow identify paths in other OCFL object(s). Other than that, nothing else needs to change. When copying an object, parsing the manifest tells you which additional objects it has dependencies on. It would permit version forking and inter-object deduplication. This does mean that if object versions are not stored as single units then each version gets a new ID; this is not necessarily a bad thing.
...this might also be adapted to include "inherits from external_storage_path" in some form.
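A rough sketch of what such a modified inventory could look like; the inheritsFrom key and its shape are invented here purely to illustrate the musing:

# Hypothetical inventory fragment: an "inherits from ObjectID" section
# alongside the normal manifest. Key name and shape are invented.
inventory_sketch = {
    "manifest": {
        "4d27c8...b53": ["v1/content/local-file.xml"],
    },
    "inheritsFrom": {
        # digests below resolve against this parent object's manifest
        "ark:/67890/fgh123": {
            "df83e1...a3e": ["bigdata.dat"],
        },
    },
}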
In advance of version 2 of the OCFL, we are soliciting feedback on use cases. Please feel free to add your thoughts on this use case via the comments.
In addition to reviewing comments, we are doing an informal poll for each use case that has been tagged as Proposed: In Scope for version 2. You can contribute to the poll for this use case by reacting to this comment. The following reactions are supported:
In favor of the use case | Against the use case | Neutral on the use case
--- | --- | ---
👍🏼 | 👎🏼 | 👀
The poll will remain open through the end of February 2024.
This could be quite complicated if #42 (file deletion) also makes it into v2. Implementations would need to handle (or prevent) deletion of inherited files in the parent. Bi-directional references (both child-to-parent and parent-to-child) would make it easier to understand the downstream consequences of a file deletion.
Just adding a new key "inherits", containing a list of object ids including the version, to the basic inventory structure should not be problematic and won't interfere with any other features. On the same level, there could be a "deprecates" key too.
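For illustration, such an extension might look like the following sketch; the entry shape (objectid plus version) is invented here, since the comment doesn't specify a format:

# Sketch of the proposed top-level keys; entry shapes are illustrative.
inventory_extension = {
    "id": "ark:/12345/bcd987",
    # object ids including the version, as proposed
    "inherits": [{"objectid": "ark:/67890/fgh123", "version": "v2"}],
    "deprecates": [{"objectid": "ark:/12345/xyz000", "version": "v1"}],
}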
At the time of this comment the vote tallied to +3. Confirming this as in scope for version 2; of course, how to do that is still a question.
These notes reference the Object Forking Use Case, which is Use Case 44. The use case is supported via content addressable storage. This introduces the concept of a parent (the original object) and a child (the object that is forked from the original object).
- References to parent content are recorded in the manifest block of the inheriting child object.
- The state block lists the logical path as normal, allowing users to change the file name when inherited from a parent.
- Inherited files are not covered by the child's fixity block, and the verifier must look up the parent object.
- A child object must use the same "digestAlgorithm" as all parent objects.
When a parent object is deleted:
- […] the ocfl_layout.json file.
When a referenced file is deleted in a parent object:
- The state block of a child object then references a deleted file in a parent object.
Question:
- […] the inventory.json of the child object?
When a file is corrupted in a parent object:
- […]
An example inventory for a child object:
{
  "digestAlgorithm": "sha512",
  "head": "v3",
  "id": "ark:/12345/bcd987",
  "manifest": {
    "4d27c8...b53": [ "v2/content/foo/bar.xml" ],
    "7dcc35...c31": [ { "objectid": "ark:/67890/fgh123" } ],
    "df83e1...a3e": [ { "objectid": "ark:/67890/fgh123" } ],
    "ffccf6...62e": [ { "objectid": "ark:/67890/fgh123" } ]
  },
  "type": "https://ocfl.io/1.1/spec/#inventory",
  "versions": {
    "v1": {
      "created": "2018-01-01T01:01:01Z",
      "message": "Initial import. bar.xml, bigdata.dat and image.tiff are inherited from a parent object.",
      "state": {
        "7dcc35...c31": [ "foo/bar.xml" ],
        "df83e1...a3e": [ "bigdata.dat" ],
        "ffccf6...62e": [ "image.tiff" ]
      },
      "user": {
        "address": "mailto:alice@example.com",
        "name": "Alice"
      }
    },
    "v2": {
      "created": "2018-02-02T02:02:02Z",
      "message": "Fix bar.xml replacing import with a local edit, remove image.tiff",
      "state": {
        "4d27c8...b53": [ "foo/bar.xml" ],
        "df83e1...a3e": [ "bigdata.dat" ]
      },
      "user": {
        "address": "mailto:bob@example.com",
        "name": "Bob"
      }
    },
    "v3": {
      "created": "2018-03-03T03:03:03Z",
      "message": "Reinstate image.tiff",
      "state": {
        "4d27c8...b53": [ "foo/bar.xml" ],
        "df83e1...a3e": [ "bigdata.dat" ],
        "ffccf6...62e": [ "image.tiff" ]
      },
      "user": {
        "address": "mailto:cecilia@example.com",
        "name": "Cecilia"
      }
    }
  }
}
Would it be a solution to change the manifest definition from
The value for each key MUST be an array containing the content paths of files in the OCFL Object that have content with the given digest
to
The value for each key MUST be an array containing the content paths of files in the OCFL Object that have content with the given digest, or a URI that refers to exactly one object.
This would mean that there's just a URI check (a colon in the string) needed to figure out whether the file is inside the OCFL object or remote. Having a union type as a value is quite hard to implement.
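A sketch of that dispatch, assuming local content paths never contain a colon (the property that makes the check unambiguous):

def classify_manifest_entry(entry: str) -> str:
    """Classify a manifest array entry under the proposed rule.
    Assumes local content paths never contain ':', so any colon
    marks a URI referring to another object."""
    return "uri" if ":" in entry else "content-path"

# classify_manifest_entry("v2/content/foo/bar.xml")  -> "content-path"
# classify_manifest_entry("ark:/67890/fgh123")       -> "uri"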
I agree with @je4: my preference would be to avoid designs where a schema value can have more than one possible type (i.e., string or JSON object). Besides the suggestion from @je4 above, another approach would be to define manifest values as objects, like:
{
  "4d27c8...b53": { "paths": ["v2/content/foo/bar.xml"] },
  "7dcc35...c31": { "id": "ark:/67890/fgh123" },
  "df83e1...a3e": { "id": "ark:/67890/fgh123" },
  "ffccf6...62e": { "id": "ark:/67890/fgh123" }
}
Yet another approach:
...
"manifest": {
"4d27c8...b53": ["v2/content/foo/bar.xml"],
},
"refs": {
"7dcc35...c31": "ark:/67890/fgh123",
"df83e1...a3e": "ark:/67890/fgh123",
"ffccf6...62e": "ark:/67890/fgh123"
}
...
The idea here is to add a new key in the inventory (e.g., refs) for references to other objects. Digests in the version state must be included in either the manifest or the refs block.
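A sketch of the resulting lookup, using the key names from this proposal: every state digest resolves through exactly one of the two blocks, so each value keeps a single type.

def resolve_digest(inventory: dict, digest: str) -> tuple[str, object]:
    """Resolve a version-state digest to local content paths, or to the
    id of the object that holds the content (key names as proposed)."""
    if digest in inventory.get("manifest", {}):
        return ("local", inventory["manifest"][digest])    # list of paths
    if digest in inventory.get("refs", {}):
        return ("referenced", inventory["refs"][digest])   # other object id
    raise KeyError(f"digest {digest} in neither manifest nor refs")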
In Zenodo we have a use case where we have two layers of versioning. A user can publish a dataset on Zenodo, which will get a DOI. A new version of the dataset can be published by the user, which will get a new DOI. This way a DOI always points to a locked set of digital files. Occasionally, however, we have the need to change files of an already published dataset with a DOI (e.g. a user accidentally included personal data in the dataset and discovered it 2 months later). Essentially this means we have two layers of versioning in Zenodo, which I'll call versioning and revisions.
In the Zenodo case, our need for deduplication is essentially between versions, because that's where a user may only add 1GB to a 100TB dataset.
The way we have thought about mapping Zenodo to OCFL is that each DOI is essentially an OCFL object. Because OCFL only supports deduplication within an OCFL object, not between OCFL objects, and does not allow symlinks, we cannot do this deduplication.
Example
Imagine these actions:
1. Publish dataset 10.5281/zenodo.1234 with the files data-01.zip and mishap.zip.
2. Publish a new version, 10.5281/zenodo.4321, adding data-02.zip (its files are thus data-01.zip and data-02.zip).
3. Remove mishap.zip from 10.5281/zenodo.1234.
The OCFL objects would be: one object for 10.5281/zenodo.1234 (v1 with data-01.zip and mishap.zip; v2 with mishap.zip removed) and one object for 10.5281/zenodo.4321 (v1 with data-01.zip and data-02.zip).
What I would like is to not have to duplicate data-01.zip in the 10.5281/zenodo.4321 OCFL object. Is there a solution for this in OCFL, or a different way to construct our OCFL objects that could support this?