lnielsen opened this issue 5 years ago
Depending on the underlying storage system you have in place your disks may be doing this deduplication transparently.
How do you handle this case at the moment?
Our storage system doesn't handle it (it's http://eos.web.cern.ch with some 400PB of disk space). Essentially, if e.g. hard links were allowed, a system operating on the OCFL objects probably wouldn't even know that the content is deduplicated.
The problem lies either with the requirement not to use hard links:
Hard and soft (symbolic) links are not portable and must not be used within
OCFL Storage hierarchies. A common use case for links is storage deduplication.
OCFL inventories provide a portable method of achieving the same effect by using
digests to address content.
or with the assumed linear versioning.
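For reference, within a single object the digest-based approach does deduplicate cleanly; a minimal sketch (digests shortened, structure abridged, not a complete inventory):

# Minimal sketch of within-object deduplication via the inventory.
# Digests are shortened and paths illustrative.
inventory = {
    "manifest": {
        # the file is stored exactly once, under v1's content directory
        "aaa111": ["v1/content/data-01.zip"],
    },
    "versions": {
        "v1": {"state": {"aaa111": ["data-01.zip"]}},
        # v2 carries the file forward by digest without a second copy
        "v2": {"state": {"aaa111": ["data-01.zip"]}},
    },
}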
Note, I've been discussing this with @neilsjefferies IRL as well.
I think this is a special case of reference to a file/datastream external to an OCFL object. This is partly discussed in https://github.com/OCFL/Use-Cases/issues/27 but I created https://github.com/OCFL/Use-Cases/issues/35 to separate out the idea of an external file. IMO this is out-of-scope for v1 but we should revisit when considering scope of v2.
A reference to a file/datastream in another OCFL object could solve the issue. My general thinking here is that a reference to a file/datastream anywhere is not a good idea; instead, it should be constrained to the OCFL storage root.
I fully understand that you want to get v1 out the door. Just know that this is kind of a show stopper for using OCFL for us, so a quick v2 release afterwards would be much appreciated. We have 1.4 million OCFL objects and 300TB of data to write, so I'd prefer not having to rewrite them :-) Obviously, I'm happy to help out, in case there's anything I can do to accelerate it.
Thanks, @lnielsen. Taking a step back, for clarification, what is the rationale for your decision of:
The way we have thought about mapping Zenodo to OCFL is that each DOI is essentially an OCFL object.
It is conceivable that separate versions of a single OCFL Object could have their own DOIs.
@awoods It's related to the two levels of versioning, which I call versioning and revisions, and the fact that they can happen in different sequences (e.g. v1.0, v2.0, v1.1 or v1.0, v1.1, v2.0).
I'll try to see if I can give a clear example 😄 and of course don't hesitate to let me know if there's something obvious that I just haven't seen.
If I change my initial example to use a single OCFL object it would look like this (after the three actions):
[multi-doi-object]
├── 0=ocfl_object_1.0
├── inventory.json
├── inventory.json.sha512
├── v1 # 10.5281/zenodo.1234
│   ├── inventory.json
│   ├── inventory.json.sha512
│   └── content
│       ├── data-01.zip
│       └── mishap.zip
├── v2 # 10.5281/zenodo.4321
│   ├── inventory.json
│   ├── inventory.json.sha512
│   └── content
│       └── data-02.zip
└── v3 # 10.5281/zenodo.1234
    ├── inventory.json
    ├── inventory.json.sha512
    └── content
So far so good. I've managed to represent the changes in an OCFL object.
Now let's switch the order of actions from 1, 2, 3 to 1, 3, 2. My OCFL object would instead look like this:
[multi-doi-object]
├── 0=ocfl_object_1.0
├── inventory.json
├── inventory.json.sha512
├── v1 # 10.5281/zenodo.1234
│   ├── inventory.json
│   ├── inventory.json.sha512
│   └── content
│       ├── data-01.zip
│       └── mishap.zip
├── v2 # 10.5281/zenodo.1234
│   ├── inventory.json
│   ├── inventory.json.sha512
│   └── content
└── v3 # 10.5281/zenodo.4321
    ├── inventory.json
    ├── inventory.json.sha512
    └── content
        └── data-02.zip
So far so good as well. I've achieved deduplication of the big file.
The problem I see with this structure is that it's non-trivial/non-intuitive to find the latest state of a specific DOI, and it thus requires interpretation on top of OCFL in order to be understandable. The reason for using OCFL in the first place is to have a self-evident structure that requires no knowledge other than OCFL.
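To make the needed interpretation concrete, here is a rough sketch in Python. It assumes, purely for illustration, that each version records its DOI in the version message (as the # comments in the trees above suggest); nothing in OCFL itself defines this.

def latest_version_for_doi(inventory: dict, doi: str) -> str | None:
    """Return the newest OCFL version whose message mentions the DOI.
    Assumes (hypothetically) that each version's "message" records its DOI."""
    latest = None
    for name, version in inventory.get("versions", {}).items():
        if doi in version.get("message", ""):
            # version names must be compared numerically ("v10" > "v9")
            if latest is None or int(name[1:]) > int(latest[1:]):
                latest = name
    return latest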
Similarly, I could also imagine hacks to make things work, like writing a completely new OCFL object and deleting the old one. But then performance would be an issue.
Hi @lnielsen! We have this issue at Stanford ("In the Zenodo case, our need for deduplication is essentially between versions, because that's where a user may only add 1GB to a 100TB dataset.") and don't have a perfect solution, but we have approached it in two ways:
1. If a user inadvertently accessions personal info in an object, we have to purge the entire object from SDR and re-accession it with the same identifier and cleaned content. It's a pain to do (deletes are hard by design!) but it's the only way to truly purge sensitive data from an existing object.
2. For incremental additions to large datasets, we try to break the dataset into smaller logical pieces (still in zip files, but not one big zip for the entire dataset), as sketched below. This also requires some curatorial intervention, but we've found that it provides a slightly better user experience, especially for downloading the dataset. It also gives us a chance that future dataset changes impact only a handful of prior zips (or maybe even none at all!), allowing us to leverage the incremental diff feature of Moab (which OCFL also implements).
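A rough illustration of that chunking approach, assuming a split by top-level directory; the split rule, paths, and function name are illustrative, not SDR's actual tooling:

import zipfile
from collections import defaultdict
from pathlib import Path

def chunk_dataset(src: Path, dest: Path) -> None:
    """Write one zip per top-level entry of src, so a change within one
    directory only requires rewriting that directory's zip."""
    groups: dict[str, list[Path]] = defaultdict(list)
    for f in src.rglob("*"):
        if f.is_file():
            # group by first path component (loose files get their own zip)
            groups[f.relative_to(src).parts[0]].append(f)
    for name, files in groups.items():
        with zipfile.ZipFile(dest / f"{name}.zip", "w") as z:
            for f in files:
                z.write(f, f.relative_to(src))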
Copied from use-cases... general musing, so not completely thought out.
I can imagine a minor modification to the inventory that adds "inherits from ObjectID" type sections to the manifest. The digests that follow identify paths in other OCFL object(s). Other than that, nothing else needs to change. When copying an object, parsing the manifest tells you which additional objects it has dependencies on. It would permit version forking and inter-object deduplication. This does mean that if object versions are not stored as single units then each version gets a new ID; this is not necessarily a bad thing.
...this might also be adapted to include "inherits from external_storage_path" in some form.
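A rough sketch of what such a modified inventory could look like; the inheritsFrom key and its shape are invented here purely to illustrate the musing:

# Hypothetical inventory fragment: an "inherits from ObjectID" section
# alongside the normal manifest. Key name and shape are invented.
inventory_sketch = {
    "manifest": {
        "4d27c8...b53": ["v1/content/local-file.xml"],
    },
    "inheritsFrom": {
        # digests below resolve against this parent object's manifest
        "ark:/67890/fgh123": {
            "df83e1...a3e": ["bigdata.dat"],
        },
    },
}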
In advance of version 2 of the OCFL, we are soliciting feedback on use cases. Please feel free to add your thoughts on this use case via the comments.
In addition to reviewing comments, we are doing an informal poll for each use case that has been tagged as Proposed: In Scope for version 2. You can contribute to the poll for this use case by reacting to this comment. The following reactions are supported:
In favor of the use case | Against the use case | Neutral on the use case
--- | --- | ---
👍🏼 | 👎🏼 | 👀
The poll will remain open through the end of February 2024.
This could be quite complicated if #42 (file deletion) also makes it into v2. Implementations would need to handle (or prevent) deletion of inherited files in the parent. Bi-directional references (both child-to-parent and parent-to-child) would make it easier to understand the downstream consequences of a file deletion.
Just adding a new key "inherits", containing a list of object ids including the version, to the basic inventory structure should not be problematic and won't interfere with any other features. On the same level, there could be a "deprecates" key too.
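For illustration, such an extension might look like the following sketch; the entry shape (objectid plus version) is invented here, since the comment doesn't specify a format:

# Sketch of the proposed top-level keys; entry shapes are illustrative.
inventory_extension = {
    "id": "ark:/12345/bcd987",
    # object ids including the version, as proposed
    "inherits": [{"objectid": "ark:/67890/fgh123", "version": "v2"}],
    "deprecates": [{"objectid": "ark:/12345/xyz000", "version": "v1"}],
}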
At the time of this comment the vote tallied to +3. Confirming this as in scope for version 2; of course, how to do that is still a question.
These notes reference the Object Forking Use Case, which is Use Case 44. The use case is supported via content addressable storage. This introduces the concept of a parent (the original object) and a child (the object that is forked from the original object).
- References to parent content are recorded in the manifest block of the inheriting child object.
- The state block lists the logical path as normal, allowing users to change the file name when inherited from a parent.
- Inherited files are not covered by the child's fixity block, and the verifier must look up the parent object.
- A child object must use the same "digestAlgorithm" as all parent objects.
When a parent object is deleted:
- […] the ocfl_layout.json file.
When a referenced file is deleted in a parent object:
- The state block of a child object then references a deleted file in a parent object.
Question:
- […] the inventory.json of the child object?
When a file is corrupted in a parent object:
- […]
An example inventory for a child object:
{
  "digestAlgorithm": "sha512",
  "head": "v3",
  "id": "ark:/12345/bcd987",
  "manifest": {
    "4d27c8...b53": [ "v2/content/foo/bar.xml" ],
    "7dcc35...c31": [ { "objectid": "ark:/67890/fgh123" } ],
    "df83e1...a3e": [ { "objectid": "ark:/67890/fgh123" } ],
    "ffccf6...62e": [ { "objectid": "ark:/67890/fgh123" } ]
  },
  "type": "https://ocfl.io/1.1/spec/#inventory",
  "versions": {
    "v1": {
      "created": "2018-01-01T01:01:01Z",
      "message": "Initial import. bar.xml, bigdata.dat and image.tiff are inherited from a parent object.",
      "state": {
        "7dcc35...c31": [ "foo/bar.xml" ],
        "df83e1...a3e": [ "bigdata.dat" ],
        "ffccf6...62e": [ "image.tiff" ]
      },
      "user": {
        "address": "mailto:alice@example.com",
        "name": "Alice"
      }
    },
    "v2": {
      "created": "2018-02-02T02:02:02Z",
      "message": "Fix bar.xml replacing import with a local edit, remove image.tiff",
      "state": {
        "4d27c8...b53": [ "foo/bar.xml" ],
        "df83e1...a3e": [ "bigdata.dat" ]
      },
      "user": {
        "address": "mailto:bob@example.com",
        "name": "Bob"
      }
    },
    "v3": {
      "created": "2018-03-03T03:03:03Z",
      "message": "Reinstate image.tiff",
      "state": {
        "4d27c8...b53": [ "foo/bar.xml" ],
        "df83e1...a3e": [ "bigdata.dat" ],
        "ffccf6...62e": [ "image.tiff" ]
      },
      "user": {
        "address": "mailto:cecilia@example.com",
        "name": "Cecilia"
      }
    }
  }
}
Would it be a solution to change the manifest definition from
The value for each key MUST be an array containing the content paths of files in the OCFL Object that have content with the given digest
to
The value for each key MUST be an array containing the content paths of files in the OCFL Object that have content with the given digest, or a URI that refers to exactly one object.
This would mean that there's just a URI check (a colon in the string) needed to figure out whether the file is inside the OCFL object or remote. Having a union type as a value is quite hard to implement.
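A sketch of that dispatch, assuming local content paths never contain a colon (the property that makes the check unambiguous):

def classify_manifest_entry(entry: str) -> str:
    """Classify a manifest array entry under the proposed rule.
    Assumes local content paths never contain ':', so any colon
    marks a URI referring to another object."""
    return "uri" if ":" in entry else "content-path"

# classify_manifest_entry("v2/content/foo/bar.xml")  -> "content-path"
# classify_manifest_entry("ark:/67890/fgh123")       -> "uri"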
I agree with @je4: my preference would be to avoid designs where a schema value can have more than one possible type (i.e., string or JSON object). Besides the suggestion from @je4 above, another approach would be to define manifest values as objects, like:
{
  "4d27c8...b53": { "paths": ["v2/content/foo/bar.xml"] },
  "7dcc35...c31": { "id": "ark:/67890/fgh123" },
  "df83e1...a3e": { "id": "ark:/67890/fgh123" },
  "ffccf6...62e": { "id": "ark:/67890/fgh123" }
}
Yet another approach:
...
"manifest": {
"4d27c8...b53": ["v2/content/foo/bar.xml"],
},
"refs": {
"7dcc35...c31": "ark:/67890/fgh123",
"df83e1...a3e": "ark:/67890/fgh123",
"ffccf6...62e": "ark:/67890/fgh123"
}
...
The idea here is to add a new key in the inventory (e.g., refs) for references to other objects. Digests in the version state must be included in either the manifest or the refs block.
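A sketch of the resulting lookup, using the key names from this proposal: every state digest resolves through exactly one of the two blocks, so each value keeps a single type.

def resolve_digest(inventory: dict, digest: str) -> tuple[str, object]:
    """Resolve a version-state digest to local content paths, or to the
    id of the object that holds the content (key names as proposed)."""
    if digest in inventory.get("manifest", {}):
        return ("local", inventory["manifest"][digest])    # list of paths
    if digest in inventory.get("refs", {}):
        return ("referenced", inventory["refs"][digest])   # other object id
    raise KeyError(f"digest {digest} in neither manifest nor refs")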
In Zenodo we have a use case where we have two layers of versioning. A user can publish a dataset on Zenodo, which will get a DOI. A new version of the dataset can be published by the user, which will get a new DOI. This way a DOI always points to a locked set of digital files. Occasionally, however, we have the need to change files of an already published dataset with a DOI (e.g. a user accidentally included personal data in the dataset and discovered it 2 months later). Essentially this means we have two layers of versioning in Zenodo, which I'll call versioning and revisions.
In the Zenodo case, our need for deduplication is essentially between versions, because that's where a user may only add 1GB to a 100TB dataset.
The way we have thought about mapping Zenodo to OCFL is that each DOI is essentially an OCFL object. Because OCFL only supports deduplication within an OCFL object, not between OCFL objects, and does not allow symlinks, we cannot do this deduplication.
Example
Imagine these actions:
1. Publish dataset 10.5281/zenodo.1234 with the files data-01.zip and mishap.zip.
2. Publish a new version, 10.5281/zenodo.4321, adding data-02.zip (its files are thus data-01.zip and data-02.zip).
3. Remove mishap.zip from 10.5281/zenodo.1234.
The OCFL objects would be: one object for 10.5281/zenodo.1234 (v1 with data-01.zip and mishap.zip; v2 with mishap.zip removed) and one object for 10.5281/zenodo.4321 (v1 with data-01.zip and data-02.zip).
What I would like is to not have to duplicate data-01.zip in the 10.5281/zenodo.4321 OCFL object. Is there a solution for this in OCFL, or a different way to construct our OCFL objects that could support this?