OCFL / spec

The Oxford Common File Layout (OCFL) specifications
https://ocfl.io

Disk space usage and version inventory files #367

Closed bcail closed 4 years ago

bcail commented 5 years ago

Some in the Fedora community have run some rough tests that analyze disk usage for OCFL objects with many versions and/or many files. Here are some numbers from @pwinckles:

I just ran another test where I updated an object that contains 9 files 1000 times. Each update mutated the content of a single existing file. The size of the initial object content was 48KB. v1 inventory bytes: 3,461 v1001 inventory bytes: 1,874,152 Final repository size: 908MB
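For intuition about where those bytes go: every version block in the inventory repeats the object's full state, one 128-character sha512 hex digest per logical file. Below is a rough toy model in Python (not the spec's exact inventory schema, and it omits per-version metadata) that lands in the same ballpark as the figures above:

```python
import hashlib
import json

NUM_FILES = 9  # as in the test described above

def digest(data: bytes) -> str:
    return hashlib.sha512(data).hexdigest()

inventory = {"manifest": {}, "versions": {}}
for v in range(1, 1002):
    state = {}
    for f in range(NUM_FILES):
        # file 0 is mutated in every version; the other 8 never change
        content = f"file-{f}-v{v if f == 0 else 1}".encode()
        d = digest(content)
        inventory["manifest"].setdefault(d, [f"v{v}/content/file-{f}"])
        state[d] = [f"file-{f}"]
    inventory["versions"][f"v{v}"] = {"state": state}
    if v in (1, 1001):
        size = len(json.dumps(inventory, indent=2))
        print(f"v{v} inventory: ~{size:,} bytes")  # ~3 KB, then ~1.6 MB
```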

From another test:

At v1 the entire repository was 352MB, and at v26 it'd grown by 426MB to 778MB. The file that I added in each new version was only 4 bytes. This growth is entirely due to the inventory size.

The storing of the inventory file in every version directory contributes significantly to the overall size of the OCFL object on disk. Some questions:

  1. Is OCFL meant to support the use cases of many files in one object, and many versions in the same object? Or are these scenarios not recommended for OCFL?
  2. Is the recommended solution for reducing disk usage to just not store the inventory.json file in the versions, since the spec makes that a "SHOULD"?
  3. Some other solutions could be to 1) write inventory.json files that are just diffs to the previous version in the version directories or 2) compress the inventory.json file in the version directories. Would either of those be possibilities, or are there any other solutions for reducing disk space usage?
jrochkind commented 5 years ago

My understanding is that the SHA512 checksums are a large part of the bytes in those inventory files.

Storing SHA256 instead could be a ~50% savings in inventory bytes. I wonder if SHA256 should be recommended instead of SHA512. (This would be a bytesize savings on any OCFL repo with many files, regardless of whether many versions or just many independent objects).

SHA256 plus only one copy of inventory file (instead of an additional copy in the version dirs) could then be a ~75% reduction in bytes. (Is still going to be uncomfortably many bytes, or would that be cool? Not sure).
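For scale, the difference is visible in the hex digest lengths alone; since each file's digest appears once in the manifest and once per version state block, halving the digest roughly halves an inventory whose size digests dominate. A quick check:

```python
import hashlib

data = b"example content"
print(len(hashlib.sha512(data).hexdigest()))  # 128 hex characters
print(len(hashlib.sha256(data).hexdigest()))  # 64 hex characters
```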

Both of these things would, I think, be allowed by the spec now (if one ignores one or more SHOULDs), but possibly:

  1. the spec should be changed to recommend them to make things smoother for many-versions;

  2. OR clear implementation guidance for "if you are going to have many versions THEN we recommend different SHOULDs" could be provided. (This would also make it clear that clients should support these options if they want to support many-versions uses.)

  3. OR the spec could make clear that OCFL is not recommended for voluminous use of many (more than you can count on your fingers?) versions.

ahankinson commented 5 years ago
  1. Is OCFL meant to support the use cases of many files in one object, and many versions in the same object? Or are these scenarios not recommended for OCFL?

Yes -- this is the core use case for OCFL: Many files, with one or more versions.

  2. Is the recommended solution for reducing disk usage to just not store the inventory.json file in the versions, since the spec makes that a "SHOULD"?

The 'SHOULD' was put there, as far as I remember, to help facilitate migration from older folder structure layouts. It was 'MUST' for several iterations of the spec, but after feedback the editors decided to change it. It is strongly suggested that copies of the inventory file be maintained in the version directories.

The inventory.json file in each version is the canonical record of the files in that version. The copy in the root of the object is to facilitate operations on the object in its most recent known version, and reduce the need for scanning the version directories to determine the most recent version of the object.

  3. Some other solutions could be to 1) write inventory.json files that are just diffs to the previous version in the version directories or 2) compress the inventory.json file in the version directories. Would either of those be possibilities, or are there any other solutions for reducing disk space usage?

We had tossed around the idea of storing version directories themselves as ZIP (or Gzip) archives, but felt we needed to get the basics of the storage mechanism right first before looking at optimizations. This may come back in v2 of the spec. Watch this space.

There is no requirement that the JSON contain spaces or line-endings. In my experience significant savings, especially for large JSON files, can be made by removing whitespace characters. It's not a particularly clever method, but a quick check using one of our IIIF manifests shows 2.3 MB formatted versus 1.1 MB raw, a ~52% reduction.
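Measuring that saving on any inventory is straightforward; a Python sketch (the local inventory.json path is hypothetical):

```python
import json

with open("inventory.json") as f:  # hypothetical local inventory
    inv = json.load(f)

pretty = json.dumps(inv, indent=2)
compact = json.dumps(inv, separators=(",", ":"))  # no spaces or newlines
print(f"{len(pretty):,} bytes pretty, {len(compact):,} bytes compact")
```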

  3. OR the spec could make clear that OCFL is not recommended for voluminous use of many (more than you can count on your fingers?) versions.

The use case OCFL is designed to address is not as part of a workflow system; it is for 'objects at rest'. While there certainly are cases where an object goes through multiple versions in the process of being accessioned, the intended use of OCFL is to store a 'settled' version of the object. Changes over the object's lifespan are, of course, common, but in gathering our use cases we did not hear about objects with thousands of versions. Off the top of my head, I believe the most changed objects that Stanford stores in Moab have roughly 20-30 versions. Most are one to five. @julianmorley could probably give you a more accurate answer.

I would be interested in seeing the inventory.json that was produced. Are you storing the checksum for each version, or are you making use of the forward-versioning capabilities in OCFL to store the differences?

The spec permits the use of SHA256 as the digest algorithm if the difference in disk space usage is critical for your use case: https://ocfl.io/draft/spec/#digests

The reasons for the recommendation of SHA512 have been explored quite extensively in GitHub issues, use cases, and community calls. See: #7 #8 #21 #290, https://groups.google.com/d/msg/ocfl-community/TU2zNYex0ao/w-hsJJWQBgAJ, https://github.com/OCFL/spec/wiki/2018.11.07-Editors-Meeting.

pwinckles commented 5 years ago

Attached is the inventory file with a lot of versions. I had to rename it to upload it. The other one is too large to be uploaded directly without jumping through additional hoops, but let me know if you'd like to see that one too.

large-version-inventory.txt

Stripping white space is a good idea and will definitely save space. I was hoping not to have to do that because it makes the files so hard to read, but it seems like I'll need to at least make that option toggleable. I'll rerun the test tomorrow without the whitespace.

jrochkind commented 5 years ago

Thanks for the response @ahankinson

The use case OCFL is designed to address is not as part of a workflow system, it is for 'objects at rest'.

I think it would be valuable for the specs and/or implementation guidance to be clear on this; otherwise some people are going to try to use it for cases it wasn't designed for and find themselves in pain. Arguably this has already started, if you consider Fedora's plan to use OCFL to be "as part of a workflow system", which may depend on how users of Fedora use Fedora, and also what you mean by "workflow system" vs "objects at rest".

So such guidance could say more about that too. Obviously you intend to support versions, since they are such a big part of the spec; so maybe you mean "we expect most objects to have only one version, those that have more than one to have only a few, and versions to be made infrequently"? Not sure. Especially when put together with your statement that "many files in one object and many versions in the same object" IS the core use case for OCFL.

This is relevant, for instance, because I have observed discussions on the slack with fedora implementers trying to decide how to use OCFL, and guidance on the use cases OCFL means to support would probably help some of those discussions come to resolution.

[edit: deleted stuff where I was confused about what was going on with inventories in root vs version directories, I don't understand it enough to speak on it!]

pwinckles commented 5 years ago

For reference, I ran the same two tests again to try sha512 without pretty json, sha256 with pretty json, and sha256 without pretty json. Here are the results:

https://docs.google.com/spreadsheets/d/1dffxLQoLP26dSEx39hlhXCBw5VejIt3Ig4zliPHoUpQ/edit?usp=sharing

bcail commented 5 years ago

@pwinckles thanks for doing those numbers - that's great.

I do see section 2.1.5 in the implementation notes, where it recommends packaging many small files together: https://ocfl.io/0.3/implementation-notes/#objects-with-many-small-files.

@ahankinson and other editors - do you think it would be helpful to add a note about OCFL not being for a "workflow system", as @jrochkind suggested? It might be helpful to explicitly note that 20-30 versions is expected, rather than 2000-3000? Or, maybe you could add a note that if an object gets too many versions, the many inventory.json files will take a lot of disk space?

ahankinson commented 5 years ago

I think having a conversation about the intended uses in the community call would be useful prior to adding a specific note about not being part of a 'workflow' system. The challenge I see is that adding a new version is a valid part of a curation workflow, so overly broad language might lead to confusion in the other direction, where we might be seen to discourage the creation of new versions. Likewise, I don't think it would be accurate to say that OCFL cannot be part of a workflow system, but rather that the decision in the application about when to write the output of the workflow should reflect some idea of the settled state of the object.

The most appropriate analogue I can think of to help clarify what I mean is that of a git commit. You wouldn't create a new commit after every keystroke or line, since that would produce far too many commits. In idealized use, a git commit generally represents the settled state of an application's source. Most programmers have an instinctive sense of the difference, but what is and is not a 'committable' change varies widely, from a single character to several hundred lines.

Likewise, creating an OCFL version when an object passes from one stage to another in an accession workflow, for virus scanning, metadata checking, metadata double-checking, file format adjustments, etc. likely does not represent the 'settled state' of an object, but it is entirely possible that a routine operation (such as adding a missing file or changing a mistake in the metadata) can lead to a new version being made.

The challenge will be how to communicate that in a way that encourages implementers to see versioning as a natural part of long-term preservation, while also ensuring that implementers do not shy away from OCFL because the requirements of the spec introduce inefficiencies for storing the specifics of their application state.

birkland commented 5 years ago

I agree about a conversation about intended uses in the community. Of late, there has been increased input from members of Fedora's community (which has been nice to see!), which has focused scrutiny on Fedora's relationship with OCFL. Fedora straddles a line between access, management, and preservation.

The initial proposed relationship between Fedora and OCFL (the one proposed to our leaders group, and accepted) was one whereby the act of creating an immutable version of an object (as defined in the Fedora API) resulted in publication to OCFL. "Unversioned" content that can be updated and mutated at will would be persisted elsewhere until a request to create a version came along. In this scenario, you can see Fedora as supporting a workflow for updating/managing objects, then shipping them off to preservation at defined points.

The problem with the above scenario is that some content in Fedora is "in ocfl", and some isn't. There are some users of Fedora who never create explicit immutable versions of objects. The idea that Fedora isn't really "preserving" anything without explicit action (and the idea that some fedora content may perhaps unknowingly be absent from OCFL) was hard to explain, confusing, and discomforting to some. An alternate solution where Fedora always writes every change to OCFL seemed to get broad approval. The analogy to git is apt here, whereby each mutation of the repository through the API (either on its own, or bundled up with others in a 'transaction') corresponds to a "commit", and the act of creating an explicit version in Fedora is analogous to the act of tagging. The problem here seems to be that it is easy to use the repository in ways that don't necessarily align with the intended purpose of OCFL. On the technical side, this may possibly lead to the ballooning object size problem for some usage patterns.

Maybe @awoods or @rosy1280 can put on your Fedora hats and comment? Do the OCFL editors see either scenario as particularly problematic?

Also, this part caused considerable confusion among the Fedora developers:

The inventory.json file in each version is the canonical record of the files in that version. The copy in the root of the object is to facilitate operations on the object in its most recent known version, and reduce the need for scanning the version directories to determine the most recent version of the object.

The spec reads the opposite way; pretty much everybody took the inventory.json in the object root to be the canonical one, since it's the only one required to exist. Exposing a configuration option to disable the redundant inventory files in each version directory would be a workaround for this issue (allowing a local repository manager to decide if disregarding the SHOULD recommendation in the spec is an acceptable tradeoff). The word "canonical" is a strong word, however. If it is the case that the version directory inventories are truly canonical, then it really ought to be a MUST in the spec. The notion that a legacy system can ignore them and still be OCFL compliant implies they must not be essential to a functioning OCFL-based preservation system.

ahankinson commented 5 years ago

I'll let @awoods or @rosy1280 weigh in about the first bit.

For the second, I can completely see how that would be confusing, and I think that's on us to sort out how we make that clearer. The intention with the use of SHOULD is that it is highly recommended, but for legacy reasons we relaxed it from a MUST. I can't find the exact issue at the moment, but this one mentions the decision: https://github.com/OCFL/spec/issues/293

The reason, however, for having a redundant inventory file is so that no single act of changing the object can result in an unreadable object. We assume that the object is at its highest risk of loss when it is being changed. If we assume that the operation producing vN did not complete successfully and left inventory.json in an invalid state, then it is simply a matter of reverting to vN - 1 to retrieve the last-known good state.

Without this redundancy it is entirely possible for an OCFL object to have an irreconcilable version history and latest state, since it will depend on just a single inventory.json file that may, or may not, be valid, and this invalidity may not be discovered until long after the operation has completed (hours, days, weeks, months, or years).
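A minimal sketch of that recovery logic, assuming the standard inventory sidecar layout (function names and the fallback policy are illustrative, not from any particular OCFL library):

```python
import hashlib
import re
from pathlib import Path
from typing import Optional

def inventory_is_valid(inv_path: Path) -> bool:
    """Check an inventory file against its sha512 sidecar."""
    sidecar = inv_path.with_name("inventory.json.sha512")
    if not inv_path.exists() or not sidecar.exists():
        return False
    expected = sidecar.read_text().split()[0]
    return hashlib.sha512(inv_path.read_bytes()).hexdigest() == expected

def latest_good_inventory(object_root: Path) -> Optional[Path]:
    """Prefer the root inventory; fall back through vN, vN-1, ... if invalid."""
    version_dirs = sorted(
        (d for d in object_root.iterdir() if re.fullmatch(r"v\d+", d.name)),
        key=lambda d: int(d.name[1:]),
        reverse=True,
    )
    candidates = [object_root / "inventory.json"] + [
        d / "inventory.json" for d in version_dirs
    ]
    return next((p for p in candidates if inventory_is_valid(p)), None)
```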

rosy1280 commented 5 years ago

First, apologies for being late to this thread and many thanks to @ahankinson for being on top of things. His use of Git commits to explain versioning in OCFL is useful, and I agree that providing specifics will create barriers to adoption. Quite frankly, explicitly defining what constitutes a version is like asking for the definition of collection: everyone will give you a different answer.

Next, I'd like to better understand what happened during the course of the performance tests @pwinckles did. For the many versions test, it would be good to understand how large the updated file was at v1 and whether or not it grew at each version and by how much. For both tests can you tell us how the inventory.json file grew between versions?

Finally, @birkland you mention a few things that I think I need clarity on. You say that some Fedora content will be in OCFL and some won't. Can you help me understand what content won't be in OCFL? You also mentioned that Fedora might be used in a way that doesn't require users to commit an immutable version. Do you have examples of those use cases? I have to wonder, if someone isn't using Fedora for preservation or management purposes, does it matter that an object isn't in OCFL?

pwinckles commented 5 years ago

In the many versions test, the contents of a single file (the same file each time) were overwritten in each version. This file only contained the current version number, so it was about as small as it could be without being empty (naturally slightly larger for higher version numbers). The object itself contained 9 files in every version, 8 of which never changed. I attached the final inventory file for it earlier.

For the many files test, it was more or less the same setup except instead of overwriting a single file in each version, a new, similarly tiny, file was added in each version.

For both tests can you tell us how the inventory.json file grew between versions?

Are you asking for the rate of change between versions? Or something else?

The tests I ran were ad hoc experiments, more back-of-the-envelope calculations to see what happens to an object when it is versioned a lot with minimal changes.

Currently, Fedora is planning on putting all content in OCFL. For a period of time, maintaining unversioned/staged content outside of OCFL and versioned content inside OCFL was discussed, but that is not the approach presently being pursued.

Fedora allows users to update objects without versioning them. They can choose to stamp a version on the objects or just leave them. In the approach where everything is always stored in OCFL, every change to an object, regardless of whether or not it was versioned in Fedora, would be versioned in OCFL. The question then became, if a user makes a large number of updates (or a small number of updates to an object with a lot of files) to an object, what would happen to the OCFL object? That is why I ran the experiments that spawned this issue.

rosy1280 commented 5 years ago

@pwinckles can you tell us the size (in bytes) of the inventory.json file at v1, v2, v3, and so on?

Thank you for the explanation. I still have to question whether or not it matters if a user chooses not to commit a version to Fedora. If they aren't choosing to commit a version, then it sounds like they aren't using Fedora for preservation or management, at which point does it matter that it isn't in OCFL?

That being said, I would like to understand better how Hyrax and Islandora commit things to Fedora before identifying which approach makes the most sense. @rotated8 can you tell me whether Hyrax commits a new version to Fedora upon each update, or whether that is something you configure in Fedora that Hyrax doesn't care about?

pwinckles commented 5 years ago

Yes, I can get you the exact numbers when I'm at my work computer on Monday. But, if I remember correctly, the increase is linear, so I would expect the version to version increase for the many versions test to be around 1,870 bytes (sha512 pretty json) and the version to version increase of the many files test to be around 1,118,052 bytes (sha512 pretty json).
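(Consistency check on that increment: (1,874,152 - 3,461) / 1,000 ≈ 1,870 bytes per version, which matches the v1 and v1001 inventory sizes reported earlier in this thread.)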

birkland commented 5 years ago

@rosy1280 I can't comment about Islandora 7, but I do know that Islandora 8 does not currently use the versioning capabilities of Fedora 5, though it's on the radar to figure that out. I'd be surprised if Hyrax does, but I honestly know nothing about it. There hasn't been as much of a broad culture of versioning in Fedora 4 and 5 as there was in Fedora 3, so I suspect it's rare among Fedora 4 or 5 users.

The notion of Fedora 6 committing everything to OCFL emerged in the past few weeks, which is why issues related to lots of versions and the "fitness for purpose" of OCFL are only now emerging in community discussion around Fedora.

In the original fcrepo6 proposal, the act of creating a version was the point at which "at rest" content was created and shipped to OCFL. The mutable unversioned content would not have been in OCFL (but would have been durable and rebuildable). To answer your question directly, those who are not using Fedora strictly for preservation of finished "at rest" content do not strictly need OCFL (but would likely be fine with it as long as there is no serious technical drawback to doing so - we're assessing whether disk space or inventory bloat is a serious technical drawback in practice). There are, however, members of the community who feel that the active/unversioned content that has historically been supported by Fedora should specifically not be in OCFL.

pwinckles commented 4 years ago

@rosy1280, I updated the spreadsheet with the data you requested. It's on sheet 2.

rotated8 commented 4 years ago

@rosy1280 To the best of my knowledge (and @no-reply will understand this better than I do), I do not believe Hyrax interacts with Fedora's versioning at the object level, although files may be versioned, if a user explicitly chooses to create one. To create a new version, you are required to upload a new file, and no mechanism exists for creating a version for metadata changes alone.

I will defer to @no-reply for a better understanding.

jrochkind commented 4 years ago

I am pretty sure that hyrax does not use Fedora "versioning" for metadata at all right now. If there is any fedora versioning (keeping track of changes/past versions) of metadata at all going on in hyrax, it is not exposed in any hyrax UI as far as I am aware. There is no way to see or revert to past states of metadata offered in hyrax.

If we imagined hyrax using a fedora that used OCFL as a back end, such that all fedora updates were written to the OCFL store, that would necessarily be a change, as there is no way to "write to OCFL" without "creating a version", so every persisted change to metadata would be "a new version in OCFL", where at present every persisted change to metadata does not, I believe, result in a "fedora version", nor is there a way to "undo" using fedora.

So, if we're worried about how many versions "typical" use would create -- under such a scenario, "how many versions would an object end up with" would be roughly answered by "How many times during the life of an object will/did someone make an edit to metadata and press 'save'" (Or a programmatic/batch 'save' would also count of course). I'm not sure how/if that information is available for existing hyrax installations.

Of course, if hyrax did not send an update to fedora every time a change to metadata had to be persisted, but only sent an update to fedora for some "preservation version should be created" kind of event, and kept its "working" persistent data somewhere else (fedora is not its main persistence store, but just a tool used for making a preservation copy at certain manual or automatically defined points), that would be another story, but it would require some changes to how hyrax approaches things. It would mean hyrax would need some "persistence store" in addition to fedora, even if it were using fedora.

rosy1280 commented 4 years ago

So if I understand @birkland 's example: When an Islandora user uploads a new file, it replaces the file currently in Fedora -- it does not make a call to Fedora to say "version this new file I'm uploading" it just overwrites the file. @birkland how difficult would it be for Islandora to change that? Or better yet would it even need to change that?

Contrast that with what @rotated8 (and @no-reply just said in a meeting I was in with him) described, which is that every time a new file is uploaded in Hyrax, Fedora creates a new version of the file. Hyrax uses Fedora's RDF to store metadata. @birkland above you mentioned that some components of an object would not be in OCFL. Is Fedora's RDF something that could change without it being put "in OCFL"? Is Fedora's RDF preserved when an object is put "in OCFL"? Hyrax does have the concept of workflows, so I also wonder if it's possible for Hyrax to solve this problem itself (assuming it's deemed necessary) so that a step in the workflow is "I edited this metadata, now make an immutable version in Fedora." Perhaps that's also a question for @no-reply.

no-reply commented 4 years ago

@rotated8's summary seems right to me: Hyrax does not create object-level versions, but does create new file versions normally when editing files. Creating a file version in Fedora is a routine side effect of editing a file (whether through the UI or through provided internal APIs).

The versions are exposed in the UI through the Edit File interface; the versions tab is linked directly from file info on the main Work page.

[screenshots: the versions tab in the Edit File interface, linked from the file info on the Work page]

This provides an easy restore. Restoring creates a new version, identical to the selected previous version.

[screenshot: restoring a previous version creates a new, identical version]

rosy1280 commented 4 years ago

@pwinckles Thanks for providing that. Because I'm visual I turned it into a spreadsheet with a chart! https://docs.google.com/spreadsheets/d/1xpbQfgDSIFXXxOw-mKhiRT0MTcnmH8t98lqSBA8U2yQ/edit#gid=350102796

It looks like the rate of change is linear, so the best way to stop the growth is to start at the beginning. I wonder if that's something the editors can look into. (and maybe should be a separate ticket from the rest of what is happening on this thread...).

no-reply commented 4 years ago

Is Fedora's RDF something that could change without it being put "in OCFL"? Is Fedora's RDF preserved when an object is put "in OCFL"? Hyrax does have the concept of workflows so I also wonder if its possible for Hyrax to solve this problem itself (assuming its deemed necessary) so that a step in the workflow is "I edited this metadata, now make an immutable version in Fedora." Perhaps that's also a question for @no-reply.

This seems like a good question to me. The best way that comes to mind for this to be handled on the Hyrax side is to serialize the metadata and store it as a file. I say "best" but this leaves a lot to be desired and is probably better called the "only/least worst" way. As of now, there's no concept of versioned metadata updates, or of Object versions, in Hyrax.

pwinckles commented 4 years ago

It looks like the rate of change is linear, so the best way to stop the growth is to start at the beginning. I wonder if that's something the editors can look into. (and maybe should be a separate ticket from the rest of what is happening on this thread...).

My understanding is that the original point of this thread is to discuss this problem, and somewhere along the line it devolved into implementation details for Fedora and Fedora-based software.

The fact of the matter is that there are serious storage implications for using OCFL to store objects with numerous versions. If this is not something that can/will be addressed in the OCFL spec, then Fedora needs to evaluate how and to what extent it should use OCFL, with the understanding that some users may generate a large number of versions. An extreme example of a real-life Fedora 3 object that we looked at today had only 6 files but over 35,000 versions. A back-of-the-envelope estimate of the space required to store that object's cumulative inventory files, were it an OCFL object, came to over 700GB (sha512, not pretty printed). To me, that doesn't seem reasonable.
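That estimate is reproducible with a quick calculation. If each version adds roughly a constant number of bytes to the inventory, and a full inventory copy sits in every version directory, cumulative inventory storage grows quadratically (the per-version increment below is an assumption, sized for a 6-file object with compact sha512 JSON):

```python
# sum of v * delta for v in 1..N, i.e. roughly delta * N^2 / 2
N = 35_000
delta = 1_150  # assumed bytes added to the inventory per version
total = delta * N * (N + 1) // 2
print(f"~{total / 10**9:,.0f} GB")  # ~704 GB of cumulative inventories
```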

rosy1280 commented 4 years ago

@pwinckles thank you for that feedback. Are you telling me that you have an example of an object whose files changed significantly 35,000 times?

pwinckles commented 4 years ago

I cannot speak to the significance of the changes (it's not my object). All I am saying is that several of the Fedora 3 repositories that we examined contained outlier objects with large numbers of versions, which is not something that OCFL handles gracefully. I understand if this is not a usecase that OCFL was designed to support, but it circles back to @bcail's questions in the original post and @birkland's question of "fitness of purpose."

neilsjefferies commented 4 years ago

I would like to see actual use cases that require an object to be versioned that much, rather than it perhaps being the result of suboptimal coding. It would seem that such objects-in-motion should reside in the OCFL workspace area (which is defined), where the content and inventory can be updated in-place in an OCFL compliant structure, but only migrating to the persistent OCFL structure when a version needs long term retention.

pwinckles commented 4 years ago

I’ll defer to someone else to provide use cases, but it would seem to me that the two obvious cases for large numbers of versions are 1) as part of a “workflow system” and 2) updates to a higher-level object that contains references to a numerous and expanding array of child objects. A third possible use case may be certain types of metadata updates.

Let’s consider the deposit directory though. For reference, the spec has the following to say about it:

An OCFL Storage Root MAY contain a directory named deposit, which MAY be empty. Implementations MAY use this directory for assembling new or updated content. Clients performing any other operation, or validating a storage root, MUST ignore the deposit directory if present.

My reading of that is that the deposit directory is intended to be used to construct new object versions immediately prior to moving them to the object root. If the deposit directory is intended to support long-lived version creation (in Fedora’s case it could be indefinitely long), then I have a slew of implementation questions but here are the ones that I think are most pertinent to the spec.

  1. Can versions live in the deposit directory indefinitely? If not, how long is too long?
  2. If I request the HEAD version of an object that has a non-finalized version in the deposit directory, what is returned?
  3. Assuming that read access is supported for non-finalized versions, doesn’t this mean that clients first need to look in the deposit directory for the most up-to-date inventory before looking in the object root?
  4. Are versions under the deposit directory considered “preserved?” That is, is there an expectation that the content will not be lost unless the user explicitly asks the client to delete it?
  5. How are updates to versions under the deposit directory staged to avoid concurrent updates corrupting the object? Updating a version in place seems fraught with potential problems.

We talked about the deposit directory at some length on Fedora tech calls. I have no problem using it to stage versions before they’re finalized. However, from Fedora’s perspective unversioned content must be durable and long-lived. It feels wrong to me to build an OCFL library that extensively uses the deposit directory to maintain staged content indefinitely without the spec sanctioning this interpretation. Doing so would essentially make the repository unusable by any other OCFL implementation.

neilsjefferies commented 4 years ago

Good questions!

  1. OCFL doesn't care how long things sit in the deposit directory, but see Q4.
  2. In the spirit of OCFL, processes that aren't involved with the deposit workflow should only really see the latest finalized version. However, OCFL does not define overlying APIs or behaviours per se.
  3. Yes, if you want to bend the rules. To be honest OCFL is not suited to frequent versioning but most of the use cases just need redesign to be more object aware and less database-y in their assumptions. For example, objects with many components are better handled by storing the components and aggregations as distinct OCFL objects with the aggregation itself as a data file rather than relying on the inventory. This is also more in keeping with PCDM and massively shrinks the inventory overhead (see the sketch after this list).
  4. No, it is not. It won't necessarily vanish, but nor can it be validated, since state cannot be guaranteed. You are right that indefinite use is an OCFL no-no!
  5. File locking and concurrency control is not that hard; per-object transaction IDs solve most of the issues (i.e. no actual concurrency, operations are serialised by some mechanism). Under the hood all storage systems do something similar. Some such mechanism is needed regardless of the length of the object creation/update operation.
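To make point 3 concrete, here is an illustrative sketch (identifiers are hypothetical) of the "aggregation as a data file" pattern: the collection is its own OCFL object whose content is one small membership file, so a membership change versions that tiny file instead of enlarging a monolithic object's inventory.

```python
import json

# v1 of the collection object's single data file
members = {"collection": "example:coll1", "members": ["example:obj1"]}

# adding a member yields v2 of one tiny file; the member objects
# themselves are separate OCFL objects and are untouched
members["members"].append("example:obj2")
print(len(json.dumps(members)))  # tens of bytes, not a bloated inventory
```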

At the moment, I think the Fedora/Hyrax/Islandora community still has to reach consensus as to what they want Fedora to be and do. Fedora content has to-date not actually been particularly durable, and durability involves some trade-offs.

jrochkind commented 4 years ago

I think this is a side issue, but since you all brought it up:

File locking and concurrency control is not that hard; per-object transaction IDs solve most of the issues (i.e. no actual concurrency, operations are serialised by some mechanism). Under the hood all storage systems do something similar. Some such mechanism is needed regardless of the length of the object creation/update operation.

That has not been my conclusion at all in trying to approach it -- even without the "deposit" directory, just concurrency control in adding versions to an OCFL object. But I'm not sure what you mean by "per-object transaction IDs". We were having extensive discussions of concurrency in Slack over the past few months, and I don't think that concept came up.

Concurrency issues are especially difficult with S3 as storage, which does not offer any "atomic directory move" operation, does not offer any file-locks (like an ordinary file system does), and in fact only guarantees "eventual consistency" on any add/update operations.

In my attempts at working it out, it seems quite difficult to deal with concurrency issues, and if you are using S3 it (I think) requires an external system gatekeeping (and probably keeping a copy of at least some parts of the S3 inventory/manifests, because of S3 "eventual consistency"). But it is not obvious to me that there is a simple way to handle it even on a local file system (though note this may exclude NFS as well).
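For reference, a minimal sketch of the kind of local-filesystem advisory lock under discussion (illustrative, not from any OCFL library). It relies on O_CREAT|O_EXCL being atomic on a local POSIX filesystem, a primitive that S3 did not offer at the time of this discussion, which is part of why the object-store case is hard:

```python
import os
from contextlib import contextmanager

@contextmanager
def object_lock(object_root: str):
    lock_path = os.path.join(object_root, ".lock")
    # atomic create; raises FileExistsError if another writer holds the lock
    fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    try:
        yield
    finally:
        os.close(fd)
        os.remove(lock_path)
```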

But I actually don't understand what is meant by "per-object transaction IDs", so there may be an approach I don't know. An example (say, a pseudocode example) would be very helpful, so we're all talking about the same thing, and maybe it's simpler than I think. (Further discussion should probably be somewhere other than this ticket.)

whikloj commented 4 years ago

@neilsjefferies I am wondering about something you said

To be honest OCFL is not suited to frequent versioning but most of the use cases just need redesign to be more object aware and less database-y in their assumptions. For example, objects with many components are better handled by storing the components and aggregations as distinct OCFL objects with the aggregation itself as a data file rather than relying on the inventory. This is also more in keeping with PCDM and massively shrinks the inventory overhead.

This seems to indicate that OCFL is designed with a particular data model in mind and perhaps that could be described and would better help us understand how to use it.

For instance, I would assume that for a book object I might contain the book "object" and page "objects" together in a single OCFL object to ensure they stay together to take advantage of the human readable structure.

But it sounds like you are saying it would be better to have each page separate and some file in the book that ties them together.

bbpennel commented 4 years ago

For what it's worth, our object with 35k modifications is a collection object where the changes to it are primarily membership changes. The collection has been actively managed by staff for roughly 5 years, with additions and rearrangement. The modifications are not to versioned datastreams in this case. My understanding is that Fedora intends to take the approach of serializing RDF to OCFL for rebuildability purposes, so clients would have to be written with awareness that membership changes need to be grouped up during bulk ingests (assuming a parent-to-child relationship). That may not always be practical if, for instance, one or two additions were made per day over a long period.

neilsjefferies commented 4 years ago

@jrochkind The problem with S3 is that it isn't a filesystem so paths are really just string metadata stored with each object. A directory move is a find/replace metadata operation on all the objects concerned which is understandably not atomic. Similarly, objects can basically really only be created or deleted and have no intrinsic path (except that based on object ID) so most file-like operations have to be broken down into a combination of create/destroy/metadata update operations that renders them rather non-atomic too. While we haven't looked in detail at object store compatibility (that's for V2) consider the following...

This abstraction between on-storage path and logical path is also used in the OCFL inventory. Note that file renames and path moves inside an object involve only inventory updates and not on-storage updates. The inventory differs from the S3 file-path implementation by grouping path information for an object so that changes can be rendered (more) atomic as a single inventory update. The corresponding checksum file update is the chink in the true atomicity of the operation but it is at least well defined and bounded.

The top level inventory used to define overall object state also acts as a locking mechanism, if the order of operations in the Implementation Notes is adopted and it is updated last, after all other changes have stabilised. The object state for reading is always consistent (bar the checksum caveat noted above).

However, this works only because we have the notion of a discrete version.
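As a sketch of that ordering (the staging structure and helper name are illustrative): content and the version's own inventory land first, and replacing the root inventory is the final step that publishes the version, so a reader arriving mid-update still resolves a consistent prior state.

```python
import shutil
from pathlib import Path

def commit_version(object_root: Path, staged: Path, version: str) -> None:
    version_dir = object_root / version
    version_dir.mkdir()
    # 1. move the fully assembled content into place
    shutil.move(staged / "content", version_dir / "content")
    # 2. write the version's own inventory and sidecar
    shutil.copy(staged / "inventory.json", version_dir / "inventory.json")
    shutil.copy(staged / "inventory.json.sha512",
                version_dir / "inventory.json.sha512")
    # 3. replace the root inventory last; this step publishes the version
    shutil.copy(staged / "inventory.json", object_root / "inventory.json")
    shutil.copy(staged / "inventory.json.sha512",
                object_root / "inventory.json.sha512")
```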

The per-object transaction ID simply acts like an eTag to prevent multiple processes updating an object once one process has started operations, so it is actually not relevant here.

Yes, most of these discussions are about external system/application requirements - these are outside the scope of OCFL.

neilsjefferies commented 4 years ago

@whikloj Actually, OCFL is indifferent to your data model but you should design your models to suit your application profile. A book is unlikely to be updated much so you are fine to have it as a single object. If you are going to be updating many of the bits of an object and want to track/preserve the versions separately then it makes sense to disaggregate them.

neilsjefferies commented 4 years ago

@bbpennel I suspected that was the case - it really reinforces the case for separate collection objects. Is there actually a good case for keeping all versions of the collection list rather than using OCFL purge-and-recreate, possibly with a truncated version history? Given that the object is (I assume) a single RDF file that is replaced every time.

pwinckles commented 4 years ago

@neilsjefferies can you clarify what you mean by "OCFL purge-and-recreate" and "truncated version history"? The only similar concept that I can find in the spec and implementation notes is in regards to purging files from objects. Is this what you're referring to?

neilsjefferies commented 4 years ago

@pwinckles Yes, just because OCFL supports versioning doesn't mean you have to. OCFL defines objects at rest but not how they get there - the Implementation Notes are not mandatory. So, not recommended, but I can see use cases for unversioned objects that are always at V1 and just replaced when an update occurs.

pwinckles commented 4 years ago

Interesting... I had never considered using OCFL in that fashion. In my mind, purging was only ever intended to erase files that absolutely needed to be removed from the filesystem.

Based on what you're saying, it would seem like there should be no qualms about supporting a squash-like operation in order to clean up or trim versions, so long as it resulted in a new object. Is that correct?

neilsjefferies commented 4 years ago

...another approach is to have objects assert their membership of the collection themselves but not vice versa. The collection object effectively defines an index to be created that contains all the objects making the membership assertion but does not need to be versioned when an object is added. RDF inference is kind of made for this sort of optimisation.

jrochkind commented 4 years ago

@neilsjefferies Yes, I understand how S3 works, more or less, at that level.

Are you suggesting S3 is not suitable for OCFL, only things more like "real" file systems are?

There is a lot of interest in OCFL on S3, so there would be a lot of disappointed people if that's true.

But if that's "waiting until v2", then it would probably be good for Fedora implementers, for instance, to know that they probably ought not try to do OCFL on S3.

My experience with S3 is that enough things are different that a very good design for a local file system may not work at all and may require major changes for S3, so I think it is somewhat dangerous to assume that if you avoid thinking about S3 at all in "v1", you will be able to accommodate it in "v2" without major changes.

This conversation is pulling out all sorts of differing assumptions; I think OCFL is being designed for a more limited set of scenarios/use cases than some onlookers (not directly involved in OCFL spec-making but hoping to use it) may be assuming. This may not be a problem exactly, and the OCFL design process may be going exactly according to plan, but expectations in the surrounding community should probably be set properly.

Yes, most of these discussion are about external system/application requirements - these are outside the scope of OCFL.

I would suggest that OCFL's fitness-for-purpose for use cases ought to be within the scope of OCFL spec construction discussion, and actual implementation considerations ought to be in the scope for discussion. It is fair to say that some use cases are not meant to be supported or not considered for support; but surely you are thinking of some use cases that are meant to be supported, and surely you are considering them when designing the spec, to make sure they, well, work.

It's fine if some scenarios/use cases (say, S3) are outside the purview -- so long as enough use cases remain in scope that you still have a community actually wanting to use OCFL!

But I think this discussion reveals some lack of clarity between what use cases OCFL editors feel are "outside of scope", and what potential OCFL adopters have been hoping to do with it. Perhaps the potential OCFL adopters have been wrong, and should not have been trying to do those things with it. But it is hard for them to figure out, that's why we're having this discussion.

Certainly external system/application requirements can't be entirely irrelevant to OCFL spec design, if you want to have a spec that actually works for external systems/applications. Are you trying to create a spec without feedback from implementers on what things are hard, or what things require special consideration? That would seem ill-advised.

This whole conversation is confusing me. Also:

OCFL is indifferent to your data model but you should design your models to suit your application profile.

OK, but you're not really talking about "application profile", when you say things like:

If you are going to be updating many of the bits of an object and want to track/preserve the versions separately then it makes sense to disaggregate them.

You are recommending fitting your data model to the particular trade-offs of OCFL. You might not have to disaggregate them if you were storing in something other than OCFL. But for storing in OCFL, you have particular considerations related to OCFL. You cannot, in fact, simply consider your "application profile" in isolation from the particular details of OCFL; in this conversation folks are trying to work out exactly those OCFL-specific considerations.

neilsjefferies commented 4 years ago

@jrochkind If you know how S3 works then why are you trying to use it for operations for which it is suboptimal? You appear to want S3 to behave more like a filesystem but appear to be having trouble with that. Amazon have a higher cost file-system storage offering so there are specific reasons for the limitations of S3. Eventual consistency models accept that some level of loss is tolerable in the event of concurrency conflicts as a tradeoff against not having to manage locking so an external mechanism is absolutely necessary!

The aspects of OCFL I indicated describe how it can be used to overcome some of these troublesome aspects of S3, so it has been thought about, and OCFL can form part of the required external mechanism. However, it means accepting the design decisions (defined versions, serialised operations, etc.) that are explained in fair detail in the Spec and the Implementation Notes. C'est la vie.

You are recommending fitting your data model to the particular trade-offs of OCFL.

Yes, exactly that. Just like I design different data models for relational, graph and document databases. Unsurprisingly, my suggestions for disaggregating collections or defining them by membership assertions are very much graph database types of model optimisation.

OCFL is not going to try to be all things to all people. It is about getting objects that change relatively slowly into stable storage and helping preserve them. It's not really about high transaction volumes except perhaps for object creation.

birkland commented 4 years ago

@rosy1280

So if I understand @birkland 's example: When an Islandora user uploads a new file, it replaces the file currently in Fedora -- it does not make a call to Fedora to say "version this new file I'm uploading" it just overwrites the file.

Ah, no. It just creates a new resource (in the LDP sense) as far as I understand. Translated to operations against OCFL, it'd be equivalent to creating a new object containing the new file. So it's not using the versioning API per se, but if you squint hard enough it almost looks like it could be close to a user-space versioning scheme.

above you mentioned that some components of an object would not be in OCFL. Is Fedora's RDF something that could change without it being put "in OCFL"? Is Fedora's RDF preserved when an object is put "in OCFL"?

Currently, no proposal makes a distinction between rdf/metadata and binaries as far as persistence to OCFL is concerned; they all would be serialized as files alike. It looks like both Hydra and Islandora share the characteristic where editing a metadata form in the UI can result in an update to Fedora. This (changes to RDF) is actually the vast majority of our updates as well.

Also,

So, not recommended, but I can see use cases for unversioned objects that are always at V1 and just replaced when an update occurs.

Ultimately, I think this thread is concerned with "in motion" objects, since that has always been part of Fedora's historical use cases. It uniquely bridges the worlds of workflows, access, and preservation, and the goal is to do that in the most rational manner achievable. The interest and excitement around OCFL adoption has been remarkable. Success depends on achieving a solution that would be recommended as far as usage of OCFL is concerned. Disk space usage is a technical speed bump, but not entirely the crux of the matter. I wonder what the best venue for hashing this out is. Clearly, approval from the OCFL editors is desired, despite the nature of the topic being almost entirely out of the narrow scope of the spec as it has been defined for 1.0.

no-reply commented 4 years ago

it seems like this is becoming a very different discussion from the issue topic, and i wonder if it wouldn't be possible to summarize the issues related to #367, as they stand now.

i would say that i also have concerns about the issue of update frequency as it relates to data models. this is certainly an issue in the Samvera community (very much along the lines of @whikloj's examples), as i'm sure folks are aware. the suggestion in https://github.com/OCFL/spec/issues/367#issuecomment-530118258 in particular resulted in a substantial development effort some years back.

i also think that OCFL's design goals should be clearly reflected in the use cases it's recommended for. should i consider OCFL in relation to my S3-compatible object store? what about my distributed block store? as workloads move behind containerized abstractions, i'm increasingly wondering whether i'll have any normal "filesystems" in a few years. i'm definitely very curious about where OCFL slots into my technology stack as my environment shifts, and my applications become less and less invested in data atomicity.

neilsjefferies commented 4 years ago

I'm saying S3 is suboptimal for Fedora. Trying to implement a system with workflow elements over a storage layer that intrinsically has no concept of updates, versioning and ACID-ity is going to be painful. OCFL over the top helps this situation since inventories provide locking and consistent state mechanisms but at the cost of requiring the notion of versioning. I think you will find that most approaches will end up having to invent something similar.

If anything, the gap between S3's design and Fedora's requirements is probably greater than that between Modeshape and Fedora.

I believe the Archipelago Repository platform has already implemented OCFL over an object store in some form - but not S3.

On 2019-09-11 4:22, Jonathan Rochkind wrote:

I am not the only one who was considering OCFL on S3. If it is the official recommendation of the OCFL editors/spec that OCFL probably won't work with S3 as it is "suboptimal" for it, if you make that clear you will save a lot of people a lot of trouble.


bcail commented 4 years ago

A couple notes about the SHA512 "SHOULD" in the spec:

  1. https://software.intel.com/en-us/articles/intel-sha-extensions - as I understand it, these can speed up SHA256 to the point where it would be faster than 512 on Intel processors. Maybe choosing 512 over 256 for performance reasons should be discussed more?
  2. https://blog.skullsecurity.org/2012/everything-you-need-to-know-about-hash-length-extension-attacks - are hash length extension attacks a concern for OCFL?
bcail commented 4 years ago

Possibilities for reducing the size of objects:

  1. Allow creating a mutable version in OCFL. This could be clearly marked in the object, and notes put in the spec that this loses preservation benefits. Eventually, if desired, the mutable version could be stamped/committed and become immutable, but it would be treated as a normal version even before becoming immutable. The spec could require the mutable version to be the most recent (i.e. you can't create another version until you've made the previous one immutable).
  2. In the inventory.json file versions section, allow referring to a file in the manifest by a unique truncated first part of the hash, instead of putting in the whole hash.
  3. In the inventory.json manifest section, add a unique ID field for each file, which can then be used in the versions section to refer to the file.

OK, tear them apart. :)
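Rough sizing for proposals 2 and 3: in the version state blocks, each file reference is currently a 128-character sha512 hex digest, so swapping in a short manifest-assigned key shrinks that component by an order of magnitude (the 8-character key length is an assumption, and path names and JSON overhead are ignored):

```python
files_per_version, versions = 9, 1000
full = files_per_version * versions * 128  # full sha512 hex key per entry
short = files_per_version * versions * 8   # e.g. an 8-char truncated key
print(f"{full:,} vs {short:,} state-key bytes")  # 1,152,000 vs 72,000
```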

pwinckles commented 4 years ago

Here's a crack at a summary. It's a long thread, so let me know if I missed something.

If Fedora stores every object change in OCFL, there is the potential for generating a large number of “unnecessary” or “insignificant” versions. There are two consequences of this:

  1. Content binaries that are part of “insignificant” versions will be preserved unnecessarily
  2. Inventory files will become bloated

This ticket was originally created to discuss the problem of inventory bloat, but the topics discussed have been wide-ranging.

The following possible solutions have been proposed:

  1. Reduce the size of the inventory file [Only addresses inventory bloat]
    1. Do not pretty print JSON
    2. Use SHA-256 instead of SHA-512
    3. Use a shorter file ID within the inventory file
  2. Reduce the number of inventory files [Only addresses inventory bloat]
    1. Do not keep copies of the inventory within the version directory
    2. Only keep the N most recent inventory copies
  3. Rewrite objects to remove unwanted versions
    1. Implement an “unversioned” object by replacing the object on every update so that it is always at v1
    2. Implement a squash operation to create a new copy of an object with unwanted versions removed
  4. Allow for a mutable head version
    1. Store a mutable head version of an object in the deposit directory – however, versions cannot be stored here indefinitely
    2. Store a mutable head version in the object root
  5. Change object modeling and OCFL interactions
    1. Don’t model collection membership in an aggregate object
    2. Don’t put an object in OCFL that is being actively worked on
neilsjefferies commented 4 years ago

Valid point.

Bizarrely, Intel SHA extensions were released on precisely one processor line at the time... the feeble low-power Goldmonts that appeared almost nowhere. However, AMD added support in Ryzen CPUs, and Intel is finally getting its act together and may support it in upcoming CPUs! Linux has support for the instructions now; no idea about Windows. ARM processors also have SHA acceleration instructions in most modern iterations.

For the second point, OCFL uses hashes for content addressing rather than key exchange so that is not a concern.


whikloj commented 4 years ago

@pwinckles

Content binaries that are part of “insignificant” versions will be preserved unnecessarily

You mean binaries that are changing insignificantly, right? Because otherwise I thought we only store them once and just refer to them.

pwinckles commented 4 years ago

@pwinckles

Content binaries that are part of “insignificant” versions will be preserved unnecessarily

You mean binaries that are changing insignificantly, right? Because otherwise I thought we only store them once and just refer to them.

I mean if a user is making incremental changes to a binary before getting it into a state that they would consider to be "settled." This is the case that Paul mentioned a few tech calls ago, where his organization deals with video files that they make edits to, save to Fedora, make more edits, and repeat. In an ideal case most would likely only want the "settled" binaries to be versioned.

julianmorley commented 4 years ago

In reference to @pwinckles' summary, I'm all for 5.ii, "Don't put an object in OCFL that is being actively worked on". We may need to do a better job of explaining the goals of OCFL; it's for objects at rest that are ready for long-term preservation, possibly on WORM storage.

As a point of reference, Stanford's preservation system, which uses a very similar versioning paradigm, has > 1.7 million objects / 650TB of data in it, some over 10 years old; our highest version number is 22 and our mean is 2.88.

Our Fedora instance has a separate workspace and datastore, where daily work occurs. That datastore is backed up just like any other line-of-business application. It does not back up every file change as it occurs, as that would be nuts: it backs up any changes to files or the database that occurred over the previous 24 hours.

rosy1280 commented 4 years ago

The discussion with Fedora committers was: where is a mutable head created, and can the spec include language that makes this more clear?