OCFL / Use-Cases

A repository to help capture, track, and discuss use cases for OCFL. Issues-only, please.

Defining a repository from peer storage roots #43

Closed marcolarosa closed 6 months ago

marcolarosa commented 2 years ago

[Moved from the spec issues repository as this describes a new use case of handling multiple storage roots making up one repository. It includes both the aggregation of content in multiple storage roots and possibly replication of content.]

This may be a part of issue OCFL/spec#22 and it certainly follows on from the comment.

My institution can't provide a single 200TB volume (!). But they can give me 2 x 70TB and a 60TB volume. So for my use case I now need to have 3 OCFL filesystems that I interact with as a single unit from my service.

Given this, it would be nice to be able to define metadata at the repository level that says this filesystem is part of a larger set of peers. Nice-to-haves would include defining a priority for each peer and perhaps the storage tier. That way, clients can make smart decisions by ranking peers by tier and then by priority (I imagine these are properties defined by the administrators provisioning the storage).

The justification for this is that any connecting service or user inspecting the filesystem can identify that it is part of a larger set.

For example - a storage.json or some such with content like:

{
  "peers": [
    {
      "type": "filesystem",
      "mountpoint": "/mnt/ocfl-repo1",
      "priority": 1,
      "tier": "hot"
    },
    {
      "type": "filesystem",
      "mountpoint": "/mnt/ocfl-repo2",
      "priority": 2,
      "tier": "cold"
    },
    {
      "type": "s3",
      "endpointUrl": null,
      "forcePathStyle": false,
      "priority": 2,
      "tier": "warm"
    },
    {
      "type": "filesystem",
      "mountpoint": "/mnt/ocfl-repo3",
      "priority": 1,
      "tier": "hot"
    }
  ]
}

For the s3 peer, endpointUrl is null (or absent) for AWS S3, or a URL for something like a local MinIO instance; forcePathStyle may be true or false, defaults to false when absent, and is required for MinIO.

In this model, priority can be any sequential number and tier could be 'hot', 'warm', or 'cold', to dovetail with the nomenclature typically used in the industry.
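As a rough illustration of how a client might rank peers by tier and then priority, here is a minimal Python sketch. The `Peer` class, the `rank_peers` function, and the `location` field are hypothetical names, and the tier ordering (hot before warm before cold) and "lower priority number wins" are assumptions, since the proposal does not define either.

```python
from dataclasses import dataclass

# Assumed ordering: a lower rank is preferred. The proposal names the tiers
# but does not say how they should be ordered.
TIER_RANK = {"hot": 0, "warm": 1, "cold": 2}


@dataclass(frozen=True)
class Peer:
    type: str
    priority: int
    tier: str
    location: str = ""  # hypothetical: a mountpoint or an endpoint URL


def rank_peers(peers):
    """Return peers sorted by tier (hot first), then by ascending priority."""
    return sorted(peers, key=lambda p: (TIER_RANK[p.tier], p.priority))


peers = [
    Peer("filesystem", 1, "hot", "/mnt/ocfl-repo1"),
    Peer("filesystem", 2, "cold", "/mnt/ocfl-repo2"),
    Peer("s3", 2, "warm"),
    Peer("filesystem", 1, "hot", "/mnt/ocfl-repo3"),
]
ranked = rank_peers(peers)
```

Because Python's sort is stable, peers that share a tier and a priority keep the order in which they were listed, so a real implementation would probably want an explicit tie-breaker.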

zimeon commented 2 years ago

IMO this is distinct from OCFL/spec#22.

I think the idea of having a way to describe that a storage root contains partial content for a "repository" that is spread across multiple storage roots, or that one or more replica copies of a storage root exist, is interesting. I think there are perhaps different requirements for these two use cases. For example, the notions of priority and tier seem relevant only for the replica use case (where one might select which to access based on the values). The other thing I wonder about is whether this is best expressed inside a storage root (perhaps as an extension) or would be defined at some as-yet-undefined higher level of configuration/assembly.

rosy1280 commented 10 months ago

Feedback on Use Cases

In advance of version 2 of the OCFL, we are soliciting feedback on use cases. Please feel free to add your thoughts on this use case via the comments.

Polling on Use Cases

In addition to reviewing comments, we are doing an informal poll for each use case that has been tagged as Proposed: In Scope for version 2. You can contribute to the poll for this use case by reacting to this comment. The following reactions are supported:

In favor of the use case: 👍🏼
Against the use case: 👎🏼
Neutral on the use case: 👀

The poll will remain open through the end of February 2024.

bbpennel commented 10 months ago

I understand needing data to be distributed across many storage locations/options, but I'm not sure I totally understand why OCFL needs to be aware of this. It would be helpful to hear more about what is gained by having all the storage roots in one OCFL repository, versus having an application layer above OCFL be aware of multiple repositories. Would the OCFL specification be moving towards handling additional functions like replication, tiering and load balancing, or is it primarily for ease of discovery by a client without needing to keep track of multiple repositories?
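To make the application-layer alternative concrete, here is a minimal Python sketch of a service that treats several independent OCFL storage roots as one logical repository. The function name `find_object` is hypothetical, and the sketch assumes every root uses the same storage layout, so that an object has the same relative path in each root; it identifies an object root by its `0=ocfl_object_` conformance declaration file.

```python
import os


def find_object(roots, object_path):
    """Return the first storage root that holds an OCFL object at object_path.

    roots is an ordered list of storage root directories; object_path is the
    object's path relative to a storage root (assumed identical in all roots).
    Returns None when no root contains the object.
    """
    for root in roots:
        candidate = os.path.join(root, object_path)
        if not os.path.isdir(candidate):
            continue
        # An OCFL object root carries a conformance declaration file
        # such as "0=ocfl_object_1.0".
        if any(name.startswith("0=ocfl_object_") for name in os.listdir(candidate)):
            return root
    return None
```

In this arrangement each storage root stays individually valid, and knowledge of the set of roots lives entirely in the calling service.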

srerickson commented 10 months ago

I agree with @bbpennel -- this feels to me like functionality that doesn't need to be part of the core spec. Perhaps there is a reason this can't be implemented as an extension, but I don't see it.

marcolarosa commented 10 months ago

Wow - this is a blast from the past!

I've long since moved on from that project but we decided quite a while ago to stop using OCFL altogether. The complexity of the spec and the compromises we were required to accept just didn't stack up. I don't know if the project will reconsider OCFL in the future but I do know it won't be using the architecture described in this ticket (which we weren't happy about in any case) so I think this can be canned.

zimeon commented 10 months ago

I agree with other comments that this should not be part of the core OCFL specification. I think we would need to see experiments combining individually valid OCFL Storage Roots to explore what would be needed at the core level and could not be implemented through a separate higher-level specification.

zimeon commented 6 months ago

2024-02-29 Editors agree that we will close as out of scope. Comments do not support inclusion in the spec and the original institutional use case no longer applies. Voting at time of closing is -2.