earth-mover / icechunk

Open-source, cloud-native transactional tensor storage engine
https://icechunk.io
Apache License 2.0

Virtual chunks improvements #430

paraseba opened this issue 2 days ago

paraseba commented 2 days ago

This doc has been moved to https://github.com/earth-mover/icechunk/pull/436

What we need

  1. Virtual chunks can be stored in all supported object stores
  2. Multiple credential sets for virtual chunk resolution
  3. Optimized manifests where the same prefix doesn't need to be repeated for every virtual chunk
  4. Be able to know all the "places" pointed to by a repo, without having to load every manifest in every version
  5. Object level change detection (not serving a virtual chunk if the containing file changed)

Discussion

Chunk URL styles

Prefixes

Storing the full URL for each chunk duplicates a lot of data in the manifest, making it large. The idea is to extract common prefixes and identify them with a short id. The manifest then only needs to store the prefix id + maybe an extra string + length + offset.

In the first URL style above, the prefix could be the full file, s3://some-bucket/some-prefix/some-file.nc. In the second example there are options for the prefix: it could be s3://some-bucket/some-prefix/c/0/0/1, s3://some-bucket/some-prefix/c/0/0, s3://some-bucket/some-prefix/c/0, or s3://some-bucket/some-prefix/c, with a different number of chunks under it in each case.
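As a concrete picture of that encoding, here is a minimal sketch (the names are hypothetical, not the actual icechunk types):

// hypothetical sketch of the prefix-table idea
struct PrefixTable {
    prefixes: Vec<String>, // e.g. ["s3://some-bucket/some-prefix/c/0/0"]
}

struct PrefixedChunkRef {
    prefix_id: u16, // index into PrefixTable::prefixes
    suffix: String, // e.g. "/1", appended to the prefix to form the full url
    offset: u64,
    length: u64,
}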

Intelligence is needed to optimize what is considered a good prefix. If done badly, we could have:

Should prefixes be per snapshot or global?

The advantage of per-snapshot prefixes is that we don't need to load and parse all the prefixes when the current snapshot uses only a few. A case for this would be a virtual dataset that has been materialized in a newer version: for that snapshot, we wouldn't need to load any prefixes at all.

On the other hand, requirement 4 asks for the full list of "dependencies" for the dataset, and that means for all versions, which requires a global list of prefixes. Without a global list, we would need to fetch the list of prefixes of each snapshot.

Who creates/decides the prefixes

Obviously, generating the list of prefixes given the list of virtual chunks can be done automatically. But it's a tricky optimization problem. One strategy could be to build a trie of all chunk URLs and then use some heuristic to select heavily populated nodes as prefixes, as in the sketch below. This is potentially computationally expensive for large repos.
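A minimal sketch of that heuristic, assuming URLs are split into path segments and a prefix counts as "good" when at least min_chunks chunks live under it (all names hypothetical):

// hypothetical sketch: count chunks under each path prefix with a trie,
// then keep the longest prefixes that still cover at least `min_chunks` chunks
use std::collections::HashMap;

#[derive(Default)]
struct TrieNode {
    children: HashMap<String, TrieNode>,
    chunk_count: usize, // chunks at or below this prefix
}

impl TrieNode {
    fn insert(&mut self, segments: &[&str]) {
        self.chunk_count += 1;
        if let Some((first, rest)) = segments.split_first() {
            self.children.entry(first.to_string()).or_default().insert(rest);
        }
    }

    // descend while some child still covers `min_chunks` chunks; emit the
    // deepest such prefixes (longer prefixes save more bytes per chunk)
    fn prefixes(&self, path: &mut Vec<String>, min_chunks: usize, out: &mut Vec<String>) {
        let dense: Vec<_> = self
            .children
            .iter()
            .filter(|(_, c)| c.chunk_count >= min_chunks)
            .collect();
        if dense.is_empty() {
            if !path.is_empty() {
                out.push(path.join("/"));
            }
            return;
        }
        for (segment, child) in dense {
            path.push(segment.clone());
            child.prefixes(path, min_chunks, out);
            path.pop();
        }
    }
}

fn main() {
    let mut root = TrieNode::default();
    for url in ["some-bucket/some-prefix/c/0/0/0", "some-bucket/some-prefix/c/0/0/1"] {
        root.insert(&url.split('/').collect::<Vec<_>>());
    }
    let mut out = Vec::new();
    root.prefixes(&mut Vec::new(), 2, &mut out);
    assert_eq!(out, ["some-bucket/some-prefix/c/0/0"]);
}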

An alternative is to let the user dictate the prefixes. This way, they can optimize them knowing beforehand what virtual chunks they will add. If they don't optimize, they get large manifests without prefix extraction. We still need to satisfy requirements 2 and 4, which would require automatic extraction of at least the bucket and platform.

Additionally, the optimization problem can be treated as an extra process (similar to compaction or garbage collection), and run offline on demand.

Templates instead of prefixes

Using templates instead of plain prefixes would be even more efficient. For example:

If we use the template s3://some-bucket/some-prefix/{}-analysis-output-for-model-abc.nc, each chunk only needs a couple of extra characters to fill the placeholders in the template.
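As an illustration, here is a sketch of filling such a template, under the assumption that per-chunk arguments may be None and fall back to container-level defaults (the function name and fallback behavior are hypothetical):

// hypothetical helper: fill each {} placeholder with the matching argument,
// falling back to a default when the per-chunk argument is None
fn expand_template(template: &str, args: &[Option<String>], defaults: &[String]) -> String {
    let mut parts = template.split("{}");
    let mut result = parts.next().unwrap_or("").to_string();
    for (i, part) in parts.enumerate() {
        let value = args
            .get(i)
            .and_then(|a| a.as_deref())
            .or_else(|| defaults.get(i).map(String::as_str))
            .unwrap_or("");
        result.push_str(value);
        result.push_str(part);
    }
    result
}

// expand_template("s3://some-bucket/some-prefix/{}-analysis-output-for-model-abc.nc",
//                 &[Some("jan".to_string())], &[])
//   == "s3://some-bucket/some-prefix/jan-analysis-output-for-model-abc.nc"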

Supporting multiple clouds

When the user adds a virtual ref, they need to indicate the details of the platform to retrieve it. These include which cloud it is (so we know how to set up credentials), region, endpoint URL, port, and other settings. It looks a lot like our S3Config type (or an expanded version of that).

Supporting multiple credentials

Credentials are a read-time concern. When the user writes the virtual chunk, they don't necessarily know which credential set will be used to read it.

Credentials need to be specified later, when the chunks are going to be read: for example, by providing in the runtime configuration a credential set for each virtual chunk bucket/prefix/template, as sketched below.
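One way this could look, as a hedged sketch (the type and field names are hypothetical, not an actual icechunk API):

// hypothetical read-time configuration: credentials are supplied per
// virtual chunk container when the repo is opened, not when chunks are written
use std::collections::HashMap;

enum Credentials {
    Anonymous,
    S3 { access_key_id: String, secret_access_key: String },
    // ... other platforms
}

struct RuntimeConfig {
    // keyed by container name; containers without an entry fall back to defaults
    virtual_chunk_credentials: HashMap<String, Credentials>,
}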

Design

Explicit containers for virtual chunks

// this is pseudo-code
struct VirtualChunkContainer {
    name: String,  // something like model-output
    platform: Platform,  // something like S3
    url_template: String, // something like some-bucket/some-prefix/output/{}/{}.nc
    default_template_arguments: Vec<String>,
    cloud_client_config: Map<String, Any>,  // something like {endpoint_url: 8080}
}

Global config

// this is pseudo-code
struct Config {
    inline_chunk_threshold_bytes: u16,
    virtual_chunk_containers: Vec<VirtualChunkContainer>,
}
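For instance, a repo whose virtual chunks live under one S3 prefix might declare a single container up front (continuing the pseudo-code above; the concrete values are illustrative):

// pseudo-code: one container declared up front for all model-output chunks
let config = Config {
    inline_chunk_threshold_bytes: 512,
    virtual_chunk_containers: vec![VirtualChunkContainer {
        name: "model-output".to_string(),
        platform: Platform::S3,
        url_template: "some-bucket/some-prefix/output/{}/{}.nc".to_string(),
        default_template_arguments: vec![],
        cloud_client_config: HashMap::new(),
    }],
};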

New virtual chunk ref

We change from

pub struct VirtualChunkRef {
    pub location: VirtualChunkLocation, // this is basically a String for the URL today
    pub offset: ChunkOffset,
    pub length: ChunkLength,
}

To

pub struct VirtualChunkRef {
    pub container_index: u16,  // index into the list of virtual chunk containers
    pub container_template_args: Vec<Option<String>>,  // fills the template's {} placeholders (None presumably falls back to the container defaults)
    pub offset: ChunkOffset,
    pub length: ChunkLength,
    pub last_modified: Option<u32>,  // timestamp of the containing object, for change detection
}

Reading virtual chunks
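Roughly, reading resolves the ref through its container: look up the container by index, expand the template, and issue a ranged read with a (cached) client for that platform. A hedged sketch, reusing the pseudo-code types above and the expand_template helper sketched earlier (client_for and get_range are hypothetical):

// pseudo-code for the read path
fn read_virtual_chunk(config: &Config, chunk: &VirtualChunkRef) -> Vec<u8> {
    let container = &config.virtual_chunk_containers[chunk.container_index as usize];
    let url = expand_template(
        &container.url_template,
        &chunk.container_template_args,
        &container.default_template_arguments,
    );
    // cached per-container object store client, built from platform + config
    let client = client_for(&container.platform, &container.cloud_client_config);
    client.get_range(&url, chunk.offset, chunk.length)
}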

Writing virtual chunks
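Writing goes the other way: a new virtual ref's URL is matched against a container's template, and the extracted placeholder values become container_template_args. A sketch of such a matcher (hypothetical, with simple leftmost-match semantics):

// hypothetical matcher: returns the placeholder values if `url` fits `template`
fn match_template(template: &str, url: &str) -> Option<Vec<Option<String>>> {
    let mut literals = template.split("{}");
    let mut rest = url.strip_prefix(literals.next().unwrap_or(""))?;
    let literals: Vec<&str> = literals.collect();
    let mut args = Vec::new();
    for (i, lit) in literals.iter().enumerate() {
        if lit.is_empty() {
            // a trailing {} consumes everything left; {}{} with no separator is ambiguous
            if i + 1 != literals.len() {
                return None;
            }
            args.push(Some(rest.to_string()));
            rest = "";
        } else {
            let pos = rest.find(*lit)?;
            args.push(Some(rest[..pos].to_string()));
            rest = &rest[pos + lit.len()..];
        }
    }
    rest.is_empty().then_some(args)
}

// match_template("some-bucket/some-prefix/output/{}/{}.nc",
//                "some-bucket/some-prefix/output/2020/temp.nc")
//   == Some(vec![Some("2020".to_string()), Some("temp".to_string())])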

Check requirements are satisfied

  1. Virtual chunks can be stored in all supported object stores

    Now the virtual chunk containers indicate the platform, so we can create the appropriate object store client at read time. These clients are cached for performance.

  2. Multiple credential sets for virtual chunk resolution

    Now the user initializes a Repo instance by passing, if needed, a credential set for each virtual chunk container. If a chunk is read for which there are no credentials, we default to a common set of credentials for virtual chunks, and ultimately to the Storage credentials if on the same platform.

  3. Optimized manifests where the same prefix doesn't need to be repeated for every virtual chunk

    For large prefixes, the template + arguments creates some savings. For example, a template for virtual zarr like some-long-bucket-name/with-some-even-longer-prefix/and-more-nesting/some-long-array-name-for-the-dataset/c/{}/{}/{}. On the other hand, the template arguments vector has a 24-character overhead on top of the values themselves, so the savings are small for short prefixes.

  4. Be able to know all the "places" pointed to by a repo, without having to load every manifest in every version

    This can now be retrieved directly from the repository config: it's basically the set of virtual chunk containers (or at least a subset of it).

  5. Object level change detection (not serving a virtual chunk if the containing file changed)

    We optionally store the last-modified timestamp of the object containing each chunk. This is not optimal: a hash would be better but takes more space. Storing it per chunk is also not optimal, since many chunks come from the same object, but we have no other easy place to store it when a container includes more than one object. We could add more configuration to optimize this in the future.

TomNicholas commented 1 day ago

Object level change detection (not serving a virtual chunk if the containing file changed)

I was thinking about this yesterday and planning to raise an issue about it. I think it would make the whole virtual references idea a lot more robust if icechunk could flag whenever the last modified timestamp on an object was later than the commit timestamp, because there is basically no scenario in which that is correct.
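A sketch of that flag, assuming both the object's last-modified time and the commit time are unix timestamps (the function name is hypothetical):

// hypothetical check: a virtual chunk's backing object must not have been
// modified after the commit that recorded the reference
fn flag_stale_ref(object_last_modified: Option<u32>, commit_timestamp: u32) -> Option<String> {
    match object_last_modified {
        Some(ts) if ts > commit_timestamp => Some(format!(
            "object modified at {ts}, after commit at {commit_timestamp}"
        )),
        _ => None,
    }
}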

Be able to know all the "places" pointed to by a repo

This would also be very useful, else widespread use of virtual references could lead to really complicated compound virtual datasets...

For large prefixes, the template + arguments creates some savings. For example, a template for virtual zarr like some-long-bucket-name/with-some-even-longer-prefix/and-more-nesting/some-long-array-name-for-the-dataset/c/{}/{}/{}.

FWIW VirtualiZarr could also do this for the in-memory manifest. But given that the in-memory size didn't seem to be a bottleneck in #401, I don't think this is particularly important. The idea is also sort of related to https://github.com/zarr-developers/VirtualiZarr/issues/238#issuecomment-2386229261.

Having to declare the container before being able to write is a bit inconvenient. An alternative would be to automatically create a container that includes only the bucket name as template.

I would very much like to keep the simplicity of vds.virtualize.to_icechunk(store) in virtualizarr if we can. This is totally an implementation detail that users won't want to/shouldn't have to care about. In order, I would prefer:

  1. Just do it the inefficient way by default
  2. OR auto-choose a performant prefix based on the current snapshot
  3. Expose something extra to users.

Maybe (3) could be accompanied by a warning above some arbitrary threshold? e.g. "You are about to create 2GB of virtual references - you might want to optimize the on-disk size by explicitly choosing a suitable prefix"

paraseba commented 1 day ago

Great feedback @TomNicholas, thank you.

if icechunk could flag whenever the last modified timestamp on an object was later than the commit timestamp

Yes, with these changes to the format we can implement an interface that sets the last-modified timestamp either to the write time, or to anything the user passes. So even by default, we can get this "don't pull changed data" behavior.

Just do it the inefficient way by default
OR auto-choose a performant prefix based on the current snapshot
OR Expose something extra to users.

The issue with doing nothing is not so much the increase in manifest size as requirement 4. We want to know all the "places" we depend on, so we need to extract at least some form of prefix. I think your second option is doable, by building some way to generate the containers as we go. It's not going to be optimal, but it should work: think something like a container per bucket, as sketched below.
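For example (a sketch building on the pseudo-code VirtualChunkContainer above; Platform::S3 and the map types are assumptions): when a new ref matches no declared container, derive a coarse one keyed by its bucket, with the whole object key as the template argument.

// pseudo-code: derive one coarse container per bucket as refs are added
use std::collections::HashMap;

fn container_for_bucket<'a>(
    containers: &'a mut HashMap<String, VirtualChunkContainer>,
    bucket: &str,
) -> &'a VirtualChunkContainer {
    containers.entry(bucket.to_string()).or_insert_with(|| VirtualChunkContainer {
        name: bucket.to_string(),
        platform: Platform::S3, // assumed; would be detected from the url scheme
        url_template: format!("{bucket}/{{}}"), // the whole object key fills the placeholder
        default_template_arguments: vec![],
        cloud_client_config: HashMap::new(),
    })
}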

In general, I think this is mostly about setting up the repository config well. If that's done, then matching the right container while adding virtual references should be doable. I'll think more about it.

paraseba commented 1 day ago

I realize I don't like storing the Config with the strategy we use for branches, because it requires a list + a get. Even if the first of those is concurrent with fetching the refs or a snapshot, it's not ideal. We could use the new PUT if-match functionality in S3, but that would make this the first place where we update files in place. More thought needed.
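The if-match approach is essentially a compare-and-swap on the config object's etag. A sketch of the loop, over a hypothetical store interface (not a real SDK API):

// hypothetical interface: conditional PUT keyed on the object's current etag
trait ConfigStore {
    fn get(&self) -> (Vec<u8>, String); // (config bytes, etag)
    fn put_if_match(&self, bytes: &[u8], etag: &str) -> bool; // false if etag changed
}

fn update_config(store: &dyn ConfigStore, modify: impl Fn(Vec<u8>) -> Vec<u8>) {
    loop {
        let (bytes, etag) = store.get();
        if store.put_if_match(&modify(bytes), &etag) {
            return;
        }
        // someone else updated the config first: reload and retry
    }
}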

rabernat commented 1 day ago

There's a lot to unpack here. (Meta point: I'm finding it hard to "review" a GitHub issue and am missing a more Google-docs-like experience. Maybe for the next one of these, we just use a google doc? Or make the design doc itself a doc in the repo and review it via PR?)

Anyway, here are a few thoughts, not particularly organized.


I'm very :-1: on creating a new mechanism for storing "config" outside of the existing snapshot system, particularly if it involves transactional updates. I would much rather create a new type of metadata file that is referenced from the snapshot. I'd like the entire state of the store to be self contained within a snapshot.


What if we created a manifest of every single virtual file that is referenced in the repo? How big could this be? It's hard to imagine more than, say, 1_000_000 virtual files. But if so, we could split it into multiple manifests, as we do with chunks.

This would bring some big advantages.

And some disadvantages:

Basically I'm suggesting we normalize our schema and factor out virtual files into an entity of their own.
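A sketch of that normalization (names hypothetical): virtual files become first-class entities, and chunk refs point at them by id, which also gives last_modified a natural per-file home.

// hypothetical normalized schema: virtual files factored out of chunk refs
struct VirtualFile {
    id: u64,                    // referenced by chunk refs below
    location: String,           // full url of the backing object
    last_modified: Option<u32>, // stored once per file instead of per chunk
}

// the chunk ref then shrinks to
struct VirtualChunkRef {
    file_id: u64, // points into the virtual file manifest
    offset: u64,
    length: u64,
}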

I have not fully thought through how this proposal interacts with the concept of a VirtualChunkContainer. I guess we need to decide what we mean by "location" for data. For me, it's enough to know

This is what's needed to provide credentials. Everything else is part of the key and doesn't need to be exposed at this level.

paraseba commented 1 day ago

Moved this to PR #436 to make it easier to iterate and get feedback.