earth-mover / icechunk

Open-source, cloud-native transactional tensor storage engine
https://icechunk.io
Apache License 2.0

Virtual chunks improvements #430

paraseba opened this issue 2 days ago

paraseba commented 2 days ago

This doc has been moved to https://github.com/earth-mover/icechunk/pull/436

What we need

  1. Virtual chunks can be stored in all supported object stores
  2. Multiple credential sets for virtual chunk resolution
  3. Optimized manifests where the same prefix doesn't need to be repeated for every virtual chunk
  4. Be able to know all the "places" pointed to by a repo, without having to load every manifest in every version
  5. Object level change detection (not serving a virtual chunk if the containing file changed)

Discussion

Chunk URL styles

Prefixes

Storing the full URL for each chunk duplicates a lot of data in the manifest, making it large. The idea is to extract common prefixes and identify them with a short id. The manifest then only needs to store the prefix id + maybe an extra string + length + offset.

In the first URL style above, the prefix could be the full file, s3://some-bucket/some-prefix/some-file.nc. In the second example there are options for the prefix: it could be s3://some-bucket/some-prefix/c/0/0/1, s3://some-bucket/some-prefix/c/0/0, s3://some-bucket/some-prefix/c/0, or s3://some-bucket/some-prefix/c, with a different number of chunks under it in each case.
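As a concrete picture of that encoding, here is a minimal sketch (the names are hypothetical, not the actual icechunk types):

// hypothetical sketch of the prefix-table idea
struct PrefixTable {
    prefixes: Vec<String>, // e.g. ["s3://some-bucket/some-prefix/c/0/0"]
}

struct PrefixedChunkRef {
    prefix_id: u16, // index into PrefixTable::prefixes
    suffix: String, // e.g. "/1", appended to the prefix to form the full url
    offset: u64,
    length: u64,
}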

Intelligence is needed to optimize what is considered a good prefix. If done badly, we could have:

Should prefixes be per snapshot or global?

The advantage of per-snapshot prefixes is that we don't need to load and parse all the prefixes when the current snapshot uses only a few. A case for this would be a virtual dataset that has been materialized in a newer version: for that snapshot, we wouldn't need to load any prefixes at all.

On the other hand, requirement 4 asks for the full list of "dependencies" for the dataset, and that means for all versions, which requires a global list of prefixes. Without a global list, we would need to fetch the list of prefixes of each snapshot.

Who creates/decides the prefixes

Obviously, generating the list of prefixes given the list of virtual chunks can be done automatically. But it's a tricky optimization problem. One strategy could be to build a trie of all chunk URLs and then use some heuristic to select heavily populated nodes as prefixes, as in the sketch below. This is potentially computationally expensive for large repos.
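A minimal sketch of that heuristic, assuming URLs are split into path segments and a prefix counts as "good" when at least min_chunks chunks live under it (all names hypothetical):

// hypothetical sketch: count chunks under each path prefix with a trie,
// then keep the longest prefixes that still cover at least `min_chunks` chunks
use std::collections::HashMap;

#[derive(Default)]
struct TrieNode {
    children: HashMap<String, TrieNode>,
    chunk_count: usize, // chunks at or below this prefix
}

impl TrieNode {
    fn insert(&mut self, segments: &[&str]) {
        self.chunk_count += 1;
        if let Some((first, rest)) = segments.split_first() {
            self.children.entry(first.to_string()).or_default().insert(rest);
        }
    }

    // descend while some child still covers `min_chunks` chunks; emit the
    // deepest such prefixes (longer prefixes save more bytes per chunk)
    fn prefixes(&self, path: &mut Vec<String>, min_chunks: usize, out: &mut Vec<String>) {
        let dense: Vec<_> = self
            .children
            .iter()
            .filter(|(_, c)| c.chunk_count >= min_chunks)
            .collect();
        if dense.is_empty() {
            if !path.is_empty() {
                out.push(path.join("/"));
            }
            return;
        }
        for (segment, child) in dense {
            path.push(segment.clone());
            child.prefixes(path, min_chunks, out);
            path.pop();
        }
    }
}

fn main() {
    let mut root = TrieNode::default();
    for url in ["some-bucket/some-prefix/c/0/0/0", "some-bucket/some-prefix/c/0/0/1"] {
        root.insert(&url.split('/').collect::<Vec<_>>());
    }
    let mut out = Vec::new();
    root.prefixes(&mut Vec::new(), 2, &mut out);
    assert_eq!(out, ["some-bucket/some-prefix/c/0/0"]);
}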

An alternative is to let the user dictate the prefixes. This way, they can optimize them knowing beforehand what virtual chunks they will add. If they don't optimize, they get large manifests without prefix extraction. We still need to satisfy requirements 2 and 4, which would require automatic extraction of at least the bucket and platform.

Additionally, the optimization problem can be treated as an extra process (similar to compaction or garbage collection), and run offline on demand.

Templates instead of prefixes

Using templates instead of plain prefixes would be even more efficient. For example:

If we use the template s3://some-bucket/some-prefix/{}-analysis-output-for-model-abc.nc, each chunk only needs a couple of extra characters to fill the placeholders in the template.
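As an illustration, here is a sketch of filling such a template, under the assumption that per-chunk arguments may be None and fall back to container-level defaults (the function name and fallback behavior are hypothetical):

// hypothetical helper: fill each {} placeholder with the matching argument,
// falling back to a default when the per-chunk argument is None
fn expand_template(template: &str, args: &[Option<String>], defaults: &[String]) -> String {
    let mut parts = template.split("{}");
    let mut result = parts.next().unwrap_or("").to_string();
    for (i, part) in parts.enumerate() {
        let value = args
            .get(i)
            .and_then(|a| a.as_deref())
            .or_else(|| defaults.get(i).map(String::as_str))
            .unwrap_or("");
        result.push_str(value);
        result.push_str(part);
    }
    result
}

// expand_template("s3://some-bucket/some-prefix/{}-analysis-output-for-model-abc.nc",
//                 &[Some("jan".to_string())], &[])
//   == "s3://some-bucket/some-prefix/jan-analysis-output-for-model-abc.nc"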

Supporting multiple clouds

When the user adds a virtual ref, they need to indicate the details of the platform to retrieve it. These include which cloud it is (so we know how to set up credentials), region, endpoint URL, port, and other settings. It looks a lot like our S3Config type (or an expanded version of that).

Supporting multiple credentials

Credentials are a read-time concern. When the user writes the virtual chunk, they don't necessarily know which credential set will be used to read it.

Credentials need to be specified later, when the chunks are going to be read: for example, by providing in the runtime configuration a credential set for each virtual chunk bucket/prefix/template, as sketched below.
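One way this could look, as a hedged sketch (the type and field names are hypothetical, not an actual icechunk API):

// hypothetical read-time configuration: credentials are supplied per
// virtual chunk container when the repo is opened, not when chunks are written
use std::collections::HashMap;

enum Credentials {
    Anonymous,
    S3 { access_key_id: String, secret_access_key: String },
    // ... other platforms
}

struct RuntimeConfig {
    // keyed by container name; containers without an entry fall back to defaults
    virtual_chunk_credentials: HashMap<String, Credentials>,
}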

Design

Explicit containers for virtual chunks

// this is pseudo-code
struct VirtualChunkContainer {
    name: String,  // something like model-output
    platform: Platform,  // something like S3
    url_template: String, // something like some-bucket/some-prefix/output/{}/{}.nc
    default_template_arguments: Vec<String>,
    cloud_client_config: Map<String, Any>,  // something like {endpoint_url: 8080}
}

Global config

// this is pseudo-code
struct Config {
    inline_chunk_threshold_bytes: u16,
    virtual_chunk_containers: Vec<VirtualChunkContainer>,
}
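For instance, a repo whose virtual chunks live under one S3 prefix might declare a single container up front (continuing the pseudo-code above; the concrete values are illustrative):

// pseudo-code: one container declared up front for all model-output chunks
let config = Config {
    inline_chunk_threshold_bytes: 512,
    virtual_chunk_containers: vec![VirtualChunkContainer {
        name: "model-output".to_string(),
        platform: Platform::S3,
        url_template: "some-bucket/some-prefix/output/{}/{}.nc".to_string(),
        default_template_arguments: vec![],
        cloud_client_config: HashMap::new(),
    }],
};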

New virtual chunk ref

We change from

pub struct VirtualChunkRef {
    pub location: VirtualChunkLocation, // this is basically a String for the URL today
    pub offset: ChunkOffset,
    pub length: ChunkLength,
}

To

pub struct VirtualChunkRef {
    pub container_index: u16,  // index into the list of virtual chunk containers
    pub container_template_args: Vec<Option<String>>,  // fills the template's {} placeholders (None presumably falls back to the container defaults)
    pub offset: ChunkOffset,
    pub length: ChunkLength,
    pub last_modified: Option<u32>,  // timestamp of the containing object, for change detection
}

Reading virtual chunks
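Roughly, reading resolves the ref through its container: look up the container by index, expand the template, and issue a ranged read with a (cached) client for that platform. A hedged sketch, reusing the pseudo-code types above and the expand_template helper sketched earlier (client_for and get_range are hypothetical):

// pseudo-code for the read path
fn read_virtual_chunk(config: &Config, chunk: &VirtualChunkRef) -> Vec<u8> {
    let container = &config.virtual_chunk_containers[chunk.container_index as usize];
    let url = expand_template(
        &container.url_template,
        &chunk.container_template_args,
        &container.default_template_arguments,
    );
    // cached per-container object store client, built from platform + config
    let client = client_for(&container.platform, &container.cloud_client_config);
    client.get_range(&url, chunk.offset, chunk.length)
}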

Writing virtual chunks
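Writing goes the other way: a new virtual ref's URL is matched against a container's template, and the extracted placeholder values become container_template_args. A sketch of such a matcher (hypothetical, with simple leftmost-match semantics):

// hypothetical matcher: returns the placeholder values if `url` fits `template`
fn match_template(template: &str, url: &str) -> Option<Vec<Option<String>>> {
    let mut literals = template.split("{}");
    let mut rest = url.strip_prefix(literals.next().unwrap_or(""))?;
    let literals: Vec<&str> = literals.collect();
    let mut args = Vec::new();
    for (i, lit) in literals.iter().enumerate() {
        if lit.is_empty() {
            // a trailing {} consumes everything left; {}{} with no separator is ambiguous
            if i + 1 != literals.len() {
                return None;
            }
            args.push(Some(rest.to_string()));
            rest = "";
        } else {
            let pos = rest.find(*lit)?;
            args.push(Some(rest[..pos].to_string()));
            rest = &rest[pos + lit.len()..];
        }
    }
    rest.is_empty().then_some(args)
}

// match_template("some-bucket/some-prefix/output/{}/{}.nc",
//                "some-bucket/some-prefix/output/2020/temp.nc")
//   == Some(vec![Some("2020".to_string()), Some("temp".to_string())])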

Check requirements are satisfied

  1. Virtual chunks can be stored in all supported object stores

    Now the virtual chunk containers indicate the platform, so we can create the appropriate object store client at read time. These clients are cached for performance.

  2. Multiple credential sets for virtual chunk resolution

    Now the user initializes a Repo instance by passing, if needed, a credential set for each virtual chunk container. If a chunk is read for which there are no credentials, we default to a common set of credentials for virtual chunks, and ultimately to the Storage credentials if on the same platform.

  3. Optimized manifests where the same prefix doesn't need to be repeated for every virtual chunk

    For large prefixes, the template + arguments creates some savings. For example, a template for virtual zarr like some-long-bucket-name/with-some-even-longer-prefix/and-more-nesting/some-long-array-name-for-the-dataset/c/{}/{}/{}. On the other hand, the template arguments vector has a 24-character overhead on top of the values themselves, so the savings are small for short prefixes.

  4. Be able to know all the "places" pointed to by a repo, without having to load every manifest in every version

    This can now be retrieved directly from the repository config: it's basically the set of virtual chunk containers (or at least a subset of it).

  5. Object level change detection (not serving a virtual chunk if the containing file changed)

    We optionally store the last-modified timestamp of the object containing each chunk. This is not optimal: a hash would be better but takes more space. Storing it per chunk is also not optimal, since many chunks come from the same object, but we have no other easy place to store it when a container includes more than one object. We could add more configuration to optimize this in the future.

TomNicholas commented 1 day ago

Object level change detection (not serving a virtual chunk if the containing file changed)

I was thinking about this yesterday and planning to raise an issue about it. I think it would make the whole virtual references idea a lot more robust if icechunk could flag whenever the last modified timestamp on an object was later than the commit timestamp, because there is basically no scenario in which that is correct.
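A sketch of that flag, assuming both the object's last-modified time and the commit time are unix timestamps (the function name is hypothetical):

// hypothetical check: a virtual chunk's backing object must not have been
// modified after the commit that recorded the reference
fn flag_stale_ref(object_last_modified: Option<u32>, commit_timestamp: u32) -> Option<String> {
    match object_last_modified {
        Some(ts) if ts > commit_timestamp => Some(format!(
            "object modified at {ts}, after commit at {commit_timestamp}"
        )),
        _ => None,
    }
}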

Be able to know all the "places" pointed to by a repo

This would also be very useful, else widespread use of virtual references could lead to really complicated compound virtual datasets...

For large prefixes, the template + arguments creates some savings. For example, a template for virtual zarr like some-long-bucket-name/with-some-even-longer-prefix/and-more-nesting/some-long-array-name-for-the-dataset/c/{}/{}/{}.

FWIW VirtualiZarr could also do this for the in-memory manifest. But given that the in-memory size didn't seem to be a bottleneck in #401, I don't think this is particularly important. The idea is also sort of related to https://github.com/zarr-developers/VirtualiZarr/issues/238#issuecomment-2386229261.

Having to declare the container before being able to write is a bit inconvenient. An alternative would be to automatically create a container that includes only the bucket name as template.

I would very much like to keep the simplicity of vds.virtualize.to_icechunk(store) in virtualizarr if we can. This is totally an implementation detail that users won't want to/shouldn't have to care about. In order, I would prefer:

  1. Just do it the inefficient way by default
  2. OR auto-choose a performant prefix based on the current snapshot
  3. Expose something extra to users.

Maybe (3) could be accompanied by a warning above some arbitrary threshold? e.g. "You are about to create 2GB of virtual references - you might want to optimize the on-disk size by explicitly choosing a suitable prefix"

paraseba commented 1 day ago

Great feedback @TomNicholas, thank you.

if icechunk could flag whenever the last modified timestamp on an object was later than the commit timestamp

Yes, with these changes to the format we can implement an interface that sets the last-modified timestamp either to the write time, or to anything the user passes. So even by default, we can get this "don't pull changed data" behavior.

Just do it the inefficient way by default
OR auto-choose a performant prefix based on the current snapshot
OR Expose something extra to users.

The issue with doing nothing is not so much the increase in manifest size as requirement 4. We want to know all the "places" we depend on, so we need to extract at least some form of prefix. I think your second option is doable, by building some way to generate the containers as we go. It's not going to be optimal, but it should work: think something like a container per bucket, as sketched below.
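For example (a sketch building on the pseudo-code VirtualChunkContainer above; Platform::S3 and the map types are assumptions): when a new ref matches no declared container, derive a coarse one keyed by its bucket, with the whole object key as the template argument.

// pseudo-code: derive one coarse container per bucket as refs are added
use std::collections::HashMap;

fn container_for_bucket<'a>(
    containers: &'a mut HashMap<String, VirtualChunkContainer>,
    bucket: &str,
) -> &'a VirtualChunkContainer {
    containers.entry(bucket.to_string()).or_insert_with(|| VirtualChunkContainer {
        name: bucket.to_string(),
        platform: Platform::S3, // assumed; would be detected from the url scheme
        url_template: format!("{bucket}/{{}}"), // the whole object key fills the placeholder
        default_template_arguments: vec![],
        cloud_client_config: HashMap::new(),
    })
}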

In general, I think this is mostly about setting up the repository config well. If that's done, then matching the right container while adding virtual references should be doable. I'll think more about it.

paraseba commented 1 day ago

I realize I don't like storing the Config with the strategy we use for branches, because it requires a list + a get. Even if the first of those is concurrent with fetching the refs or a snapshot, it's not ideal. We could use the new PUT if-match functionality in S3, but that would make this the first place where we update files in place. More thought needed.
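The if-match approach is essentially a compare-and-swap on the config object's etag. A sketch of the loop, over a hypothetical store interface (not a real SDK API):

// hypothetical interface: conditional PUT keyed on the object's current etag
trait ConfigStore {
    fn get(&self) -> (Vec<u8>, String); // (config bytes, etag)
    fn put_if_match(&self, bytes: &[u8], etag: &str) -> bool; // false if etag changed
}

fn update_config(store: &dyn ConfigStore, modify: impl Fn(Vec<u8>) -> Vec<u8>) {
    loop {
        let (bytes, etag) = store.get();
        if store.put_if_match(&modify(bytes), &etag) {
            return;
        }
        // someone else updated the config first: reload and retry
    }
}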

rabernat commented 1 day ago

There's a lot to unpack here. (Meta point: I'm finding it hard to "review" a GitHub issue and am missing a more Google-docs-like experience. Maybe for the next one of these, we just use a google doc? Or make the design doc itself a doc in the repo and review it via PR?)

Anyway, here are a few thoughts, not particularly organized.


I'm very :-1: on creating a new mechanism for storing "config" outside of the existing snapshot system, particularly if it involves transactional updates. I would much rather create a new type of metadata file that is referenced from the snapshot. I'd like the entire state of the store to be self contained within a snapshot.


What if we created a manifest of every single virtual file that is referenced in the repo? How big could this be? It's hard to imagine more than, say, 1_000_000 virtual files. But if so, we could split it into multiple manifests, as we do with chunks.

This would bring some big advantages.

And some disadvantages:

Basically I'm suggesting we normalize our schema and factor out virtual files into an entity of their own.
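A sketch of that normalization (names hypothetical): virtual files become first-class entities, and chunk refs point at them by id, which also gives last_modified a natural per-file home.

// hypothetical normalized schema: virtual files factored out of chunk refs
struct VirtualFile {
    id: u64,                    // referenced by chunk refs below
    location: String,           // full url of the backing object
    last_modified: Option<u32>, // stored once per file instead of per chunk
}

// the chunk ref then shrinks to
struct VirtualChunkRef {
    file_id: u64, // points into the virtual file manifest
    offset: u64,
    length: u64,
}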

I have not fully thought through how this proposal interacts with the concept of a VirtualChunkContainer. I guess we need to decide what we mean by "location" for data. For me, it's enough to know

This is what's needed to provide credentials. Everything else is part of the key and doesn't need to be exposed at this level.

paraseba commented 1 day ago

Moved this to PR #436 to make it easier to iterate and get feedback.