Open paraseba opened 2 days ago
Object level change detection (not serving a virtual chunk if the containing file changed)
I was thinking about this yesterday and planning to raise an issue about it. I think it would make the whole virtual references idea a lot more robust if icechunk could flag whenever the last modified timestamp on an object was later than the commit timestamp, because there is basically no scenario in which that is correct.
Be able to know all the "places" pointed by a repo
This would also be very useful, else widespread use of virtual references could lead to really complicated compound virtual datasets...
For large prefixes, the template + arguments creates some savings. For example, a template for virtual zarr like `some-long-bucket-name/with-some-even-longer-prefix/and-more-nesting/some-long-array-name-for-the-dataset/c/{}/{}/{}`.
FWIW VirtualiZarr could also do this for the in-memory manifest. But given that the in-memory size didn't seem to be a bottleneck in #401 then I don't think this is particularly important. The idea is also sort of related to https://github.com/zarr-developers/VirtualiZarr/issues/238#issuecomment-2386229261.
Having to declare the container before being able to write, is a bit inconvenient. An alternative would be to automatically create a container that includes only the bucket name as template.
I would very much like to keep the simplicity of `vds.virtualize.to_icechunk(store)` in virtualizarr if we can. This is totally an implementation detail that users won't want to (and shouldn't have to) care about. In order, I would prefer:
Maybe (3) could be accompanied by a warning above some arbitrary threshold? e.g. "You are about to create 2GB of virtual references - you might want to optimize the on-disk size by explicitly choosing a suitable prefix"
Great feedback @TomNicholas, thank you.
if icechunk could flag whenever the last modified timestamp on an object was later than the commit timestamp
Yes, with these changes to the format we can implement an interface that sets the last-modified timestamp either to the write time, or to anything the user passes. So even by default, we can get this "don't pull changed data" behavior.
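A minimal sketch of that behavior, with hypothetical names (this is not icechunk's actual API): serve a virtual chunk only when the object's last-modified time is not later than the timestamp recorded with the reference.

```python
from datetime import datetime, timezone

def is_virtual_chunk_valid(recorded_ts: int, object_last_modified: datetime) -> bool:
    """Return False when the object changed after the recorded write time.

    recorded_ts: unix epoch (seconds) stored alongside the virtual chunk ref.
    object_last_modified: Last-Modified reported by the object store.
    """
    return int(object_last_modified.timestamp()) <= recorded_ts

# Modified before the recorded time: safe to serve.
print(is_virtual_chunk_valid(
    1_700_000_000, datetime.fromtimestamp(1_699_999_000, tz=timezone.utc)))  # True
# Modified after: the bytes may no longer match what was committed.
print(is_virtual_chunk_valid(
    1_700_000_000, datetime.fromtimestamp(1_700_005_000, tz=timezone.utc)))  # False
```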
1. Just do it the inefficient way by default,
2. OR auto-choose a performant prefix based on the current snapshot,
3. OR expose something extra to users.
The issue with doing nothing, is not as much the increase in manifest size, but requirement number 4. We want to know all the "places" we depend on, so we need to "extract" some form of prefix at least. I think your second option is doable, by building some way to generate the containers as we go. It's not going to be optimal, but it should work. Think something like a container per bucket.
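The "container per bucket" idea could be sketched like this (hypothetical helper, not icechunk code): when no container was declared, fall back to one derived from the chunk URL's scheme and bucket.

```python
from urllib.parse import urlparse

def fallback_container(chunk_url: str) -> str:
    """Derive a catch-all container: just scheme + bucket.

    Not optimal for manifest size, but enough to satisfy the
    'know all the places we depend on' requirement.
    """
    parts = urlparse(chunk_url)
    return f"{parts.scheme}://{parts.netloc}"

print(fallback_container("s3://some-bucket/some-prefix/c/0/0/1"))  # s3://some-bucket
```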
In general, I think this is mostly about setting up the repository config well. If that's done, then while adding virtual references it should be doable to match the right container. I'll think more about it.
I realize I don't like storing the `Config` with the strategy we use for branches, because it requires a `list` + a `get`. Even if the first of those is concurrent with fetching the refs, or a snapshot, it's not ideal. We could use the new PUT if-match functionality in S3, but this means this would be the first place where we update files in place. More thought needed.
There's a lot to unpack here. (Meta point: I'm finding it hard to "review" a GitHub issue and am missing a more Google-docs-like experience. Maybe for the next one of these, we just use a google doc? Or make the design doc itself a doc in the repo and review it via PR?)
Anyway, here are a few thoughts, not particularly organized
I'm very :-1: on creating a new mechanism for storing "config" outside of the existing snapshot system, particularly if it involves transactional updates. I would much rather create a new type of metadata file that is referenced from the snapshot. I'd like the entire state of the store to be self contained within a snapshot.
What if we created a manifest of every single virtual file that is referenced in the repo? How big could this be? It's hard to imagine more than, say, 1_000_000 virtual files. But if so, we could split it into multiple manifests, as we do with chunks.
This would bring some big advantages.
```rust
pub struct VirtualChunkRef {
    pub virtual_file_id: VirtualFileID,
    pub offset: ChunkOffset,
    pub length: ChunkLength,
}
```
And some disadvantages:
Basically I'm suggesting we normalize our schema and factor out virtual files into an entity of their own.
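As an illustration of that normalization (a Python sketch with stand-in types, not the actual Rust implementation): distinct virtual-file URLs are interned once in a virtual-file manifest, and each chunk ref carries only a compact file id plus offset and length.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class VirtualChunkRef:
    virtual_file_id: int  # stand-in for VirtualFileID
    offset: int
    length: int

@dataclass
class VirtualFileManifest:
    files: dict[int, str] = field(default_factory=dict)    # id -> URL
    _by_url: dict[str, int] = field(default_factory=dict)  # reverse index

    def intern(self, url: str) -> int:
        """Store each distinct URL once and hand out a compact id."""
        if url not in self._by_url:
            new_id = len(self.files)
            self.files[new_id] = url
            self._by_url[url] = new_id
        return self._by_url[url]

manifest = VirtualFileManifest()
refs = [
    VirtualChunkRef(manifest.intern("s3://b/file.nc"), off, 1000)
    for off in range(0, 10_000, 1000)
]
# 10 chunk refs, but the URL is stored a single time.
print(len(refs), len(manifest.files))  # 10 1
```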
I have not fully thought through how this proposal interacts with the concept of a `VirtualChunkContainer`. I guess we need to decide on what we mean by "location" for data. For me, it's enough to know:

This is what's needed to provide credentials. Everything else is part of the key and doesn't need to be exposed at this level.
Moved this to a PR #436 to make it easier to iterate and get feedback
This doc has been moved to https://github.com/earth-mover/icechunk/pull/436
## What we need
## Discussion
### Chunk URL styles

- `s3://some-bucket/some-prefix/some-file.nc` is pointed to by 10k chunks.
- `s3://some-bucket/some-prefix/c/0/0/1` is pointed to by only 1 chunk, but there are many objects like it.

### Prefixes

Storing the full URL for each chunk duplicates a lot of data in the manifest, making it large. The idea is to try to extract common prefixes and identify them with a short id. Then the manifest only needs to identify the prefix id + maybe an extra string + length + offset.

In the first URL style above, the prefix could be the full file, `s3://some-bucket/some-prefix/some-file.nc`. In the second example, there are options for the prefix: it could be `s3://some-bucket/some-prefix/c/0/0/1`, or `s3://some-bucket/some-prefix/c/0/0`, or `s3://some-bucket/some-prefix/c/0`, or `s3://some-bucket/some-prefix/c`, with different numbers of chunks under it in each case.

Some intelligence is needed to decide what counts as a good prefix. If done badly, we could end up with:

- `s3://some-bucket/some-prefix/c/0/0/1` as the prefix in the second example above, or
- `s3://some-bucket` as the prefix, so each chunk still needs to add something like `some-prefix/c/0/0`.
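The prefix scheme above can be sketched as an encode/decode pair over a hypothetical prefix table (the ids and prefixes here are made up for the example):

```python
PREFIXES = {0: "s3://some-bucket/some-prefix/c"}  # id -> prefix (hypothetical table)

def encode(url: str) -> tuple[int, str]:
    """Replace the longest matching prefix with its id; store only the suffix."""
    matches = [(pid, p) for pid, p in PREFIXES.items() if url.startswith(p)]
    pid, p = max(matches, key=lambda item: len(item[1]))
    return pid, url[len(p):]

def decode(pid: int, suffix: str) -> str:
    """Rebuild the full URL from the manifest entry."""
    return PREFIXES[pid] + suffix

pid, suffix = encode("s3://some-bucket/some-prefix/c/0/0/1")
print(pid, suffix)          # 0 /0/0/1
print(decode(pid, suffix))  # s3://some-bucket/some-prefix/c/0/0/1
```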
### Should prefixes be per snapshot or global?

The advantage of per snapshot is that we don't need to load/parse all the prefixes if the current snapshot uses only a few. A case for this would be a virtual dataset that has been materialized in a newer version: if prefixes are per snapshot, we wouldn't need to load any prefixes.

On the other hand, requirement 4 asks for the full list of "dependencies" for the dataset, and that means for all versions, which requires a global list of prefixes. Without a global list, we would need to fetch the list of prefixes of each snapshot.
### Who creates/decides the prefixes
Obviously, generating the list of prefixes given the list of virtual chunks is doable automatically. But it's a tricky optimization problem. One strategy could be to generate a trie with all chunk urls, and then use some heuristic to select prefixes using heavily populated nodes. It's tricky and potentially computationally expensive for large repos.
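A toy stand-in for that heuristic, assuming a fixed path depth instead of a real trie (the function and its defaults are hypothetical):

```python
from collections import Counter

def pick_prefixes(urls: list[str], depth: int = 5, min_chunks: int = 2) -> list[str]:
    """Count chunks under each candidate prefix (the URL truncated to
    `depth` path segments) and keep prefixes shared by enough chunks."""
    counts = Counter("/".join(u.split("/")[:depth]) for u in urls)
    return [prefix for prefix, n in counts.items() if n >= min_chunks]

urls = [f"s3://bucket/prefix/c/{i}/0/0" for i in range(10)]
print(pick_prefixes(urls))  # ['s3://bucket/prefix/c']
```

A real implementation would walk a trie of all chunk URLs and trade off prefix length against how many chunks each node covers; this fixed-depth version only shows the shape of the problem.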
An alternative is to let the user dictate the prefixes. This way, they can optimize them knowing beforehand what virtual chunks they will add. If they don't optimize, they get large manifests without prefix extraction. We still need to satisfy requirements 2 and 4, which would require automatic extraction of at least the bucket and platform.
Additionally, the optimization problem can be treated as an extra process (similar to compaction or garbage collection), and run offline on demand.
### Templates instead of prefixes

Using templates would be more efficient. Example:

- `s3://some-bucket/some-prefix/1-analysis-output-for-model-abc.nc`
- `s3://some-bucket/some-prefix/2-analysis-output-for-model-abc.nc`

If we use the template `s3://some-bucket/some-prefix/{}-analysis-output-for-model-abc.nc`, each chunk only needs a couple of extra characters to fill the placeholders in the template.
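Template expansion as described, filling `{}` placeholders left to right (a sketch; the real encoding is up to the implementation):

```python
def expand(template: str, args: list[str]) -> str:
    """Fill `{}` placeholders left to right with the given arguments."""
    out = template
    for a in args:
        out = out.replace("{}", a, 1)
    return out

t = "s3://some-bucket/some-prefix/{}-analysis-output-for-model-abc.nc"
print(expand(t, ["1"]))
# s3://some-bucket/some-prefix/1-analysis-output-for-model-abc.nc
```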
### Supporting multiple clouds

When the user adds a virtual ref, they need to indicate the details of the platform used to retrieve it. These include what cloud it is (so we know how to set up credentials), region, endpoint URL, port, and other settings. It looks a lot like our `S3Config` type (or an expanded version of it).

### Supporting multiple credentials
Credentials are a read time concern. When the user writes the virtual chunk, they don't necessarily know what credentials set will be used to read it.
Credentials need to be specified later, when the chunks are going to be read. For example, by indicating in runtime configuration a set of credentials for each virtual chunk bucket/prefix/template.
## Design
### Explicit containers for virtual chunks

`VirtualChunkContainers`
### Global config

There is a `Config` object. This object is serialized using a strategy similar to branches, where new files are created using put-if-not-exists. There is a single configuration per repo, and all reads and writes are done against the latest configuration. In the initial implementation, the "history" of configurations is kept only for reference and manual user inspection.

The `Config` object can act as a configuration default when users don't override it. But different fields of the configuration may have different behaviors.

### New virtual chunk ref
We change from

To

The `last_modified` field contains an optional unix epoch in seconds. If set for a chunk, retrieval will fail if the file was modified after this time. We use this instead of an ETag or checksum to save space in the manifests. A `u32` should be enough until roughly year 2100.
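A quick sanity check on the `u32` headroom claim: the largest `u32` unix epoch actually lands in 2106, so "until year ~2100" is on the safe side.

```python
from datetime import datetime, timezone

U32_MAX = 2**32 - 1  # largest value a u32 can hold
latest = datetime.fromtimestamp(U32_MAX, tz=timezone.utc)
print(latest.year)  # 2106
```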
### Reading virtual chunks

- Users can provide credentials for each `VirtualChunkContainer`, pointing to them by name.
- Default `VirtualChunkContainer` credentials can be set, and will be used for any container not specifically configured.
- Otherwise, the `Storage` credentials will be used (if the platforms are the same).
- To read a virtual chunk, its `VirtualChunkContainer` is resolved from configuration.
- The container template is expanded using the `template_args` in the virtual chunks. If there are not enough template arguments, or some are None, they are replaced by the default template arguments in the container. If there are too many, the last ones are ignored.
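The argument-resolution rules in that last point, as a hypothetical helper: missing or `None` entries fall back to the container defaults, and extra trailing entries are ignored.

```python
from typing import Optional

def resolve_args(chunk_args: list[Optional[str]], defaults: list[str]) -> list[str]:
    """One resolved value per placeholder; `defaults` is assumed to hold
    one default argument per placeholder in the container template."""
    resolved = []
    for i, default in enumerate(defaults):
        if i < len(chunk_args) and chunk_args[i] is not None:
            resolved.append(chunk_args[i])
        else:
            resolved.append(default)  # missing or None -> container default
    return resolved  # anything past len(defaults) is ignored

defaults = ["model-abc", "0"]
print(resolve_args(["model-xyz"], defaults))         # ['model-xyz', '0']
print(resolve_args([None, "7", "extra"], defaults))  # ['model-abc', '7']
```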
### Writing virtual chunks

- Before writing virtual refs, users declare their `VirtualChunkContainers`.
- We store in each `VirtualChunkRef` the index for the container and the template expansion arguments. An error is produced if the container is not found.
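Write-time matching could look roughly like this (a regex-based sketch with made-up names; not necessarily how icechunk resolves containers): find the container whose template matches the chunk URL, and extract the placeholder values as the template args.

```python
import re

def match_container(url: str, containers: list[str]) -> tuple[int, list[str]]:
    """Return (container index, template args) for the first matching
    container template, or raise if none matches."""
    for idx, template in enumerate(containers):
        # Turn `{}` placeholders into capture groups for one path segment.
        pattern = "^" + re.escape(template).replace(r"\{\}", "([^/]+)") + "$"
        m = re.match(pattern, url)
        if m:
            return idx, list(m.groups())
    raise ValueError(f"no virtual chunk container matches {url!r}")

containers = ["s3://some-bucket/some-prefix/c/{}/{}/{}"]
idx, args = match_container("s3://some-bucket/some-prefix/c/0/0/1", containers)
print(idx, args)  # 0 ['0', '0', '1']
```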
### Check requirements are satisfied

Now the virtual chunk containers tell us what the platform is, so we can create the appropriate object store client at read time. These clients are cached for performance.
Now the user initializes a Repo instance by passing, if needed, a credential set for each virtual chunk container. If a chunk is read for which there are no credentials, we default to a common set of credentials for virtual chunks, and ultimately, to the Storage credentials if on the same platform.
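The fallback chain just described, as a sketch with hypothetical names (credentials are plain strings here for illustration):

```python
from typing import Optional

def credentials_for(container: str,
                    per_container: dict[str, str],
                    default_virtual: Optional[str],
                    storage_creds: str,
                    same_platform: bool) -> str:
    """Per-container credentials first, then the common virtual-chunk set,
    then the Storage credentials if the platforms match."""
    if container in per_container:
        return per_container[container]
    if default_virtual is not None:
        return default_virtual
    if same_platform:
        return storage_creds
    raise KeyError(f"no credentials available for container {container!r}")

print(credentials_for("noaa", {"noaa": "cred-A"}, None, "cred-S", True))  # cred-A
print(credentials_for("other", {}, None, "cred-S", True))                # cred-S
```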
For large prefixes, the template + arguments creates some savings. For example, a template for virtual zarr like `some-long-bucket-name/with-some-even-longer-prefix/and-more-nesting/some-long-array-name-for-the-dataset/c/{}/{}/{}`. On the other hand, the template arguments vector has a 24 character overhead, on top of the values themselves, so it's not a lot of savings for smaller prefixes.

This can now be retrieved directly from the repository config. It's basically the set of virtual chunk containers (or at least a subset of them).
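A back-of-the-envelope for that trade-off (a hypothetical helper; character counts stand in for serialized sizes): savings per chunk are roughly the prefix length minus the argument characters and the 24-character vector overhead.

```python
def per_chunk_savings(prefix_len: int, args_chars: int, overhead: int = 24) -> int:
    """Approximate characters saved per chunk by storing template args
    instead of the full URL (24-character args-vector overhead, per the doc)."""
    return prefix_len - (args_chars + overhead)

long_prefix = ("some-long-bucket-name/with-some-even-longer-prefix/"
               "and-more-nesting/some-long-array-name-for-the-dataset/c/")
print(per_chunk_savings(len(long_prefix), args_chars=6))   # large positive savings
short_prefix = "b/p/c/"
print(per_chunk_savings(len(short_prefix), args_chars=6))  # negative: not worth it
```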
We optionally store the last-modified timestamp of the object containing each chunk. This is not optimal: a hash would be better, but it takes more space. It is also not ideal to store it per chunk, when many chunks come from the same object, but we have no other easy place to store it if they use containers that include more than one object. We could add more configuration to optimize this in the future.