Variable-length chunks - Githubissues

TomNicholas commented 1 week ago

I would like to be able to use zarr + virtualizarr + icechunk with variable-length chunks - see https://github.com/zarr-developers/zarr-specs/issues/138.

I'm thinking about what changes in the stack would be required to get this to work - definitely in zarr-python, but also presumably icechunk needs to be able to return chunks of variable size, and the icechunk spec has to accommodate that generality? Then if it's in the icechunk spec does it also need to be in the zarr spec too?

cc @abarciauskas-bgse @sharkinsspatial

rabernat commented 1 week ago

I agree that this is a super important topic that we should figure out how to solve. It has been discussed for a long time, and there are various prototypes and ideas out there, but it hasn't really advanced.

I'd encourage us to start from first principles and actually design variable chunking from the ground up, starting not from the various existing specs and tools but from a set of requirements and first principles. Then we can thank about the right way to implement it.

For example, the existing spec conversations about this stalled on the question of scale. Is it feasible to store the chunk sizes in the metadata? That depends...how many chunks are we expected to store? It's fine for 100 chunks. Probably not for 100_000_000. Does the solution need to scale to accommodate arbitrarily large arrays? What tradeoffs are we willing to accept? E.g. can we accept increased latency in exchange for variable length chunks? What about writing? What's the process for updating existing variable-length-chunked datasets? There are many, many more questions we could ask. (@paraseba is very good at enumerating these types of design questions.)

In summary, what I think is needed to move forward is a set of meticulously documented use cases and technical requirements.

TomNicholas commented 1 week ago

Does the solution need to scale to accommodate arbitrarily large arrays?

Yeah this is the key question, from which everything else should follow.

In summary, what I think is needed to move forward is a set of meticulously documented use cases and technical requirements.

I'll start: There's one here that's representative of the output of many HPC fluid simulation codes: https://github.com/zarr-developers/VirtualiZarr/issues/217

dcherian commented 1 week ago

Another extremely common one is virtual references to netCDF files, which have daily frequency data and a single month of data in a single netCDF file.

abarciauskas-bgse commented 1 week ago

@dcherian I've heard this example before but not sure what datasets are structured this way - can you provide an example?

rabernat commented 1 week ago

can you provide an example?

Most of these files fit the bill: https://nsf-ncar-era5.s3.amazonaws.com/index.html#e5.oper.an.sfc/

The files are partitioned by month and chunked in such a way that the time-concatenated chunks would be uneven.

earth-mover / icechunk

Variable-length chunks #386