earth-mover / icechunk

Open-source, cloud-native transactional tensor storage engine
https://icechunk.io
Apache License 2.0

Consider storing hashing functions for ranges of chunks instead of chunk references in metadata files #322

Open shoyer opened 2 weeks ago

shoyer commented 2 weeks ago
As I understand it, Icechunk currently stores references to chunks with something like the following:

| Chunk index | Reference |
|---|---|
| 0 | 3abc21 |
| 1 | c14fe1 |
| 2 | 537c21 |

This will struggle to scale for very large datasets like those in ARCO-ERA5, which use tens of millions of chunks.
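
To make the scaling concern concrete, here is a minimal sketch of the per-chunk model (not Icechunk's actual layout; the `chunks/` prefix and names are hypothetical):

```python
# Hypothetical sketch of the per-chunk manifest model: one entry per
# chunk, so metadata size grows linearly with the number of chunks.
per_chunk_manifest: dict[int, str] = {
    0: "3abc21",
    1: "c14fe1",
    2: "537c21",
    # ...one row per chunk: tens of millions for a dataset like ARCO-ERA5
}

def chunk_key(chunk_id: int) -> str:
    """Resolve a chunk to its object-store key with a direct lookup."""
    return f"chunks/{per_chunk_manifest[chunk_id]}"
```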

Instead, what about storing records of writes with corresponding ranges of chunks, e.g.,

| Chunk start | Chunk stop | Version |
|---|---|---|
| 0 | 10000 | v1 |
| 10000 | 11000 | v2 |

Then raw chunks could be stored at /{version}/{chunk_id} or /{version}/{hash(chunk_id)}. This would keep metadata files quite small, so they could still be loaded into memory even for very large stores.
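
A minimal sketch of how a reader could resolve chunks against such a range table, assuming rows are kept sorted and non-overlapping (the names and path scheme are illustrative, not Icechunk's API):

```python
import bisect

# Illustrative range-based manifest: one row per write, not per chunk.
# Rows are (chunk_start, chunk_stop, version), sorted, non-overlapping,
# with chunk_stop exclusive.
ranges = [
    (0, 10_000, "v1"),
    (10_000, 11_000, "v2"),
]
starts = [start for start, _, _ in ranges]

def resolve_chunk(chunk_id: int) -> str:
    """Find the covering row by binary search, build /{version}/{chunk_id}."""
    i = bisect.bisect_right(starts, chunk_id) - 1
    start, stop, version = ranges[i]
    assert start <= chunk_id < stop, "chunk id not covered by any write"
    return f"/{version}/{chunk_id}"

assert resolve_chunk(9_999) == "/v1/9999"
assert resolve_chunk(10_500) == "/v2/10500"
```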

I think this would work well for the typical update patterns we encounter for weather/climate data, where we have O(1) writes per day, each of which updates a contiguous range of chunks, e.g., "append one weather forecast" or "update the last few days of data with additional quality controls".
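
Under those patterns, each commit only has to add or trim a single row. A sketch under the same assumptions as above (illustrative names, sorted non-overlapping rows):

```python
def append_write(ranges, n_new_chunks, version):
    """'Append one forecast': a single new row covering the appended chunks."""
    last_stop = ranges[-1][1] if ranges else 0
    return ranges + [(last_stop, last_stop + n_new_chunks, version)]

def overwrite_tail(ranges, from_chunk, version):
    """'Update the last few days': trim overlapping rows, add one new row."""
    end = ranges[-1][1]
    kept = []
    for start, stop, ver in ranges:
        if stop <= from_chunk:
            kept.append((start, stop, ver))        # row untouched by the update
        elif start < from_chunk:
            kept.append((start, from_chunk, ver))  # keep only the prefix
    return kept + [(from_chunk, end, version)]

ranges = [(0, 10_000, "v1")]
ranges = append_write(ranges, 1_000, "v2")     # adds (10000, 11000, "v2")
ranges = overwrite_tail(ranges, 10_500, "v3")  # splits the tail at 10500
```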

paraseba commented 1 week ago

@shoyer I really like this approach, and I'd encourage people to continue this line of thinking. But there are a few details I want to bring up.

The complex scenario for writes is when you have distributed writers writing concurrently in multiple different sessions. This would be a scenario where two or more users start different distributed writes using, for example, the same parent snapshot_id.

In this scenario, we cannot let the written chunks overwrite each other; each writer needs the guarantee that their chunks will be safe in object storage before the commit starts. So, as a minimum, we would need to give each writer a "safe" prefix (or hash function, encoding, etc.) to put their chunks under. Then, at commit time, some of those chunks will need to be pointed to from the manifest, and we don't know in advance which writer writes which chunks, so we need to store that in the manifest.
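
A minimal sketch of that idea, assuming each session mints a collision-free prefix (hypothetical names, not Icechunk's API):

```python
import uuid

def new_writer_prefix() -> str:
    """Issue a unique prefix per writer per session, so concurrent
    writers sharing a parent snapshot can never clobber each other."""
    return uuid.uuid4().hex[:8]

def staged_chunk_key(writer_prefix: str, coord: str) -> str:
    """Chunks are staged under the writer's prefix, e.g. /3f9a2c1b/0/0/1."""
    return f"/{writer_prefix}/{coord}"

prefix = new_writer_prefix()
key = staged_chunk_key(prefix, "0/0/1")  # safe until commit pins it in the manifest
```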

One trivial way to do that would be to make the manifest a table like this (I'm skipping over several details, like the node id, for clarity):

| Chunk coord | Writer id (or prefix, or hash) |
|---|---|
| 0/0/0 | abcd |
| 0/0/1 | ef01 |
| 0/0/2 | 2345 |

That is in fact a small optimization, because a writer id would need fewer bytes than a chunk id. But it is not a huge optimization: we still need to list every chunk.
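
In code, that variant still stores one manifest row per chunk; the only saving is the shorter value (a sketch using the example values from the table above):

```python
# One row per chunk, but the value is a short writer prefix rather than
# a full chunk id; the object key is rebuilt as /{prefix}/{coord}.
manifest = {
    "0/0/0": "abcd",
    "0/0/1": "ef01",
    "0/0/2": "2345",
}

def resolve_coord(coord: str) -> str:
    return f"/{manifest[coord]}/{coord}"

assert resolve_coord("0/0/1") == "/ef01/0/0/1"
```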

Another approach would be to create an efficient way to encode the list of chunk coordinates each writer wrote. With that we would go from one row per chunk to one row per chunk range. This is trivial for certain patterns of distributed writes, but not for the general case.
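
One way to picture that encoding (a sketch, assuming chunk coordinates are linearized to a row-major index): run-length compress the writer's chunk ids, which collapses contiguous writes to one row but degrades toward one row per chunk for scattered writes.

```python
def encode_ranges(chunk_ids):
    """Run-length encode sorted chunk ids into half-open [start, stop) ranges."""
    ranges: list[tuple[int, int]] = []
    for cid in sorted(chunk_ids):
        if ranges and ranges[-1][1] == cid:
            ranges[-1] = (ranges[-1][0], cid + 1)  # extend the current run
        else:
            ranges.append((cid, cid + 1))          # start a new run
    return ranges

assert encode_ranges([0, 1, 2, 3, 10, 11]) == [(0, 4), (10, 12)]  # contiguous: 2 rows
assert encode_ranges([0, 5, 9]) == [(0, 1), (5, 6), (9, 10)]      # scattered: 1 row/chunk
```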

I wanted to add two extra pieces of context.