earth-mover / icechunk

Open-source, cloud-native transactional tensor storage engine
https://icechunk.io
Apache License 2.0

Consider storing hashing functions for ranges of chunks instead of chunk references in metadata files #322

Open shoyer opened 2 weeks ago

shoyer commented 2 weeks ago
As I understand it, Icechunk currently stores references to chunks with something like the following:

| Chunk index | Reference |
|---|---|
| 0 | 3abc21 |
| 1 | c14fe1 |
| 2 | 537c21 |

This will struggle to scale for very large datasets like those in ARCO-ERA5, which use tens of millions of chunks.
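
To make the scaling concern concrete, here is a minimal sketch of the per-chunk model (not Icechunk's actual layout; the `chunks/` prefix and names are hypothetical):

```python
# Hypothetical sketch of the per-chunk manifest model: one entry per
# chunk, so metadata size grows linearly with the number of chunks.
per_chunk_manifest: dict[int, str] = {
    0: "3abc21",
    1: "c14fe1",
    2: "537c21",
    # ...one row per chunk: tens of millions for a dataset like ARCO-ERA5
}

def chunk_key(chunk_id: int) -> str:
    """Resolve a chunk to its object-store key with a direct lookup."""
    return f"chunks/{per_chunk_manifest[chunk_id]}"
```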

Instead, what about storing records of writes with corresponding ranges of chunks, e.g.,

| Chunk start | Chunk stop | Version |
|---|---|---|
| 0 | 10000 | v1 |
| 10000 | 11000 | v2 |

Then raw chunks could be stored at /{version}/{chunk_id} or /{version}/{hash(chunk_id)}. This would keep metadata files quite small, so they could still be loaded into memory even for very large stores.
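
A minimal sketch of how a reader could resolve chunks against such a range table, assuming rows are kept sorted and non-overlapping (the names and path scheme are illustrative, not Icechunk's API):

```python
import bisect

# Illustrative range-based manifest: one row per write, not per chunk.
# Rows are (chunk_start, chunk_stop, version), sorted, non-overlapping,
# with chunk_stop exclusive.
ranges = [
    (0, 10_000, "v1"),
    (10_000, 11_000, "v2"),
]
starts = [start for start, _, _ in ranges]

def resolve_chunk(chunk_id: int) -> str:
    """Find the covering row by binary search, build /{version}/{chunk_id}."""
    i = bisect.bisect_right(starts, chunk_id) - 1
    start, stop, version = ranges[i]
    assert start <= chunk_id < stop, "chunk id not covered by any write"
    return f"/{version}/{chunk_id}"

assert resolve_chunk(9_999) == "/v1/9999"
assert resolve_chunk(10_500) == "/v2/10500"
```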

I think this would work well for the typical update patterns we encounter for weather/climate data, where we have O(1) writes per day, each of which updates a contiguous range of chunks, e.g., "append one weather forecast" or "update the last few days of data with additional quality controls".
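
Under those patterns, each commit only has to add or trim a single row. A sketch under the same assumptions as above (illustrative names, sorted non-overlapping rows):

```python
def append_write(ranges, n_new_chunks, version):
    """'Append one forecast': a single new row covering the appended chunks."""
    last_stop = ranges[-1][1] if ranges else 0
    return ranges + [(last_stop, last_stop + n_new_chunks, version)]

def overwrite_tail(ranges, from_chunk, version):
    """'Update the last few days': trim overlapping rows, add one new row."""
    end = ranges[-1][1]
    kept = []
    for start, stop, ver in ranges:
        if stop <= from_chunk:
            kept.append((start, stop, ver))        # row untouched by the update
        elif start < from_chunk:
            kept.append((start, from_chunk, ver))  # keep only the prefix
    return kept + [(from_chunk, end, version)]

ranges = [(0, 10_000, "v1")]
ranges = append_write(ranges, 1_000, "v2")     # adds (10000, 11000, "v2")
ranges = overwrite_tail(ranges, 10_500, "v3")  # splits the tail at 10500
```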

paraseba commented 1 week ago

@shoyer I really like this approach, and I'd encourage people to continue this line of thinking. But there are a few details I want to bring up.

The complex scenario for writes is when you have distributed writers writing concurrently in multiple different sessions. This would be a scenario where two or more users start different distributed writes using, for example, the same parent snapshot_id.

In this scenario, we cannot let the written chunks overwrite each other; each writer needs the guarantee that their chunks will be safe in object storage before the commit starts. So, as a minimum, we would need to give each writer a "safe" prefix (or hash function, encoding, etc.) to put their chunks under. Then, at commit time, some of those chunks will need to be pointed to from the manifest, and we don't know in advance which writer writes which chunks, so we need to store that in the manifest.
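
A minimal sketch of that idea, assuming each session mints a collision-free prefix (hypothetical names, not Icechunk's API):

```python
import uuid

def new_writer_prefix() -> str:
    """Issue a unique prefix per writer per session, so concurrent
    writers sharing a parent snapshot can never clobber each other."""
    return uuid.uuid4().hex[:8]

def staged_chunk_key(writer_prefix: str, coord: str) -> str:
    """Chunks are staged under the writer's prefix, e.g. /3f9a2c1b/0/0/1."""
    return f"/{writer_prefix}/{coord}"

prefix = new_writer_prefix()
key = staged_chunk_key(prefix, "0/0/1")  # safe until commit pins it in the manifest
```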

One trivial way to do that would be to make the manifest a table like this (I'm skipping over several details, like the node id, for clarity):

| Chunk coord | Writer id (or prefix, or hash) |
|---|---|
| 0/0/0 | abcd |
| 0/0/1 | ef01 |
| 0/0/2 | 2345 |

That is in fact a small optimization, because a writer id would need fewer bytes than a chunk id. But it is not a huge optimization: we still need to list every chunk.
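
In code, that variant still stores one manifest row per chunk; the only saving is the shorter value (a sketch using the example values from the table above):

```python
# One row per chunk, but the value is a short writer prefix rather than
# a full chunk id; the object key is rebuilt as /{prefix}/{coord}.
manifest = {
    "0/0/0": "abcd",
    "0/0/1": "ef01",
    "0/0/2": "2345",
}

def resolve_coord(coord: str) -> str:
    return f"/{manifest[coord]}/{coord}"

assert resolve_coord("0/0/1") == "/ef01/0/0/1"
```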

Another approach would be to create an efficient way to encode the list of chunk coordinates each writer wrote. With that we would go from one row per chunk to one row per chunk range. This is trivial for certain patterns of distributed writes, but not for the general case.
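
One way to picture that encoding (a sketch, assuming chunk coordinates are linearized to a row-major index): run-length compress the writer's chunk ids, which collapses contiguous writes to one row but degrades toward one row per chunk for scattered writes.

```python
def encode_ranges(chunk_ids):
    """Run-length encode sorted chunk ids into half-open [start, stop) ranges."""
    ranges: list[tuple[int, int]] = []
    for cid in sorted(chunk_ids):
        if ranges and ranges[-1][1] == cid:
            ranges[-1] = (ranges[-1][0], cid + 1)  # extend the current run
        else:
            ranges.append((cid, cid + 1))          # start a new run
    return ranges

assert encode_ranges([0, 1, 2, 3, 10, 11]) == [(0, 4), (10, 12)]  # contiguous: 2 rows
assert encode_ranges([0, 5, 9]) == [(0, 1), (5, 6), (9, 10)]      # scattered: 1 row/chunk
```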

I wanted to add two extra pieces of context.