@shoyer I really like this approach, and I'd encourage people to continue this line of thinking. But there are a few details I want to bring up.
The complex scenario for writes is when you have distributed writers writing concurrently in multiple different sessions: two or more users starting different distributed writes from, for example, the same parent snapshot_id.
In this scenario, we cannot let the written chunks overwrite each other; each writer needs the guarantee that their chunks will be safe in object store before the commit starts. So, as a minimum, we would need to give each writer a "safe" prefix (or hash function, encoding, etc.) to put their chunks in. Then, at commit time, some of those chunks will need to be pointed to from the manifest, and since we don't know in advance which writer wrote which chunks, we need to store that in the manifest.
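To make that concrete, here is a minimal sketch of the per-writer prefix idea; the key layout, function names, and use of random prefixes are my assumptions, not a settled design:

```python
import uuid

def new_writer_prefix() -> str:
    # Hypothetical: each distributed writer draws a random prefix, so two
    # writers on the same parent snapshot can never overwrite each other's
    # chunk objects.
    return uuid.uuid4().hex[:8]

def chunk_key(writer_prefix: str, chunk_coord: str) -> str:
    # Chunks land under the writer's "safe" prefix in object store; the
    # commit later records which prefix holds each chunk coordinate.
    return f"chunks/{writer_prefix}/{chunk_coord}"

# Two concurrent distributed writes on the same parent snapshot:
w1, w2 = new_writer_prefix(), new_writer_prefix()
assert chunk_key(w1, "0/0/0") != chunk_key(w2, "0/0/0")
```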
One trivial way to do that would be to make the manifest a table like the following (I'm skipping over several details, like the node id, for clarity):
| Chunk coord | Writer id / prefix / hash |
| --- | --- |
| 0/0/0 | abcd |
| 0/0/1 | ef01 |
| 0/0/2 | 2345 |
That is in fact a small optimization, because a writer id would need fewer bytes than a chunk id. But it's not a huge optimization; we still need to list every chunk.
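As a toy illustration of how a reader would use such a manifest (the in-memory dict and key layout are assumptions, matching the sketch above):

```python
# One manifest row per chunk, as in the table above.
manifest = {
    "0/0/0": "abcd",
    "0/0/1": "ef01",
    "0/0/2": "2345",
}

def resolve_chunk(chunk_coord: str) -> str:
    # Join the manifest row with the assumed key layout to recover the
    # chunk's object-store key.
    writer_prefix = manifest[chunk_coord]
    return f"chunks/{writer_prefix}/{chunk_coord}"

print(resolve_chunk("0/0/1"))  # -> chunks/ef01/0/0/1
```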
Another approach would be to create an efficient way to encode the list of chunk coordinates that a writer wrote. With that we would go from one row per chunk to one row per chunk range. This is trivial for certain patterns of distributed writes, but not for the general case.
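For example, a range-encoded manifest might collapse a contiguous append into a single row; the tuple schema here is hypothetical and only handles ranges along one dimension:

```python
# One row per contiguous chunk range instead of one row per chunk.
# (writer prefix, first chunk index, last chunk index) -- assumed schema.
range_manifest = [
    ("abcd", 0, 999),      # writer abcd wrote chunk indices 0..999
    ("ef01", 1000, 1499),  # writer ef01 appended indices 1000..1499
]

def writer_for(chunk_index: int) -> str:
    # Trivial for this append-like pattern; the general multi-dimensional
    # case is much harder, as noted above.
    for prefix, start, stop in range_manifest:
        if start <= chunk_index <= stop:
            return prefix
    raise KeyError(chunk_index)

assert writer_for(1200) == "ef01"
```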
I wanted to add two extra pieces of context:

1. This will struggle to scale for very large datasets like those in ARCO-ERA5, which use 10s of millions of chunks.
2. Raw chunks could then be stored at `/{version}/{chunk_id}` or `/{version}/{hash(chunk_id)}` (a sketch of these two layouts follows below). This would keep metadata files quite small, so they could still be loaded into memory even for very large stores. I think this would work well for the typical update patterns we encounter for weather/climate data, where we have O(1) writes per day, which update contiguous ranges of chunks, e.g., "append one weather forecast" or "update the last few days of data with additional quality controls".
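A small sketch of the two key layouts in point 2; `hashlib.sha256` and the digest truncation are illustrative choices of mine, not part of the proposal:

```python
import hashlib

def chunk_path(version: str, chunk_id: str, hashed: bool = False) -> str:
    # /{version}/{chunk_id} keeps keys human-readable;
    # /{version}/{hash(chunk_id)} spreads keys evenly across the keyspace,
    # which some object stores handle better.
    if hashed:
        digest = hashlib.sha256(chunk_id.encode()).hexdigest()[:16]
        return f"/{version}/{digest}"
    return f"/{version}/{chunk_id}"

print(chunk_path("v42", "0/0/1"))               # /v42/0/0/1
print(chunk_path("v42", "0/0/1", hashed=True))  # /v42/<16-hex-char digest>
```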