Open LDeakin opened 1 month ago
Thanks for the feedback @LDeakin! I assume you're interested in sharding with this. Are there other use cases?
Icechunk has the ability to do sharding in a different way--by packing multiple chunks into the same object, but without Zarr really even knowing about it. This is also potentially more flexible, because the store can decide at runtime how to pack the chunks, or they can be repacked retroactively. I'm curious about the tradeoffs between this (currently unimplemented) approach to sharding and the current Zarr spec one.
TBH I have never really understood the whole "sharding as a codec" concept. I think it makes sense for sharding to be an implementation detail of the store.
As for chunk-level metadata like a checksum, with Icechunk we have the option of putting that in the chunk manifest rather than the chunk itself! This could be a lot more efficient to query.
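To make the store-side packing idea concrete, here is a minimal sketch of a manifest that maps chunk keys to a location inside a packed object, with room for per-chunk metadata such as a checksum. The names `ChunkRef`, `pack_chunks`, and `read_chunk` are hypothetical and this is not Icechunk's actual manifest format; it only illustrates how chunk size and checksum could be answered from the manifest without touching the chunk bytes.

```rust
use std::collections::HashMap;

/// Hypothetical manifest entry: where one chunk lives inside a packed object.
#[derive(Debug, Clone, PartialEq)]
struct ChunkRef {
    object: String,        // key of the packed object in storage
    offset: u64,           // byte offset of the chunk within that object
    length: u64,           // chunk size in bytes, available without reading the chunk
    checksum: Option<u32>, // optional per-chunk checksum kept outside the chunk
}

/// Pack chunks back to back into one object and record where each one landed.
fn pack_chunks(object: &str, chunks: &[(&str, &[u8])]) -> (Vec<u8>, HashMap<String, ChunkRef>) {
    let mut data = Vec::new();
    let mut manifest = HashMap::new();
    for (key, bytes) in chunks {
        let entry = ChunkRef {
            object: object.to_string(),
            offset: data.len() as u64,
            length: bytes.len() as u64,
            checksum: None, // left empty here; a real store could fill it at write time
        };
        data.extend_from_slice(bytes);
        manifest.insert(key.to_string(), entry);
    }
    (data, manifest)
}

/// Look a chunk up through the manifest and slice it out of the packed object.
fn read_chunk<'a>(
    data: &'a [u8],
    manifest: &HashMap<String, ChunkRef>,
    key: &str,
) -> Option<&'a [u8]> {
    let r = manifest.get(key)?;
    let start = r.offset as usize;
    let end = start.checked_add(r.length as usize)?;
    data.get(start..end)
}
```

Because the array reader only ever asks for a chunk key, repacking chunks into different objects would only require rewriting the manifest entries.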
> Icechunk has the ability to do sharding in a different way--by packing multiple chunks into the same object, but without Zarr really even knowing about it
When I first scanned over Icechunk, I wondered how it would work with a shard written incrementally (chunk-by-chunk). But that sounds much better. Delegating sharding-like functionality to Icechunk could give history at chunk granularity, and array producers/consumers would not need to concern themselves with shards 👍.
> I assume you're interested in sharding with this. Are there other use cases?
Not currently. But, a Zarr store either needs to support reading from the end of a value or querying its size (or ideally both) to support partial decoding with all current Zarr V3 codecs.
Size definitely can and should be implemented! It's already in the chunk manifest.
@LDeakin what's the issue with ByteRange(Bound::Included(42), Bound::Unbounded)
> @LDeakin what's the issue with ByteRange(Bound::Included(42), Bound::Unbounded)
That represents everything from byte 42 onwards, right? What I am after is the last 42 bytes, for example. I think I would need to know the size of the value to construct such a ByteRange with the current implementation.
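The difference between "from byte 42 onwards" and "the last 42 bytes" can be sketched as follows. `ByteRequest` and `resolve` are hypothetical names for illustration, not Icechunk's `ByteRange` type. The point is only that a pair of start-relative bounds cannot describe a suffix unless the caller already knows the size, whereas an explicit suffix variant lets the store resolve it.

```rust
use std::ops::{Bound, Range};

/// Hypothetical request type: start-relative bounds plus an explicit "last N bytes" form.
#[derive(Debug, Clone, Copy, PartialEq)]
enum ByteRequest {
    /// Start and end bounds measured from the beginning of the value.
    FromStart(Bound<u64>, Bound<u64>),
    /// The final `n` bytes, whatever the value's size turns out to be.
    Suffix(u64),
}

/// Resolve a request to a concrete half-open range once the store knows the size.
/// Out-of-range requests are clamped to the value rather than rejected.
fn resolve(request: ByteRequest, size: u64) -> Range<u64> {
    match request {
        ByteRequest::FromStart(start, end) => {
            let start = match start {
                Bound::Included(s) => s,
                Bound::Excluded(s) => s.saturating_add(1),
                Bound::Unbounded => 0,
            };
            let end = match end {
                Bound::Included(e) => e.saturating_add(1),
                Bound::Excluded(e) => e,
                Bound::Unbounded => size,
            };
            let end = end.min(size);
            start.min(end)..end
        }
        // Only the store needs the size here; the caller never has to query it.
        ByteRequest::Suffix(n) => size.saturating_sub(n)..size,
    }
}
```

For a 100-byte value, `FromStart(Bound::Included(42), Bound::Unbounded)` resolves to `42..100`, while `Suffix(42)` resolves to `58..100`.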
Note that many stores support requesting the last N bytes from an object. object_store supports it: object_store::GetRange::Suffix.
@LDeakin I'll change ByteRange to allow this type of query. Thank you for flagging this!
@LDeakin please take a look at https://github.com/earth-mover/icechunk/pull/285 . Hopefully you can use that, and we'll introduce access to the chunk size in a separate PR.
Looks good!
We have given Lachlan a way around this, but I'll keep the ticket open until we offer a way to retrieve the size of a chunk using the Store interface.
@LDeakin we have released 0.1.0-alpha.3 with this change and the new list_dir_items method. Hope it helps.
> @LDeakin we have released 0.1.0-alpha.3 with this change and the new list_dir_items method. Hope it helps.
Sure did! zarrs now supports icechunk stores: https://crates.io/crates/zarrs_icechunk
Unbelievable @LDeakin !
Related conversation in zarr-python https://github.com/zarr-developers/zarr-python/issues/2420
As far as I can see, retrieving trailing bytes (e.g. CRC32C checksum, shard index) from a chunk with Store::get or Store::get_partial_values is not possible (with the ByteRange abstraction) without knowing the size of a value.
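As an illustration of why trailing bytes matter, the sketch below splits a value into its payload and a 4-byte checksum stored at the end. The function name `split_trailing_checksum` is hypothetical, and the little-endian layout is an assumption made for the example. The code does not verify the checksum, it only shows that the interesting bytes sit at an offset that depends on the value's size.

```rust
/// Split a value into its payload and a trailing 4-byte checksum.
/// Assumes the checksum occupies the last four bytes in little-endian order.
/// Returns `None` when the value is too short to contain a checksum.
fn split_trailing_checksum(value: &[u8]) -> Option<(&[u8], u32)> {
    // The split point depends on the total length, which is exactly the
    // information a start-relative byte range cannot express on its own.
    let split = value.len().checked_sub(4)?;
    let (payload, tail) = value.split_at(split);
    let checksum = u32::from_le_bytes([tail[0], tail[1], tail[2], tail[3]]);
    Some((payload, checksum))
}
```

With a suffix-style request, a reader could fetch just those last four bytes instead of downloading the whole value first.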