earth-mover / icechunk

Open-source, cloud-native transactional tensor storage engine
https://icechunk.io
Apache License 2.0

Rust API: `Store` lacks a method for querying the size of values #277

Open LDeakin opened 1 month ago

LDeakin commented 1 month ago

As far as I can see, retrieving trailing bytes (e.g. CRC32C checksum, shard index) from a chunk with Store::get or Store::get_partial_values is not possible (with the ByteRange abstraction) without knowing the size of a value.

rabernat commented 1 month ago

Thanks for the feedback @LDeakin! I assume you're interested in sharding with this. Are there other use cases?

Icechunk has the ability to do sharding in a different way--by packing multiple chunks into the same object, but without Zarr really even knowing about it. This is also potentially more flexible, because the store can decide at runtime how to pack the chunks, or they can be repacked retroactively. I'm curious about the tradeoffs between this (currently unimplemented) approach to sharding and the current Zarr spec one.

TBH I have never really understood the whole "sharding as a codec" concept. I think it makes sense for sharding to be an implementation detail of the store.

As for chunk-level metadata like a checksum, with Icechunk we have the option of putting that in the chunk manifest rather than in the chunk itself! This could be a lot more efficient to query.

LDeakin commented 1 month ago

> Icechunk has the ability to do sharding in a different way--by packing multiple chunks into the same object, but without Zarr really even knowing about it

When I first scanned over Icechunk, I wondered how it would work with a shard written incrementally (chunk-by-chunk). But that sounds much better. Delegating sharding-like functionality to Icechunk could give history at chunk granularity, and array producers/consumers would not need to concern themselves with shards 👍.

> I assume you're interested in sharding with this. Are there other use cases?

Not currently. But, a Zarr store either needs to support reading from the end of a value or querying its size (or ideally both) to support partial decoding with all current Zarr V3 codecs.
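To illustrate why reading from the end matters, here is a minimal sketch of decoding a shard index from the trailing bytes of a shard. It assumes the common Zarr V3 sharding configuration: the index sits at the end of the shard, each inner chunk contributes a little-endian `(offset, nbytes)` pair of `u64` values, and a 4-byte CRC32C checksum follows the index. The names `IndexEntry`, `index_size`, and `decode_index` are illustrative and not part of zarrs or Icechunk, and the checksum is not verified here.

```rust
/// One entry of a shard index: where an inner chunk lives inside the shard.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct IndexEntry {
    pub offset: u64,
    pub nbytes: u64,
}

/// Size in bytes of the trailing index for `n_chunks` inner chunks,
/// assuming 16 bytes per entry plus a 4-byte CRC32C checksum.
pub fn index_size(n_chunks: usize) -> usize {
    n_chunks * 16 + 4
}

/// Read a little-endian u64 from the first 8 bytes of `bytes`.
fn read_u64_le(bytes: &[u8]) -> u64 {
    let mut buf = [0u8; 8];
    buf.copy_from_slice(&bytes[..8]);
    u64::from_le_bytes(buf)
}

/// Decode the index from the last `index_size(n_chunks)` bytes of a shard.
/// Returns `None` if `tail` has the wrong length. An entry whose offset and
/// nbytes are both `u64::MAX` marks a missing chunk and decodes to `None`.
/// The trailing checksum is skipped, not verified, in this sketch.
pub fn decode_index(tail: &[u8], n_chunks: usize) -> Option<Vec<Option<IndexEntry>>> {
    if tail.len() != index_size(n_chunks) {
        return None;
    }
    let entries = tail[..n_chunks * 16]
        .chunks_exact(16)
        .map(|pair| {
            let offset = read_u64_le(&pair[..8]);
            let nbytes = read_u64_le(&pair[8..]);
            if offset == u64::MAX && nbytes == u64::MAX {
                None
            } else {
                Some(IndexEntry { offset, nbytes })
            }
        })
        .collect();
    Some(entries)
}
```

A reader knows `n_chunks` from the array metadata, so it can compute `index_size(n_chunks)` up front. What it cannot do without a suffix request or a size query is locate where those bytes start.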

rabernat commented 1 month ago

Size definitely can and should be implemented! It's already in the chunk manifest.
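As a rough sketch of what a manifest-backed size query could look like: the `Manifest` and `ChunkRef` types below are hypothetical stand-ins, not Icechunk's actual data structures, and the key format is only an example. The point is that once a length is recorded per chunk, both the size and a "last N bytes" range can be answered from metadata alone.

```rust
use std::collections::HashMap;
use std::ops::Range;

/// Hypothetical manifest entry; the field names are illustrative only.
#[derive(Debug, Clone, Copy)]
pub struct ChunkRef {
    pub offset: u64,
    pub length: u64,
}

/// Hypothetical in-memory manifest mapping chunk keys to their references.
#[derive(Debug, Default)]
pub struct Manifest {
    chunks: HashMap<String, ChunkRef>,
}

impl Manifest {
    pub fn new() -> Self {
        Manifest { chunks: HashMap::new() }
    }

    /// Record where a chunk lives and how long it is.
    pub fn insert(&mut self, key: &str, offset: u64, length: u64) {
        self.chunks.insert(key.to_string(), ChunkRef { offset, length });
    }

    /// Size of a chunk, answered from metadata without reading the chunk.
    pub fn size_of(&self, key: &str) -> Option<u64> {
        self.chunks.get(key).map(|c| c.length)
    }

    /// Translate "the last `n` bytes" into a range relative to the chunk start.
    /// A request longer than the chunk is clamped to the whole chunk.
    pub fn suffix_range(&self, key: &str, n: u64) -> Option<Range<u64>> {
        self.size_of(key).map(|len| len.saturating_sub(n)..len)
    }
}
```

With a 1024-byte chunk registered under some key, `suffix_range(key, 36)` yields `988..1024`, which an ordinary bounded range request can then fetch.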

paraseba commented 1 month ago

@LDeakin what's the issue with ByteRange(Bound::Included(42), Bound::Unbounded)?

LDeakin commented 1 month ago

> @LDeakin what's the issue with ByteRange(Bound::Included(42), Bound::Unbounded)

That represents the 42nd byte onwards, right? What I am after is the last 42 bytes, for example. I think I would need to know the size of the value to construct such a ByteRange with the current implementation.

Note that many stores support requesting the last N bytes from an object. object_store supports it: object_store::GetRange::Suffix.
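For illustration, a range type with a suffix variant might look like the sketch below. `RangeRequest` is a hypothetical type loosely modelled on the shape of `object_store::GetRange`; it is not Icechunk's `ByteRange`, and it resolves against an in-memory slice rather than a real store.

```rust
use std::ops::Range;

/// Hypothetical byte-range request (not Icechunk's `ByteRange`).
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum RangeRequest {
    /// Bytes `start..end`, end exclusive.
    Bounded(Range<usize>),
    /// Everything from `start` to the end of the value.
    From(usize),
    /// The last `n` bytes of the value.
    Suffix(usize),
}

impl RangeRequest {
    /// Resolve the request against a value of `len` bytes.
    /// Returns `None` if the request starts past the end of the value
    /// or if a bounded range has `start > end`.
    pub fn resolve(&self, len: usize) -> Option<Range<usize>> {
        match self {
            RangeRequest::Bounded(r) if r.start <= r.end && r.start <= len => {
                // Clamp the end so a too-long range reads up to the end.
                Some(r.start..r.end.min(len))
            }
            RangeRequest::Bounded(_) => None,
            RangeRequest::From(start) if *start <= len => Some(*start..len),
            RangeRequest::From(_) => None,
            // A suffix longer than the value is clamped to the whole value.
            RangeRequest::Suffix(n) => Some(len.saturating_sub(*n)..len),
        }
    }
}

/// Apply a request to an in-memory value.
pub fn get_range<'a>(value: &'a [u8], req: &RangeRequest) -> Option<&'a [u8]> {
    req.resolve(value.len()).map(|r| &value[r])
}
```

The caller of `Suffix` never needs the value's length; only the store, which resolves the request, does. That is the property the `Bound`-based range lacks.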

paraseba commented 1 month ago

@LDeakin I'll change ByteRange to allow this type of query. Thank you for flagging this!

paraseba commented 1 month ago

@LDeakin please take a look at https://github.com/earth-mover/icechunk/pull/285 . Hopefully you can use that, and we'll introduce access to the chunk size in a separate PR.

LDeakin commented 1 month ago

Looks good!

paraseba commented 1 month ago

We have given Lachlan a way around this, but I'll keep the ticket open until we offer a way to retrieve the size of a chunk using the Store interface.

paraseba commented 1 month ago

@LDeakin we have released 0.1.0-alpha.3 with this change and the new list_dir_items method. Hope it helps.

LDeakin commented 1 month ago

> @LDeakin we have released 0.1.0-alpha.3 with this change and the new list_dir_items method. Hope it helps.

Sure did! zarrs now supports icechunk stores: https://crates.io/crates/zarrs_icechunk

paraseba commented 1 month ago

Unbelievable @LDeakin !

paraseba commented 1 month ago

Related conversation in zarr-python https://github.com/zarr-developers/zarr-python/issues/2420