Closed LDeakin closed 1 month ago
Attention: Patch coverage is 85.78135% with 424 lines in your changes missing coverage. Please review.
Project coverage is 81.33%. Comparing base (d54b89d) to head (71b9d78).
Resolves #21.
This is a substantial change that adds support for variable length data types to `zarrs`. There were some breaking changes necessary to support this:
- `ArrayBytes`, which can represent fixed or variable length bytes, rather than just a slice-like
- `Element[Owned]` traits, with better validation
- `RawBytes`
- `ArrayBytes` and `RawBytes`
Data types
Codecs
`vlen`
Based on https://github.com/zarr-developers/zeps/pull/47#issuecomment-1710505141.
Structure:
- `uint64` representing the size in bytes of the encoded index,
- `index_codecs`,
- `data_codecs`.

The encoded index size is necessary to support index compression and partial decoding. If this were not available, the index could not use a bytes-to-bytes compression codec. A bytes-to-bytes compression codec could follow `vlen`, but then "data" is potentially running through a compression codec twice.
`vlen_v2`
This matches Zarr V2 style interleaved encoding, which is implemented by numcodecs `vlen-utf8`, `vlen-bytes`, and `vlen-array`. These are all essentially the same codec, with data type-dependent behaviour. It makes sense to standardise a single codec for Zarr V3 to support Zarr V2 `vlen-utf8/bytes/array` encoded data without reencoding chunks.
Encoding Efficiency (32-bit index)
Sum of chunk sizes (in bytes) on "city" column of https://github.com/zarr-developers/zarr-python/pull/2036#issuecomment-2227440951.
https://github.com/LDeakin/zarrs/blob/variable_length_data_types/tests/cities.rs.
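To make the size comparison easier to reason about, here is a minimal sketch of the two layouts described above, written without any compression. It is not the `zarrs` implementation. The function names `encode_vlen_sketch` and `encode_vlen_v2_sketch` are hypothetical, and so are the index representation used for `vlen` (little-endian `u32` offsets with one more entry than there are elements) and the assumption that `index_codecs` and `data_codecs` are identity codecs. The interleaved layout follows the numcodecs convention of a 4-byte item count followed by a 4-byte length before each item.

```rust
/// Hypothetical sketch of the `vlen` layout: a u64 holding the size of the
/// encoded index, then the encoded index, then the encoded data.
/// Assumptions: identity index/data codecs and a little-endian u32 offset
/// index with `items.len() + 1` entries.
fn encode_vlen_sketch(items: &[&[u8]]) -> Vec<u8> {
    let mut index = Vec::with_capacity((items.len() + 1) * 4);
    let mut data = Vec::new();
    let mut offset: u32 = 0;
    // The first offset is always zero.
    index.extend_from_slice(&offset.to_le_bytes());
    for item in items {
        data.extend_from_slice(item);
        offset += item.len() as u32;
        index.extend_from_slice(&offset.to_le_bytes());
    }
    let mut out = Vec::with_capacity(8 + index.len() + data.len());
    // The size prefix lets a decoder locate the index without reading the data.
    out.extend_from_slice(&(index.len() as u64).to_le_bytes());
    out.extend_from_slice(&index);
    out.extend_from_slice(&data);
    out
}

/// Sketch of Zarr V2 style interleaved encoding: a u32 item count, then a
/// (u32 length, bytes) pair per item, all little-endian.
fn encode_vlen_v2_sketch(items: &[&[u8]]) -> Vec<u8> {
    let mut out = Vec::new();
    out.extend_from_slice(&(items.len() as u32).to_le_bytes());
    for item in items {
        out.extend_from_slice(&(item.len() as u32).to_le_bytes());
        out.extend_from_slice(item);
    }
    out
}
```

For the three byte strings "Oslo", "Lima", and "Canberra", these sketches produce 40 bytes and 32 bytes respectively before any compression. The separate index costs a few extra bytes in this toy case, but it sits in one contiguous block that could be compressed or partially decoded on its own. These toy numbers are not the measured results for the "city" column.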