LDeakin / zarrs

A Rust library for the Zarr storage format for multidimensional arrays and metadata
Apache License 2.0

Variable length data types #40

Closed LDeakin closed 1 month ago

LDeakin commented 2 months ago

Resolves #21.

This is a substantial change that adds support for variable length data types to zarrs. Some breaking changes were necessary to support this:

Data types

Codecs

vlen

{
  "name": "vlen",
  "configuration": {
    "data_codecs": [{"name": "bytes"},{"name": "blosc","configuration": {"cname": "zstd", "clevel":5,"shuffle": "bitshuffle", "typesize":1,"blocksize":0}}],
    "index_codecs": [{"name": "bytes","configuration": { "endian": "little" }},{"name": "blosc","configuration":{"cname": "zstd", "clevel":5,"shuffle": "shuffle", "typesize":4,"blocksize":0}}],
    "index_data_type": "uint32"
  }
}

Based on https://github.com/zarr-developers/zeps/pull/47#issuecomment-1710505141.

Structure:

The encoded index size is necessary to support index compression and partial decoding. Without it, the index could not use a bytes-to-bytes compression codec. A bytes-to-bytes compression codec could instead follow vlen, but then the "data" would potentially run through a compression codec twice.
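To illustrate why a stored index size matters, here is a minimal sketch. It assumes a hypothetical framing of `[index size: u64 little-endian][encoded index][encoded data]`; the helper names `frame` and `split` and the exact layout are assumptions for illustration, not necessarily what zarrs implements. Once the index has been through a compression codec its length is no longer predictable from the element count, so a decoder needs the stored size to find where the index ends and the data begins.

```rust
/// Hypothetical framing (an assumption, not confirmed as the zarrs layout):
/// [index_len: u64 little-endian][encoded index][encoded data]
fn frame(encoded_index: &[u8], encoded_data: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(8 + encoded_index.len() + encoded_data.len());
    // Record the size of the (possibly compressed) index up front.
    out.extend_from_slice(&(encoded_index.len() as u64).to_le_bytes());
    out.extend_from_slice(encoded_index);
    out.extend_from_slice(encoded_data);
    out
}

/// Split a framed chunk back into (encoded index, encoded data).
/// Returns `None` if the chunk is truncated or the header is inconsistent.
fn split(chunk: &[u8]) -> Option<(&[u8], &[u8])> {
    let header: [u8; 8] = chunk.get(..8)?.try_into().ok()?;
    let index_len = usize::try_from(u64::from_le_bytes(header)).ok()?;
    let end = 8usize.checked_add(index_len)?;
    // Only the header and the index need to be read to locate elements,
    // which is what makes partial decoding of the data possible.
    let index = chunk.get(8..end)?;
    let data = chunk.get(end..)?;
    Some((index, data))
}
```

With this framing, the index and the data can each be passed through their own codec chain independently, as the `index_codecs` and `data_codecs` fields in the configuration suggest.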

vlen_v2

{
  "name": "vlen_v2"
}

This matches Zarr V2 style interleaved encoding, which is implemented by numcodecs vlen-utf8, vlen-bytes, and vlen-array. These are all essentially the same codec, with data type-dependent behaviour. It makes sense to standardise a single codec for Zarr V3 to support Zarr V2 vlen-utf8/bytes/array encoded data without reencoding chunks.
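As a rough sketch of what "interleaved" means here: to my understanding, the numcodecs vlen codecs write a 4-byte little-endian item count followed, for each item, by a 4-byte little-endian length and then the item bytes. The function names below are hypothetical and this is not the zarrs implementation, only an illustration of that layout.

```rust
/// Encode items as: [item count: u32 LE], then per item [length: u32 LE][bytes].
/// Returns `None` if a count or length does not fit in a u32.
fn encode_interleaved(items: &[&[u8]]) -> Option<Vec<u8>> {
    let mut out = Vec::new();
    out.extend_from_slice(&u32::try_from(items.len()).ok()?.to_le_bytes());
    for item in items {
        out.extend_from_slice(&u32::try_from(item.len()).ok()?.to_le_bytes());
        out.extend_from_slice(item);
    }
    Some(out)
}

/// Read a little-endian u32 at `pos`, or `None` if out of bounds.
fn read_u32(bytes: &[u8], pos: usize) -> Option<u32> {
    let end = pos.checked_add(4)?;
    let raw: [u8; 4] = bytes.get(pos..end)?.try_into().ok()?;
    Some(u32::from_le_bytes(raw))
}

/// Decode the interleaved layout, borrowing each item from the input.
/// Returns `None` on truncated or inconsistent input.
fn decode_interleaved(bytes: &[u8]) -> Option<Vec<&[u8]>> {
    let count = read_u32(bytes, 0)? as usize;
    let mut pos = 4;
    let mut items = Vec::new();
    for _ in 0..count {
        let len = read_u32(bytes, pos)? as usize;
        pos += 4;
        let end = pos.checked_add(len)?;
        items.push(bytes.get(pos..end)?);
        pos = end;
    }
    Some(items)
}
```

Because lengths and data alternate, locating element `i` requires walking every preceding length, whereas a layout with a separate index allows the index to be read and compressed on its own.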

Encoding Efficiency (32-bit index)

Sum of chunk sizes (in bytes) on "city" column of https://github.com/zarr-developers/zarr-python/pull/2036#issuecomment-2227440951.

https://github.com/LDeakin/zarrs/blob/variable_length_data_types/tests/cities.rs.

| encoding | compression | size (bytes) |
|----------|-------------|--------------|
| vlen_v2  |             | 642196       |
| vlen_v2  | zstd 5      | 362626       |
| vlen     |             | 642580       |
| vlen     | zstd 5      | 346950       |

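For a quick comparison of the sizes reported above, the arithmetic can be sketched as follows. The constant and function names are hypothetical; the numbers are copied from the results above.

```rust
// Sizes in bytes, copied from the results above.
const VLEN_V2_RAW: u64 = 642_196;
const VLEN_V2_ZSTD: u64 = 362_626;
const VLEN_RAW: u64 = 642_580;
const VLEN_ZSTD: u64 = 346_950;

/// Percentage by which `new` is smaller than `old`.
fn percent_smaller(old: u64, new: u64) -> f64 {
    (old as f64 - new as f64) / old as f64 * 100.0
}

/// Compression ratio: uncompressed size divided by compressed size.
fn ratio(raw: u64, compressed: u64) -> f64 {
    raw as f64 / compressed as f64
}
```

Uncompressed, vlen is 384 bytes larger than vlen_v2 (`VLEN_RAW - VLEN_V2_RAW`). With zstd level 5, `percent_smaller(VLEN_V2_ZSTD, VLEN_ZSTD)` comes out at roughly 4.3%, so vlen is the smaller of the two once compressed on this dataset.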
codecov[bot] commented 2 months ago

Codecov Report

Attention: Patch coverage is 85.78135% with 424 lines in your changes missing coverage. Please review.

Project coverage is 81.33%. Comparing base (d54b89d) to head (71b9d78).

| Files | Patch % | Lines |
|-------|---------|-------|
| src/array/element.rs | 65.41% | 46 Missing |
| ...rray_to_bytes/sharding/sharding_partial_decoder.rs | 88.28% | 39 Missing |
| src/array/array_sync_sharded_readable_ext.rs | 63.95% | 31 Missing |
| ...en_interleaved/vlen_interleaved_partial_decoder.rs | 53.03% | 31 Missing |
| src/array/codec/array_to_bytes/vlen.rs | 71.13% | 28 Missing |
| src/array/array_bytes.rs | 93.63% | 25 Missing |
| ...o_bytes/vlen_interleaved/vlen_interleaved_codec.rs | 78.72% | 20 Missing |
| src/array/array_representation.rs | 56.75% | 16 Missing |
| src/array/codec/array_to_bytes/vlen_interleaved.rs | 68.00% | 16 Missing |
| src/array/array_async_readable_writable.rs | 68.08% | 15 Missing |
| ... and 27 more | | |

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main      #40      +/-   ##
==========================================
+ Coverage   79.56%   81.33%   +1.76%
==========================================
  Files         142      152      +10
  Lines       19544    20837    +1293
==========================================
+ Hits        15550    16947    +1397
+ Misses       3994     3890     -104
```
