JuliaIO / Zarr.jl

Other
119 stars 23 forks source link

refactor to use transcodingstreams.jl API ? #135

Open bjarthur opened 7 months ago

bjarthur commented 7 months ago

i see in src/compressors.jl that support for zlib exists. have you considered refactoring zarr.jl to access this compressor via the transcodingstream API instead? then you could get lz4, zstd, etc. too for free.

i ask, because blosc1.jl is not thread safe (in my hands anyway), and i'd like to read from multiple zarr chunks simultaneously. codecZstd.jl is thread safe i believe, and i'd like to switch my zarr data set to use it. i'm not sure about codecZlib.jl.

note that currently transcodingstreams.jl does not support blosc, however there is an issue suggesting it do so, with code in a link: https://github.com/JuliaIO/Blosc.jl/issues/79. so zarr.jl wouldn't have to drop support for blosc1 were that issue followed through.

i actually attempted to add zstd myself to zarr.jl this morning through a package extension directly using codecZstd. first started refactoring the existing blosc and zlib code to use extensions, but then realized that structs defined in extension modules cannot be exported. so the user would not have access to e.g. BloscCompressor. but i think a refactoring to use transcodingstreams.jl wouldn't even need an extension. i'm filing this issue because i want to get a zarr.jl dev's opinion before proceeding further.

nhz2 commented 7 months ago

The underlying blosc c library can be used in a thread safe way using the blosc_compress_ctx and blosc_decompress_ctx functions. That's what I am using in my experimental zarr implementation. https://github.com/medyan-dev/SmallZarrGroups.jl/blob/b68316868caa543c3170d64d9828a072989acd8f/src/compression.jl#L19

meggart commented 6 months ago

see in src/compressors.jl that support for zlib exists. have you considered refactoring zarr.jl to access this compressor via the transcodingstream API instead? then you could get lz4, zstd, etc. too for free.

I have considered this in the very beginning, but as you mention the main road blocker for this was the missing blosc support and blosc was what was used by almost all zarr dataset I was working with. A re-factoring towards transcodingstream api would be great and could lead to speedups as we could already start decompressing while the data is downloaded. This should be made easier now that we rely on Channels to pass data down the chain from storage -> decompression -> dump into output.

Regarding using blosc1 context api: I have experimented with this myself but never got this working in multiple threads. For some reason I was still running into frequent segmentation faults despite being sure to only use ctx versions of the c api. Probably I was doing something awfully wrong and it would be interesting to look at the implementation in SmallZarrGroups.jl