milankl closed this 2 days ago
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 86.94%. Comparing base (ccd60eb) to head (761217e). Report is 8 commits behind head on master.
I'm not an expert on this package, but you seem to be correct. https://github.com/JuliaIO/JLD2.jl/blob/ccd60ebc7ea68d0461f062fc3b1a639df28e82e9/src/compression.jl#L160
Manual test:
julia> using JLD2, CodecZstd
julia> A = zeros(1000,1000);
julia> A[1] = rand()
0.44335952762563824
julia> sizeof(A)/1000^2 # 8 MB array
8.0
julia> save("test_without_compression.jld2", "A", A)
julia> save("test_with_compression.jld2", "A", A, compress=ZstdCompressor())
julia> A == load("test_with_compression.jld2", "A")
true
julia> A == load("test_without_compression.jld2", "A")
true
and I then have two files on disk:
-rw-r--r--@ 1 milan staff 1.1K 13 May 12:59 test_with_compression.jld2
-rw-r--r--@ 1 milan staff 7.6M 13 May 12:58 test_without_compression.jld2
one uncompressed at about 8 MB and one compressed down to about 1 KB.
You should test if you can load the file via HDF5.jl.
using HDF5
using H5Zzstd
h5open("test_with_compression.jld2") do h5f
h5f["A"][]
end
Okay, that's not working correctly:
julia> h5open("test_with_compression.jld2") do h5f
h5f["A"][]
end
1000×1000 Matrix{Float64}:
0.0 0.0 … 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
7.41892e-310 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1.95e-321 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
7.41892e-310 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1.42e-321 0.0 … 0.0 0.0 0.0 0.0 0.0 0.0 0.0
NaN 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
NaN 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
7.41892e-310 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3.6e-322 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
⋮ ⋱ ⋮
0.0 4.0e-323 … 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 5.17085e170 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 9.56977e-315 … 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 1.11254e-308 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 1.11255e-308 0.0 0.0 0.0 0.0 0.0 0.0 0.0
(it should be A[1] == 0.44335952762563824 and every other entry zero), even though the file without zstd compression is loaded correctly. Why is that?
edit: The same works perfectly fine with compress=ZlibCompressor(), but I'm also confused that the decompressed matrix is different every time it's decompressed...
So the problem is that we are using ZSTD_compressStream and ZSTD_decompressStream here, while the upstream reference uses ZSTD_compress and ZSTD_decompress:
https://github.com/HDFGroup/hdf5_plugins/blob/master/ZSTD/src/H5Zzstd.c
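For context (my reading of the zstd frame format in RFC 8878, not something stated in this thread): ZSTD_compress writes a single frame whose header records the decompressed size, which the reference filter reads back via ZSTD_getFrameContentSize, whereas ZSTD_compressStream by default emits a frame with an unknown content size. A quick way to inspect what JLD2 actually wrote is to look at the frame header of the compressed chunk; the helper below is a sketch, and its name is made up:

```julia
# Sketch: does a zstd frame record its decompressed size in the header?
# Per RFC 8878, a frame starts with the magic 0xFD2FB528 (little-endian
# on disk), followed by the Frame_Header_Descriptor byte. The content
# size is present when the FCS-field-size flag (bits 6-7) is nonzero or
# the Single_Segment_flag (bit 5) is set.
function frame_has_content_size(bytes::AbstractVector{UInt8})
    length(bytes) >= 5 || return false
    bytes[1:4] == UInt8[0x28, 0xb5, 0x2f, 0xfd] || return false
    fhd = bytes[5]                  # Frame_Header_Descriptor
    fcs_flag = fhd >> 6             # Frame_Content_Size field size flag
    single_segment = (fhd >> 5) & 0x01
    return fcs_flag != 0 || single_segment == 1
end
```

Applied to the bytes at the chunk address that h5ls reports for the dataset, this would distinguish a one-shot frame from a streamed one.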
OK, I'm going to start working on a ZstdFrameCompressor / ZstdFrameDecompressor. I'm not sure if that is the final name yet, but it's basically the non-streaming compressor / decompressor.
I'm not completely sure how to make that work in the TranscodingStream API, but we'll see.
Awesome, let me know if I can help!
@mkitti I've already written the tests for CodecZstd here using ZstdFrameCompressor; however, they are commented out until https://github.com/JuliaIO/CodecZstd.jl/pull/46 is merged and released.
I've set the compressor here from ZstdFrameCompressor to ZstdCompressor, but we need to release a new version of CodecZstd.jl to actually test this here properly: https://github.com/JuliaIO/CodecZstd.jl/pull/52#issuecomment-2140406751
I think it should still be ZstdFrameCompressor. Sorry for the confusion. I'll work on a release tonight.
Note that CodecZstd.jl now contains ZstdFrameCompressor, so there is no dependency block here anymore.
Hi, what is the status on this?
Locally, a round trip of ZstdFrameCompressor and ZstdDecompressor works just fine within JLD2, but not via HDF5, e.g.
julia> using JLD2, CodecZstd
julia> A = zeros(1000,1000);
julia> A[1] = rand()
0.44335952762563824
julia> sizeof(A)/1000^2 # 8 MB array
8.0
julia> save("test_without_compression.jld2", "A", A)
julia> save("test_with_compression.jld2", "A", A, compress=ZstdFrameCompressor())
julia> A == load("test_with_compression.jld2", "A")
true
julia> A == load("test_without_compression.jld2", "A")
true
With h5dump:
HDF5 "test_with_compression.jld2" {
GROUP "/" {
DATASET "A" {
DATATYPE H5T_IEEE_F64LE
DATASPACE SIMPLE { ( 1000, 1000 ) / ( 1000, 1000 ) }
DATA {h5dump error: unable to print data
}
}
}
}
or
julia> h5open("test_with_compression.jld2") do h5f
h5f["A"][]
end
┌ Error: Exception while generating log record in module H5Zzstd at /home/jisensee/.julia/packages/H5Zzstd/ad081/src/H5Zzstd.jl:86
│ exception =
│ UndefVarError: `err` not defined
│ Stacktrace:
│ [1] logging_error(logger::Any, level::Any, _module::Any, group::Any, id::Any, filepath::Any, line::Any, err::Any, real::Bool)
│ @ Base.CoreLogging ./logging.jl:478
│ [2] #invokelatest#2
│ @ ./essentials.jl:892 [inlined]
│ [3] invokelatest
│ @ ./essentials.jl:889 [inlined]
│ [4] macro expansion
│ @ ./logging.jl:364 [inlined]
│ [5] H5Z_filter_zstd(flags::UInt32, cd_nelmts::UInt64, cd_values::Ptr{UInt32}, nbytes::UInt64, buf_size::Ptr{UInt64}, buf::Ptr{Ptr{Nothing}})
│ @ H5Zzstd ~/.julia/packages/H5Zzstd/ad081/src/H5Zzstd.jl:86
│ [6] h5d_read(dataset_id::HDF5.Dataset, mem_type_id::HDF5.Datatype, mem_space_id::HDF5.Dataspace, file_space_id::HDF5.Dataspace, xfer_plist_id::HDF5.DatasetTransferProperties, buf::Matrix{Float64})
│ @ HDF5.API ~/.julia/packages/HDF5/Z859u/src/api/functions.jl:796
│ [7] _generic_read(::HDF5.Dataset, ::HDF5.Datatype, ::Type{Float64}, ::Nothing)
│ @ HDF5 ~/.julia/packages/HDF5/Z859u/src/readwrite.jl:180
│ [8] generic_read(::HDF5.Dataset, ::HDF5.Datatype, ::Type{Float64})
│ @ HDF5 ~/.julia/packages/HDF5/Z859u/src/readwrite.jl:146
│ [9] getindex(::HDF5.Dataset)
│ @ HDF5 ~/.julia/packages/HDF5/Z859u/src/readwrite.jl:59
│ [10] (::var"#5#6")(h5f::HDF5.File)
│ @ Main ./REPL[6]:2
│ [11] (::HDF5.var"#17#18"{HDF5.HDF5Context, @Kwargs{}, var"#5#6", HDF5.File})()
│ @ HDF5 ~/.julia/packages/HDF5/Z859u/src/file.jl:101
│ [12] task_local_storage(body::HDF5.var"#17#18"{HDF5.HDF5Context, @Kwargs{}, var"#5#6", HDF5.File}, key::Symbol, val::HDF5.HDF5Context)
│ @ Base ./task.jl:297
│ [13] h5open(f::var"#5#6", args::String; context::HDF5.HDF5Context, pv::@Kwargs{})
│ @ HDF5 ~/.julia/packages/HDF5/Z859u/src/file.jl:96
│ [14] h5open(f::Function, args::String)
│ @ HDF5 ~/.julia/packages/HDF5/Z859u/src/file.jl:94
│ [15] top-level scope
│ @ REPL[6]:1
│ [16] eval
│ @ ./boot.jl:385 [inlined]
│ [17] eval_user_input(ast::Any, backend::REPL.REPLBackend, mod::Module)
│ @ REPL ~/.julia/juliaup/julia-1.10.2+0.x64.linux.gnu/share/julia/stdlib/v1.10/REPL/src/REPL.jl:150
│ [18] repl_backend_loop(backend::REPL.REPLBackend, get_module::Function)
│ @ REPL ~/.julia/juliaup/julia-1.10.2+0.x64.linux.gnu/share/julia/stdlib/v1.10/REPL/src/REPL.jl:246
│ [19] start_repl_backend(backend::REPL.REPLBackend, consumer::Any; get_module::Function)
│ @ REPL ~/.julia/juliaup/julia-1.10.2+0.x64.linux.gnu/share/julia/stdlib/v1.10/REPL/src/REPL.jl:231
│ [20] run_repl(repl::REPL.AbstractREPL, consumer::Any; backend_on_current_task::Bool, backend::Any)
│ @ REPL ~/.julia/juliaup/julia-1.10.2+0.x64.linux.gnu/share/julia/stdlib/v1.10/REPL/src/REPL.jl:389
│ [21] run_repl(repl::REPL.AbstractREPL, consumer::Any)
│ @ REPL ~/.julia/juliaup/julia-1.10.2+0.x64.linux.gnu/share/julia/stdlib/v1.10/REPL/src/REPL.jl:375
│ [22] (::Base.var"#1013#1015"{Bool, Bool, Bool})(REPL::Module)
│ @ Base ./client.jl:432
│ [23] #invokelatest#2
│ @ ./essentials.jl:892 [inlined]
│ [24] invokelatest
│ @ ./essentials.jl:889 [inlined]
│ [25] run_main_repl(interactive::Bool, quiet::Bool, banner::Bool, history_file::Bool, color_set::Bool)
│ @ Base ./client.jl:416
│ [26] exec_options(opts::Base.JLOptions)
│ @ Base ./client.jl:333
│ [27] _start()
│ @ Base ./client.jl:552
└ @ H5Zzstd ~/.julia/packages/H5Zzstd/ad081/src/H5Zzstd.jl:86
24-element Vector{Base.StackTraces.StackFrame}:
H5Z_filter_zstd(flags::UInt32, cd_nelmts::UInt64, cd_values::Ptr{UInt32}, nbytes::UInt64, buf_size::Ptr{UInt64}, buf::Ptr{Ptr{Nothing}}) at H5Zzstd.jl:40
h5d_read(dataset_id::HDF5.Dataset, mem_type_id::HDF5.Datatype, mem_space_id::HDF5.Dataspace, file_space_id::HDF5.Dataspace, xfer_plist_id::HDF5.DatasetTransferProperties, buf::Matrix{Float64}) at functions.jl:796
_generic_read(::HDF5.Dataset, ::HDF5.Datatype, ::Type{Float64}, ::Nothing) at readwrite.jl:180
generic_read(::HDF5.Dataset, ::HDF5.Datatype, ::Type{Float64}) at readwrite.jl:146
getindex(::HDF5.Dataset) at readwrite.jl:59
(::var"#5#6")(h5f::HDF5.File) at REPL[6]:2
(::HDF5.var"#17#18"{HDF5.HDF5Context, Base.Pairs{Symbol, Union{}, Tuple{}, @NamedTuple{}}, var"#5#6", HDF5.File})() at file.jl:101
task_local_storage(body::HDF5.var"#17#18"{HDF5.HDF5Context, Base.Pairs{Symbol, Union{}, Tuple{}, @NamedTuple{}}, var"#5#6", HDF5.File}, key::Symbol, val::HDF5.HDF5Context) at task.jl:297
h5open(f::var"#5#6", args::String; context::HDF5.HDF5Context, pv::Base.Pairs{Symbol, Union{}, Tuple{}, @NamedTuple{}}) at file.jl:96
h5open(f::Function, args::String) at file.jl:94
top-level scope at REPL[6]:1
eval at boot.jl:385 [inlined]
eval_user_input(ast::Any, backend::REPL.REPLBackend, mod::Module) at REPL.jl:150
repl_backend_loop(backend::REPL.REPLBackend, get_module::Function) at REPL.jl:246
start_repl_backend(backend::REPL.REPLBackend, consumer::Any; get_module::Function) at REPL.jl:231
kwcall(::NamedTuple, ::typeof(REPL.start_repl_backend), backend::REPL.REPLBackend, consumer::Any) at REPL.jl:228
run_repl(repl::REPL.AbstractREPL, consumer::Any; backend_on_current_task::Bool, backend::Any) at REPL.jl:389
run_repl(repl::REPL.AbstractREPL, consumer::Any) at REPL.jl:375
(::Base.var"#1013#1015"{Bool, Bool, Bool})(REPL::Module) at client.jl:432
#invokelatest#2 at essentials.jl:892 [inlined]
invokelatest at essentials.jl:889 [inlined]
run_main_repl(interactive::Bool, quiet::Bool, banner::Bool, history_file::Bool, color_set::Bool) at client.jl:416
exec_options(opts::Base.JLOptions) at client.jl:333
_start() at client.jl:552
ERROR: HDF5.API.H5Error: Error reading dataset /A
libhdf5 Stacktrace:
[1] H5Z_pipeline: Data filters/Read failed
filter returned failure during read
⋮
Stacktrace:
[1] _generic_read(::HDF5.Dataset, ::HDF5.Datatype, ::Type{Float64}, ::Nothing)
@ HDF5 ~/.julia/packages/HDF5/Z859u/src/readwrite.jl:209
[2] generic_read(::HDF5.Dataset, ::HDF5.Datatype, ::Type{Float64})
@ HDF5 ~/.julia/packages/HDF5/Z859u/src/readwrite.jl:146
[3] getindex(::HDF5.Dataset)
@ HDF5 ~/.julia/packages/HDF5/Z859u/src/readwrite.jl:59
[4] (::var"#5#6")(h5f::HDF5.File)
@ Main ./REPL[6]:2
[5] (::HDF5.var"#17#18"{HDF5.HDF5Context, @Kwargs{}, var"#5#6", HDF5.File})()
@ HDF5 ~/.julia/packages/HDF5/Z859u/src/file.jl:101
[6] task_local_storage(body::HDF5.var"#17#18"{…}, key::Symbol, val::HDF5.HDF5Context)
@ Base ./task.jl:297
[7] h5open(f::var"#5#6", args::String; context::HDF5.HDF5Context, pv::@Kwargs{})
@ HDF5 ~/.julia/packages/HDF5/Z859u/src/file.jl:96
[8] h5open(f::Function, args::String)
@ HDF5 ~/.julia/packages/HDF5/Z859u/src/file.jl:94
[9] top-level scope
@ REPL[6]:1
caused by: HDF5.API.H5Error: Error reading dataset /A
libhdf5 Stacktrace:
[1] H5Z_pipeline: Data filters/Read failed
filter returned failure during read
⋮
Stacktrace:
[1] macro expansion
@ ~/.julia/packages/HDF5/Z859u/src/api/error.jl:18 [inlined]
[2] h5d_read(dataset_id::HDF5.Dataset, mem_type_id::HDF5.Datatype, mem_space_id::HDF5.Dataspace, file_space_id::HDF5.Dataspace, xfer_plist_id::HDF5.DatasetTransferProperties, buf::Matrix{…})
@ HDF5.API ~/.julia/packages/HDF5/Z859u/src/api/functions.jl:800
[3] _generic_read(::HDF5.Dataset, ::HDF5.Datatype, ::Type{Float64}, ::Nothing)
@ HDF5 ~/.julia/packages/HDF5/Z859u/src/readwrite.jl:180
[4] generic_read(::HDF5.Dataset, ::HDF5.Datatype, ::Type{Float64})
@ HDF5 ~/.julia/packages/HDF5/Z859u/src/readwrite.jl:146
[5] getindex(::HDF5.Dataset)
@ HDF5 ~/.julia/packages/HDF5/Z859u/src/readwrite.jl:59
[6] (::var"#5#6")(h5f::HDF5.File)
@ Main ./REPL[6]:2
[7] (::HDF5.var"#17#18"{HDF5.HDF5Context, @Kwargs{}, var"#5#6", HDF5.File})()
@ HDF5 ~/.julia/packages/HDF5/Z859u/src/file.jl:101
[8] task_local_storage(body::HDF5.var"#17#18"{…}, key::Symbol, val::HDF5.HDF5Context)
@ Base ./task.jl:297
[9] h5open(f::var"#5#6", args::String; context::HDF5.HDF5Context, pv::@Kwargs{})
@ HDF5 ~/.julia/packages/HDF5/Z859u/src/file.jl:96
[10] h5open(f::Function, args::String)
@ HDF5 ~/.julia/packages/HDF5/Z859u/src/file.jl:94
[11] top-level scope
@ REPL[6]:1
Some type information was truncated. Use `show(err)` to see complete types.
This should work now:
julia> using H5Zzstd, HDF5
julia> h5open("test_with_compression.jld2") do h5f
h5f["A"][]
end
julia> using HDF5_jll
julia> run(`$(HDF5_jll.h5ls()) -va test_with_compression.jld2`);
Opened "test_with_compression.jld2" with sec2 driver.
A Dataset {1000/1000, 1000/1000}
Location: 1:48
Links: 1
Chunks: {1000, 1000} 8000000 bytes
Storage: 8000000 logical bytes, 283 allocated bytes, 2826855.12% utilization
Filter-0: ZSTD-32015 {5}
Type: native double
Address: 189
Flags Bytes Address Logical Offset
========== ======== ========== ==============================
0x00000000 283 189 [0, 0, 0]
Thank you! I can reproduce locally.
Thanks for your work :)
@mkitti do you want to include a compatibility constraint on CodecZstd v0.8.3? Probably not, since you suggested the extension? Otherwise, this seems to be ready to be merged?
Can this even be done, given that CodecZstd isn't even a dependency or weak dep?
Can this even be done, given that CodecZstd isn't even a dependency or weak dep?
riiiiiight, forgot that. Any reason then not to merge this?
To be honest, I do not really understand how the compressors are being loaded here.
I think that we should declare CodecZlib, CodecBzip2, and now CodecZstd as weak dependencies with compat ranges. Eventually, it would be good to consider how to take advantage of package extensions.
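Concretely, that might look something like the following Project.toml fragment (purely illustrative, not JLD2's actual file; the UUIDs would be copied from the General registry, and the 0.8.3 lower bound is the CodecZstd release mentioned above that introduced ZstdFrameCompressor):

```toml
[weakdeps]
CodecZlib = "<UUID from the General registry>"
CodecBzip2 = "<UUID from the General registry>"
CodecZstd = "<UUID from the General registry>"

[compat]
CodecZstd = "0.8.3"
```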
Over in https://github.com/JuliaIO/HDF5.jl/pull/1160, I am working on converting all the HDF5 filter packages into package extensions.
To be honest, I do not really understand how the compressors are being loaded here.
Compressor loading works the way it was done in FileIO for loading backends a few years ago. I'm not sure this has changed.
The logic finds out which decompressor is needed for loading and checks if the library is loaded. If not, it checks if it is installed - in which case it then tries to load it.
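A rough sketch of that kind of dynamic loading (hypothetical names, not JLD2's actual code):

```julia
# Sketch: given the codec package a file's compressed data requires,
# return its module if already loaded, otherwise try to import it.
function locate_codec(pkg::Symbol)
    # already loaded in this session?
    for (pkgid, mod) in Base.loaded_modules
        pkgid.name == String(pkg) && return mod
    end
    # installed but not loaded? attempt a dynamic import
    try
        return Base.require(Main, pkg)
    catch
        error("$pkg is required to decompress this file; " *
              "install it with `import Pkg; Pkg.add(\"$pkg\")`")
    end
end
```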
Pkg extensions would not help here, since there are zero lines of code specific to the individual backends.
Pkg extensions would not help here, since there are zero lines of code specific to the individual backends.
That's not very clear to me. Right now the loading mechanism ends up being very dynamic. I think this could be studied further.
I think this is a discussion for another day. I'll merge this for now.
fixes #357
@mkitti I'm not sure what the 4-element tuple in ID_TO_COMPRESSOR is supposed to be? Package name, compressor name, decompressor name, and some short name in caps?