JuliaIO / JLD2.jl

HDF5-compatible file format in pure Julia

Add Zstd compression #560

Closed: milankl closed this 2 days ago

milankl commented 1 month ago

fixes #357

@mkitti I'm not sure what the 4-element tuple in ID_TO_COMPRESSOR is supposed to be. Package name, compressor name, decompressor name, and some short name in caps?
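For context: if that guess is right, an entry would look roughly like this hypothetical sketch (using the registered Zstd filter ID 32015 that shows up in the h5ls output below; the exact symbols may differ from the actual table):

UInt16(32015) => (:CodecZstd, :ZstdCompressor, :ZstdDecompressor, "ZSTD")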

codecov[bot] commented 1 month ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 86.94%. Comparing base (ccd60eb) to head (761217e). Report is 8 commits behind head on master.

Additional details and impacted files

```diff
@@            Coverage Diff            @@
##           master     #560    +/-   ##
=========================================
  Coverage   86.94%   86.94%
=========================================
  Files          31       31
  Lines        4311     4321    +10
=========================================
+ Hits         3748     3757     +9
- Misses        563      564     +1
```


mkitti commented 1 month ago

I'm not an expert on this package, but you seem to be correct. https://github.com/JuliaIO/JLD2.jl/blob/ccd60ebc7ea68d0461f062fc3b1a639df28e82e9/src/compression.jl#L160

milankl commented 1 month ago

Manual test:

julia> using JLD2, CodecZstd

julia> A = zeros(1000,1000);

julia> A[1] = rand()
0.44335952762563824

julia> sizeof(A)/1000^2   # 8 MB array
8.0

julia> save("test_without_compression.jld2", "A", A)

julia> save("test_with_compression.jld2", "A", A, compress=ZstdCompressor())

julia> A == load("test_with_compression.jld2", "A")
true

julia> A == load("test_without_compression.jld2", "A")
true

and I then have two files on disk:

-rw-r--r--@   1 milan  staff   1.1K 13 May 12:59 test_with_compression.jld2
-rw-r--r--@   1 milan  staff   7.6M 13 May 12:58 test_without_compression.jld2

one uncompressed at about 8 MB and one compressed down to about 1 KB.

mkitti commented 1 month ago

You should test if you can load the file via HDF5.jl.

using HDF5
using H5Zzstd

h5open("test_with_compression.jld2") do h5f
   h5f["A"][]
end

milankl commented 1 month ago

Okay, that's not working correctly:

julia> h5open("test_with_compression.jld2") do h5f
          h5f["A"][]
       end
1000×1000 Matrix{Float64}:
   0.0           0.0           …  0.0  0.0  0.0  0.0  0.0  0.0  0.0
   0.0           0.0              0.0  0.0  0.0  0.0  0.0  0.0  0.0
   7.41892e-310  0.0              0.0  0.0  0.0  0.0  0.0  0.0  0.0
   1.95e-321     0.0              0.0  0.0  0.0  0.0  0.0  0.0  0.0
   7.41892e-310  0.0              0.0  0.0  0.0  0.0  0.0  0.0  0.0
   1.42e-321     0.0           …  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 NaN             0.0              0.0  0.0  0.0  0.0  0.0  0.0  0.0
 NaN             0.0              0.0  0.0  0.0  0.0  0.0  0.0  0.0
   7.41892e-310  0.0              0.0  0.0  0.0  0.0  0.0  0.0  0.0
   3.6e-322      0.0              0.0  0.0  0.0  0.0  0.0  0.0  0.0
   ⋮                           ⋱            ⋮
   0.0           4.0e-323      …  0.0  0.0  0.0  0.0  0.0  0.0  0.0
   0.0           0.0              0.0  0.0  0.0  0.0  0.0  0.0  0.0
   0.0           0.0              0.0  0.0  0.0  0.0  0.0  0.0  0.0
   0.0           0.0              0.0  0.0  0.0  0.0  0.0  0.0  0.0
   0.0           5.17085e170      0.0  0.0  0.0  0.0  0.0  0.0  0.0
   0.0           9.56977e-315  …  0.0  0.0  0.0  0.0  0.0  0.0  0.0
   0.0           0.0              0.0  0.0  0.0  0.0  0.0  0.0  0.0
   0.0           0.0              0.0  0.0  0.0  0.0  0.0  0.0  0.0
   0.0           1.11254e-308     0.0  0.0  0.0  0.0  0.0  0.0  0.0
   0.0           1.11255e-308     0.0  0.0  0.0  0.0  0.0  0.0  0.0

(it should be A[1] ≈ 0.4434 and zero everywhere else), even though the file without Zstd compression is loaded correctly. Why is that?

edit: The same works perfectly fine with compress=ZlibCompressor(), but I'm also confused that the decompressed matrix is different every time it's decompressed...

mkitti commented 1 month ago

So the problem is that we are using ZSTD_compressStream and ZSTD_decompressStream here, while the upstream reference filter uses the one-shot ZSTD_compress and ZSTD_decompress.

https://github.com/HDFGroup/hdf5_plugins/blob/master/ZSTD/src/H5Zzstd.c

https://facebook.github.io/zstd/zstd_manual.html
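For illustration only (not code from this PR): the one-shot "simple" API compresses a whole buffer in a single call and records the content size in the frame header, which is what the reference filter expects. A minimal Julia sketch via ccall into libzstd, assuming Zstd_jll is available; the helper names are hypothetical and error checks (ZSTD_isError) are omitted:

using Zstd_jll   # provides the libzstd library handle

# One-shot compression, as in the reference H5Zzstd filter (ZSTD_compress).
function zstd_compress_oneshot(src::Vector{UInt8}, level::Integer=3)
    bound = ccall((:ZSTD_compressBound, libzstd), Csize_t, (Csize_t,), sizeof(src))
    dst = Vector{UInt8}(undef, bound)
    n = ccall((:ZSTD_compress, libzstd), Csize_t,
              (Ptr{UInt8}, Csize_t, Ptr{UInt8}, Csize_t, Cint),
              dst, bound, src, sizeof(src), level)
    resize!(dst, n)
end

# One-shot decompression (ZSTD_decompress); the frame header stores the original size.
function zstd_decompress_oneshot(src::Vector{UInt8})
    len = ccall((:ZSTD_getFrameContentSize, libzstd), Culonglong,
                (Ptr{UInt8}, Csize_t), src, sizeof(src))
    dst = Vector{UInt8}(undef, Int(len))
    n = ccall((:ZSTD_decompress, libzstd), Csize_t,
              (Ptr{UInt8}, Csize_t, Ptr{UInt8}, Csize_t),
              dst, sizeof(dst), src, sizeof(src))
    resize!(dst, n)
end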

mkitti commented 1 month ago

OK, I'm going to start working on a ZstdFrameCompressor / ZstdFrameDecompressor. I'm not sure if that is the final name yet, but it's basically the non-streaming ("NotStream") compressor / decompressor.

I'm not completely sure how to make that work in the TranscodingStreams API, but we'll see.

milankl commented 1 month ago

Awesome, let me know if I can help!

milankl commented 1 month ago

@mkitti I've already written the tests here using CodecZstd's ZstdFrameCompressor; however, they are commented out until https://github.com/JuliaIO/CodecZstd.jl/pull/46 is merged and released.

milankl commented 1 month ago

I've changed the compressor here from ZstdFrameCompressor to ZstdCompressor, but we need a new release of CodecZstd.jl to actually test this properly: https://github.com/JuliaIO/CodecZstd.jl/pull/52#issuecomment-2140406751

mkitti commented 1 month ago

I think it should still be ZstdFrameCompressor. Sorry for the confusion. I'll work on a release tonight.

mkitti commented 1 month ago

Note that CodecZstd.jl now contains ZstdFrameCompressor, so there is no longer a dependency blocking this.

JonasIsensee commented 2 weeks ago

Hi, what is the status on this?

Locally, a round trip through ZstdFrameCompressor and ZstdDecompressor works just fine within JLD2, but not via HDF5. E.g.

julia> using JLD2, CodecZstd

julia> A = zeros(1000,1000);

julia> A[1] = rand()
0.44335952762563824

julia> sizeof(A)/1000^2   # 8 MB array
8.0

julia> save("test_without_compression.jld2", "A", A)

julia> save("test_with_compression.jld2", "A", A, compress=ZstdFrameCompressor())

julia> A == load("test_with_compression.jld2", "A")
true

julia> A == load("test_without_compression.jld2", "A")
true

With h5dump I get

HDF5 "test_with_compression.jld2" {
GROUP "/" {
   DATASET "A" {
      DATATYPE  H5T_IEEE_F64LE
      DATASPACE  SIMPLE { ( 1000, 1000 ) / ( 1000, 1000 ) }
      DATA {h5dump error: unable to print data

      }
   }
}
}

or

julia> h5open("test_with_compression.jld2") do h5f
                 h5f["A"][]
              end
┌ Error: Exception while generating log record in module H5Zzstd at /home/jisensee/.julia/packages/H5Zzstd/ad081/src/H5Zzstd.jl:86
│   exception =
│    UndefVarError: `err` not defined
│    Stacktrace:
│      [1] logging_error(logger::Any, level::Any, _module::Any, group::Any, id::Any, filepath::Any, line::Any, err::Any, real::Bool)
│        @ Base.CoreLogging ./logging.jl:478
│      [2] #invokelatest#2
│        @ ./essentials.jl:892 [inlined]
│      [3] invokelatest
│        @ ./essentials.jl:889 [inlined]
│      [4] macro expansion
│        @ ./logging.jl:364 [inlined]
│      [5] H5Z_filter_zstd(flags::UInt32, cd_nelmts::UInt64, cd_values::Ptr{UInt32}, nbytes::UInt64, buf_size::Ptr{UInt64}, buf::Ptr{Ptr{Nothing}})
│        @ H5Zzstd ~/.julia/packages/H5Zzstd/ad081/src/H5Zzstd.jl:86
│      [6] h5d_read(dataset_id::HDF5.Dataset, mem_type_id::HDF5.Datatype, mem_space_id::HDF5.Dataspace, file_space_id::HDF5.Dataspace, xfer_plist_id::HDF5.DatasetTransferProperties, buf::Matrix{Float64})
│        @ HDF5.API ~/.julia/packages/HDF5/Z859u/src/api/functions.jl:796
│      [7] _generic_read(::HDF5.Dataset, ::HDF5.Datatype, ::Type{Float64}, ::Nothing)
│        @ HDF5 ~/.julia/packages/HDF5/Z859u/src/readwrite.jl:180
│      [8] generic_read(::HDF5.Dataset, ::HDF5.Datatype, ::Type{Float64})
│        @ HDF5 ~/.julia/packages/HDF5/Z859u/src/readwrite.jl:146
│      [9] getindex(::HDF5.Dataset)
│        @ HDF5 ~/.julia/packages/HDF5/Z859u/src/readwrite.jl:59
│     [10] (::var"#5#6")(h5f::HDF5.File)
│        @ Main ./REPL[6]:2
│     [11] (::HDF5.var"#17#18"{HDF5.HDF5Context, @Kwargs{}, var"#5#6", HDF5.File})()
│        @ HDF5 ~/.julia/packages/HDF5/Z859u/src/file.jl:101
│     [12] task_local_storage(body::HDF5.var"#17#18"{HDF5.HDF5Context, @Kwargs{}, var"#5#6", HDF5.File}, key::Symbol, val::HDF5.HDF5Context)
│        @ Base ./task.jl:297
│     [13] h5open(f::var"#5#6", args::String; context::HDF5.HDF5Context, pv::@Kwargs{})
│        @ HDF5 ~/.julia/packages/HDF5/Z859u/src/file.jl:96
│     [14] h5open(f::Function, args::String)
│        @ HDF5 ~/.julia/packages/HDF5/Z859u/src/file.jl:94
│     [15] top-level scope
│        @ REPL[6]:1
│     [16] eval
│        @ ./boot.jl:385 [inlined]
│     [17] eval_user_input(ast::Any, backend::REPL.REPLBackend, mod::Module)
│        @ REPL ~/.julia/juliaup/julia-1.10.2+0.x64.linux.gnu/share/julia/stdlib/v1.10/REPL/src/REPL.jl:150
│     [18] repl_backend_loop(backend::REPL.REPLBackend, get_module::Function)
│        @ REPL ~/.julia/juliaup/julia-1.10.2+0.x64.linux.gnu/share/julia/stdlib/v1.10/REPL/src/REPL.jl:246
│     [19] start_repl_backend(backend::REPL.REPLBackend, consumer::Any; get_module::Function)
│        @ REPL ~/.julia/juliaup/julia-1.10.2+0.x64.linux.gnu/share/julia/stdlib/v1.10/REPL/src/REPL.jl:231
│     [20] run_repl(repl::REPL.AbstractREPL, consumer::Any; backend_on_current_task::Bool, backend::Any)
│        @ REPL ~/.julia/juliaup/julia-1.10.2+0.x64.linux.gnu/share/julia/stdlib/v1.10/REPL/src/REPL.jl:389
│     [21] run_repl(repl::REPL.AbstractREPL, consumer::Any)
│        @ REPL ~/.julia/juliaup/julia-1.10.2+0.x64.linux.gnu/share/julia/stdlib/v1.10/REPL/src/REPL.jl:375
│     [22] (::Base.var"#1013#1015"{Bool, Bool, Bool})(REPL::Module)
│        @ Base ./client.jl:432
│     [23] #invokelatest#2
│        @ ./essentials.jl:892 [inlined]
│     [24] invokelatest
│        @ ./essentials.jl:889 [inlined]
│     [25] run_main_repl(interactive::Bool, quiet::Bool, banner::Bool, history_file::Bool, color_set::Bool)
│        @ Base ./client.jl:416
│     [26] exec_options(opts::Base.JLOptions)
│        @ Base ./client.jl:333
│     [27] _start()
│        @ Base ./client.jl:552
└ @ H5Zzstd ~/.julia/packages/H5Zzstd/ad081/src/H5Zzstd.jl:86
24-element Vector{Base.StackTraces.StackFrame}:
 H5Z_filter_zstd(flags::UInt32, cd_nelmts::UInt64, cd_values::Ptr{UInt32}, nbytes::UInt64, buf_size::Ptr{UInt64}, buf::Ptr{Ptr{Nothing}}) at H5Zzstd.jl:40
 h5d_read(dataset_id::HDF5.Dataset, mem_type_id::HDF5.Datatype, mem_space_id::HDF5.Dataspace, file_space_id::HDF5.Dataspace, xfer_plist_id::HDF5.DatasetTransferProperties, buf::Matrix{Float64}) at functions.jl:796
 _generic_read(::HDF5.Dataset, ::HDF5.Datatype, ::Type{Float64}, ::Nothing) at readwrite.jl:180
 generic_read(::HDF5.Dataset, ::HDF5.Datatype, ::Type{Float64}) at readwrite.jl:146
 getindex(::HDF5.Dataset) at readwrite.jl:59
 (::var"#5#6")(h5f::HDF5.File) at REPL[6]:2
 (::HDF5.var"#17#18"{HDF5.HDF5Context, Base.Pairs{Symbol, Union{}, Tuple{}, @NamedTuple{}}, var"#5#6", HDF5.File})() at file.jl:101
 task_local_storage(body::HDF5.var"#17#18"{HDF5.HDF5Context, Base.Pairs{Symbol, Union{}, Tuple{}, @NamedTuple{}}, var"#5#6", HDF5.File}, key::Symbol, val::HDF5.HDF5Context) at task.jl:297
 h5open(f::var"#5#6", args::String; context::HDF5.HDF5Context, pv::Base.Pairs{Symbol, Union{}, Tuple{}, @NamedTuple{}}) at file.jl:96
 h5open(f::Function, args::String) at file.jl:94
 top-level scope at REPL[6]:1
 eval at boot.jl:385 [inlined]
 eval_user_input(ast::Any, backend::REPL.REPLBackend, mod::Module) at REPL.jl:150
 repl_backend_loop(backend::REPL.REPLBackend, get_module::Function) at REPL.jl:246
 start_repl_backend(backend::REPL.REPLBackend, consumer::Any; get_module::Function) at REPL.jl:231
 kwcall(::NamedTuple, ::typeof(REPL.start_repl_backend), backend::REPL.REPLBackend, consumer::Any) at REPL.jl:228
 run_repl(repl::REPL.AbstractREPL, consumer::Any; backend_on_current_task::Bool, backend::Any) at REPL.jl:389
 run_repl(repl::REPL.AbstractREPL, consumer::Any) at REPL.jl:375
 (::Base.var"#1013#1015"{Bool, Bool, Bool})(REPL::Module) at client.jl:432
 #invokelatest#2 at essentials.jl:892 [inlined]
 invokelatest at essentials.jl:889 [inlined]
 run_main_repl(interactive::Bool, quiet::Bool, banner::Bool, history_file::Bool, color_set::Bool) at client.jl:416
 exec_options(opts::Base.JLOptions) at client.jl:333
 _start() at client.jl:552
ERROR: HDF5.API.H5Error: Error reading dataset /A
libhdf5 Stacktrace:
 [1] H5Z_pipeline: Data filters/Read failed
     filter returned failure during read
  ⋮
Stacktrace:
 [1] _generic_read(::HDF5.Dataset, ::HDF5.Datatype, ::Type{Float64}, ::Nothing)
   @ HDF5 ~/.julia/packages/HDF5/Z859u/src/readwrite.jl:209
 [2] generic_read(::HDF5.Dataset, ::HDF5.Datatype, ::Type{Float64})
   @ HDF5 ~/.julia/packages/HDF5/Z859u/src/readwrite.jl:146
 [3] getindex(::HDF5.Dataset)
   @ HDF5 ~/.julia/packages/HDF5/Z859u/src/readwrite.jl:59
 [4] (::var"#5#6")(h5f::HDF5.File)
   @ Main ./REPL[6]:2
 [5] (::HDF5.var"#17#18"{HDF5.HDF5Context, @Kwargs{}, var"#5#6", HDF5.File})()
   @ HDF5 ~/.julia/packages/HDF5/Z859u/src/file.jl:101
 [6] task_local_storage(body::HDF5.var"#17#18"{…}, key::Symbol, val::HDF5.HDF5Context)
   @ Base ./task.jl:297
 [7] h5open(f::var"#5#6", args::String; context::HDF5.HDF5Context, pv::@Kwargs{})
   @ HDF5 ~/.julia/packages/HDF5/Z859u/src/file.jl:96
 [8] h5open(f::Function, args::String)
   @ HDF5 ~/.julia/packages/HDF5/Z859u/src/file.jl:94
 [9] top-level scope
   @ REPL[6]:1

caused by: HDF5.API.H5Error: Error reading dataset /A
libhdf5 Stacktrace:
 [1] H5Z_pipeline: Data filters/Read failed
     filter returned failure during read
  ⋮
Stacktrace:
  [1] macro expansion
    @ ~/.julia/packages/HDF5/Z859u/src/api/error.jl:18 [inlined]
  [2] h5d_read(dataset_id::HDF5.Dataset, mem_type_id::HDF5.Datatype, mem_space_id::HDF5.Dataspace, file_space_id::HDF5.Dataspace, xfer_plist_id::HDF5.DatasetTransferProperties, buf::Matrix{…})
    @ HDF5.API ~/.julia/packages/HDF5/Z859u/src/api/functions.jl:800
  [3] _generic_read(::HDF5.Dataset, ::HDF5.Datatype, ::Type{Float64}, ::Nothing)
    @ HDF5 ~/.julia/packages/HDF5/Z859u/src/readwrite.jl:180
  [4] generic_read(::HDF5.Dataset, ::HDF5.Datatype, ::Type{Float64})
    @ HDF5 ~/.julia/packages/HDF5/Z859u/src/readwrite.jl:146
  [5] getindex(::HDF5.Dataset)
    @ HDF5 ~/.julia/packages/HDF5/Z859u/src/readwrite.jl:59
  [6] (::var"#5#6")(h5f::HDF5.File)
    @ Main ./REPL[6]:2
  [7] (::HDF5.var"#17#18"{HDF5.HDF5Context, @Kwargs{}, var"#5#6", HDF5.File})()
    @ HDF5 ~/.julia/packages/HDF5/Z859u/src/file.jl:101
  [8] task_local_storage(body::HDF5.var"#17#18"{…}, key::Symbol, val::HDF5.HDF5Context)
    @ Base ./task.jl:297
  [9] h5open(f::var"#5#6", args::String; context::HDF5.HDF5Context, pv::@Kwargs{})
    @ HDF5 ~/.julia/packages/HDF5/Z859u/src/file.jl:96
 [10] h5open(f::Function, args::String)
    @ HDF5 ~/.julia/packages/HDF5/Z859u/src/file.jl:94
 [11] top-level scope
    @ REPL[6]:1
Some type information was truncated. Use `show(err)` to see complete types.

mkitti commented 2 weeks ago

This should work now:

julia> using H5Zzstd, HDF5

julia> h5open("test_with_compression.jld2") do h5f
           h5f["A"][]
       end

julia> using HDF5_jll

julia> run(`$(HDF5_jll.h5ls()) -va test_with_compression.jld2`);
Opened "test_with_compression.jld2" with sec2 driver.
A                        Dataset {1000/1000, 1000/1000}
    Location:  1:48
    Links:     1
    Chunks:    {1000, 1000} 8000000 bytes
    Storage:   8000000 logical bytes, 283 allocated bytes, 2826855.12% utilization
    Filter-0:  ZSTD-32015  {5}
    Type:      native double
    Address: 189
           Flags    Bytes     Address          Logical Offset
        ========== ======== ========== ==============================
        0x00000000      283        189 [0, 0, 0]

JonasIsensee commented 2 weeks ago

> This should work now: […]

Thank you! I can reproduce locally.

Thanks for your work :)

milankl commented 1 week ago

@mkitti do you want to include a compatibility constraint on CodecZstd v0.8.3? Probably not, given that you suggested the extension? Otherwise this seems ready to be merged?

JonasIsensee commented 1 week ago

Can this even be done, given that CodecZstd isn't even a dependency or weak dep?

milankl commented 1 week ago

> Can this even be done, given that CodecZstd isn't even a dependency or weak dep?

riiiiiight, forgot that. Any reason then not to merge this?

mkitti commented 1 week ago

To be honest, I do not really understand how the compressors are being loaded here.

mkitti commented 1 week ago

I think that we should declare CodecZlib, CodecBzip2, and now CodecZstd as weak dependencies with compat ranges. Eventually, it would be good to consider how to take advantage of package extensions.

Over in https://github.com/JuliaIO/HDF5.jl/pull/1160, I am working on converting all the HDF5 filter packages into package extensions.
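If we go that route, a Project.toml sketch might look roughly like this (the extension name and compat bound are illustrative, not what this PR ships, and the UUID placeholder must be replaced with the registered one):

[weakdeps]
CodecZstd = "<registered UUID of CodecZstd>"

[extensions]
JLD2CodecZstdExt = "CodecZstd"

[compat]
CodecZstd = "0.8.3"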

JonasIsensee commented 6 days ago

> To be honest, I do not really understand how the compressors are being loaded here.

Compressor loading works the way it was done in FileIO for loading backends a few years ago. I'm not sure this has changed.

The logic finds out which decompressor is needed for loading and checks whether that library is already loaded. If not, it checks whether it is installed, in which case it then tries to load it (roughly as sketched below).

Pkg extensions would not help here, since there are zero lines of code specific to the individual backends.
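For illustration, the strategy described above could be sketched roughly like this (a sketch of the described lookup, not JLD2's actual code; the function name is hypothetical):

# Given the package a filter maps to (e.g. :CodecZstd), reuse it if it is already
# loaded; otherwise load it if it is available in the active environment.
function locate_decompressor_module(pkgname::Symbol)
    pkgid = Base.identify_package(String(pkgname))    # nothing if unavailable
    pkgid === nothing && error("$pkgname is required to read this dataset; please install it")
    haskey(Base.loaded_modules, pkgid) && return Base.loaded_modules[pkgid]
    return Base.require(pkgid)                         # load it on demand
end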

mkitti commented 5 days ago

> Pkg extensions would not help here, since there are zero lines of code specific to the individual backends.

That's not very clear to me. Right now the loading mechanism ends up being very dynamic. I think this could be studied further.

JonasIsensee commented 2 days ago

I think this is a discussion for another day. I'll merge this for now.