JuliaIO / JLD2.jl

HDF5-compatible file format in pure Julia

Inexact Error during saving large data. #399

Closed racinmat closed 1 year ago

racinmat commented 2 years ago

I tried to save some large data and it fails, so I made this MWE. Saving fails in both cases, with and without compression:

(@v1.7) pkg> st
      Status `~/.julia/environments/v1.7/Project.toml`
  [634d3b9d] DrWatson v2.7.5
  [033835bb] JLD2 v0.4.22

julia> a_string = "AAaa" ^ 2^32;

julia> using JLD2

julia> JLD2.@save "a_long_file.jld2" {compress=true} a_string
ERROR: InexactError: trunc(UInt32, 17179869184)
Stacktrace:
  [1] throw_inexacterror(f::Symbol, #unused#::Type{UInt32}, val::Int64)
    @ Core ./boot.jl:612
  [2] checked_trunc_uint
    @ ./boot.jl:642 [inlined]
  [3] toUInt32
    @ ./boot.jl:726 [inlined]
  [4] UInt32
    @ ./boot.jl:766 [inlined]
  [5] convert
    @ ./number.jl:7 [inlined]
  [6] BasicDatatype
    @ ~/.julia/packages/JLD2/k9Gt0/src/datatypes.jl:22 [inlined]
  [7] StringDatatype
    @ ~/.julia/packages/JLD2/k9Gt0/src/datatypes.jl:29 [inlined]
  [8] h5type
    @ ~/.julia/packages/JLD2/k9Gt0/src/data/specialcased_types.jl:20 [inlined]
  [9] h5type
    @ ~/.julia/packages/JLD2/k9Gt0/src/data/writing_datatypes.jl:132 [inlined]
 [10] write_dataset
    @ ~/.julia/packages/JLD2/k9Gt0/src/datasets.jl:520 [inlined]
 [11] write(g::JLD2.Group{JLD2.JLDFile{JLD2.MmapIO}}, name::String, obj::String, wsession::JLD2.JLDWriteSession{Dict{UInt64, JLD2.RelOffset}}; compress::Nothing)
    @ JLD2 ~/.julia/packages/JLD2/k9Gt0/src/compression.jl:87
 [12] #write#87
    @ ~/.julia/packages/JLD2/k9Gt0/src/compression.jl:71 [inlined]
 [13] write(f::JLD2.JLDFile{JLD2.MmapIO}, name::String, obj::String, wsession::JLD2.JLDWriteSession{Dict{UInt64, JLD2.RelOffset}})
    @ JLD2 ~/.julia/packages/JLD2/k9Gt0/src/compression.jl:71
 [14] top-level scope
    @ ~/.julia/packages/JLD2/k9Gt0/src/loadsave.jl:66

julia> JLD2.@save "a_long_file.jld2" a_string
ERROR: InexactError: trunc(UInt32, 17179869184)
Stacktrace:
  [1] throw_inexacterror(f::Symbol, #unused#::Type{UInt32}, val::Int64)
    @ Core ./boot.jl:612
  [2] checked_trunc_uint
    @ ./boot.jl:642 [inlined]
  [3] toUInt32
    @ ./boot.jl:726 [inlined]
  [4] UInt32
    @ ./boot.jl:766 [inlined]
  [5] convert
    @ ./number.jl:7 [inlined]
  [6] BasicDatatype
    @ ~/.julia/packages/JLD2/k9Gt0/src/datatypes.jl:22 [inlined]
  [7] StringDatatype
    @ ~/.julia/packages/JLD2/k9Gt0/src/datatypes.jl:29 [inlined]
  [8] h5type
    @ ~/.julia/packages/JLD2/k9Gt0/src/data/specialcased_types.jl:20 [inlined]
  [9] h5type
    @ ~/.julia/packages/JLD2/k9Gt0/src/data/writing_datatypes.jl:132 [inlined]
 [10] write_dataset
    @ ~/.julia/packages/JLD2/k9Gt0/src/datasets.jl:520 [inlined]
 [11] write(g::JLD2.Group{JLD2.JLDFile{JLD2.MmapIO}}, name::String, obj::String, wsession::JLD2.JLDWriteSession{Dict{UInt64, JLD2.RelOffset}}; compress::Nothing)
    @ JLD2 ~/.julia/packages/JLD2/k9Gt0/src/compression.jl:87
 [12] #write#87
    @ ~/.julia/packages/JLD2/k9Gt0/src/compression.jl:71 [inlined]
 [13] write(f::JLD2.JLDFile{JLD2.MmapIO}, name::String, obj::String, wsession::JLD2.JLDWriteSession{Dict{UInt64, JLD2.RelOffset}})
    @ JLD2 ~/.julia/packages/JLD2/k9Gt0/src/compression.jl:71
 [14] top-level scope
    @ ~/.julia/packages/JLD2/k9Gt0/src/loadsave.jl:66

JonasIsensee commented 2 years ago

Hi @racinmat, thank you for reporting this. This appears to be a string-related bug or an unimplemented feature. Please note that compression only works for arrays, not strings, so the failure is independent of the compress option.

Here's a workaround with the added benefit of allowing for compression:

julia> a_string = "AAaa" ^ 2^32;
julia> a_vector = Vector{UInt8}(a_string);
julia> jldsave("a_long_file.jld2", true;  a_vector)
julia> a_loaded = String(load("a_long_file.jld2", "a_vector"));
julia> a_loaded == a_string

Tbh, I'm a bit curious what kind of problem would generate such ridiculously long strings. Your "MWE" runs out of memory on my machine with 32 GB RAM + 18 GB swap. I've primarily seen people with a background in C++ working with (and over-using) strings. (If that's you, there may be serious performance improvements in switching to array-based code.)
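The byte-vector conversion used in the workaround can be checked at small scale without writing a file. This is a minimal sketch using only Base; the short string is a stand-in for the 16 GiB one, and `a_restored` is an illustrative name that does not appear in the thread.

```julia
a_string = "AAaa" ^ 8                  # small stand-in for the very long string
a_vector = Vector{UInt8}(a_string)     # copies the string's bytes into a new array

# String(v) takes ownership of v's memory and leaves v empty,
# so convert a copy if the vector is still needed afterwards.
a_restored = String(copy(a_vector))

println(a_restored == a_string)
```

Because `Vector{UInt8}(a_string)` makes a copy, the workaround temporarily needs roughly twice the memory of the original string.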

racinmat commented 2 years ago

Thanks. I used strings here because that was easiest for me to implement, but in fact I got a weird error when serializing a different data structure, so if this is string-specific, I will try to make an MWE without long strings.

felixhorger commented 2 years ago

Hi, I am afraid there is more to this than stated above. For example, saving an 8 GB Float64 array with jldsave("test.jld2", true; a=zeros(Float64, 1000, 1000000)) fails on my machine with the same setup and the same error as above (apart from the size, i.e. trunc(UInt32, 8000000000)). Why is the size attribute here not a UInt64? As far as I can see, that is causing the issue. Cheers, Felix

JonasIsensee commented 2 years ago

Hi @felixhorger,

This is indeed interesting. I won't be able to do any testing myself for another two weeks, but I can say this much.

The size attribute is 32 bits because the HDF5 format spec says so. That will not be changed. However, this should also never really be a problem: a BasicDatatype should be used for basic/simple objects (array elements rather than big things).

Anything on the order of typemax(UInt32) certainly doesn't fit that definition. Since you're hitting this error anyway, we could consider switching to a different encoding for objects larger than X. (Or maybe it's a different bug altogether.)
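To make the numbers in the two error messages concrete, here is a small sketch using only Base that recomputes both byte counts and shows why neither fits a 32-bit size field. The helper `fits_in_uint32` is a hypothetical name introduced for illustration; it is not part of JLD2.

```julia
# Byte counts from the two reports in this thread.
string_bytes = sizeof("AAaa") * 2^32               # 4 bytes repeated 2^32 times
array_bytes  = 1000 * 1_000_000 * sizeof(Float64)  # 1000 × 1_000_000 Float64 values

# Hypothetical helper: true if `n` can be stored in a UInt32 without loss.
fits_in_uint32(n::Integer) = 0 <= n <= typemax(UInt32)

# Converting an out-of-range value throws instead of silently wrapping.
threw = try
    UInt32(string_bytes)
    false
catch err
    err isa InexactError
end

println((string_bytes, array_bytes, Int(typemax(UInt32)), threw))
```

Both values match the arguments shown in the `trunc(UInt32, ...)` errors above, and both exceed `typemax(UInt32)`.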

felixhorger commented 2 years ago

You are right! The error goes all the way back to CodecZlib.jl, which can only take a block of ~4 GB at a time. So either JLD2.jl switches the codec for large objects, or the way the transcode function works with the Zlib codec gets corrected in CodecZlib.jl (which I think is the Julian way).

JonasIsensee commented 2 years ago

That would be an option.

Another one would be to implement what is called HDF5 array chunking. This feature is currently not implemented in JLD2.

It allows storing a large array (N,M,....) in chunks of size (n,m,...) which may themselves be compressed.

This could (in principle) be done automatically for arrays exceeding a certain size.
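As a rough illustration of the idea (not the HDF5 chunk layout and not an existing JLD2 feature), the sketch below splits a large array's linear index range into pieces that each stay under a byte limit, so every piece could be compressed separately. `chunk_ranges` and its `max_bytes` keyword are hypothetical names, and real HDF5 chunks are N-dimensional blocks rather than linear ranges.

```julia
# Hypothetical helper: split 1:n into consecutive ranges whose size in bytes
# stays at or below `max_bytes` (linear indexing only, for illustration).
function chunk_ranges(n::Integer, elsize::Integer; max_bytes::Integer = typemax(UInt32))
    per_chunk = max(1, max_bytes ÷ elsize)   # whole elements that fit in one chunk
    return [i:min(i + per_chunk - 1, n) for i in 1:per_chunk:n]
end

# The 8 GB Float64 array from the report above (1000 × 1_000_000 elements).
n = 1000 * 1_000_000
ranges = chunk_ranges(n, sizeof(Float64))

println(length(ranges))
```

Each range could then be passed to the compressor on its own, for example via a view such as `view(vec(A), r)` for an array `A`, which keeps every input below the limit discussed in this thread.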

felixhorger commented 2 years ago

I think I solved the issue by modifying CodecZlib.jl, see this pull request. I can now jldsave and load arrays larger than typemax(UInt32) bytes.

BioTurboNick commented 1 year ago

I just encountered this issue. v0.4.29

May be related, I'm using CodecBzip2.jl

JonasIsensee commented 1 year ago

Can you test this with CodecBzip2.jl directly?

BioTurboNick commented 1 year ago

Actually, it might be different - the stack trace is quite different. I'll just open a new issue and go from there.