Hi @racinmat, thank you for reporting this. This appears to be a bug related to strings / a not-yet-implemented feature. Please note that compression only works for arrays, not strings, so this is independent of the compression flag.
Here's a workaround with the added benefit of allowing for compression:
julia> a_string = "AAaa" ^ 2^32;
julia> a_vector = Vector{UInt8}(a_string);
julia> jldsave("a_long_file.jld2", true; a_vector)
julia> a_loaded = String(load("a_long_file.jld2", "a_vector"));
julia> a_loaded == a_string
Tbh: I'm a bit curious what kind of problem would generate such ridiculously long strings. Your "mwe" runs out of memory on my 32 GB + 18 GB swap machine. I've primarily seen people with a background in C++ working with (and over-using) strings. (If that's you, there may be serious performance improvements in switching to array-based code.)
Thanks. I used strings here because they were easiest for me to implement, but in fact I got a weird error when serializing a different data structure, so if this is string-specific, I will try to make an MWE without long strings.
Hi, I am afraid there is more to this than stated above; e.g. saving an 8 GB Float64 array
jldsave("test.jld2", true; a=zeros(Float64, 1000, 1000000))
fails on my machine with the same setup and the same error (apart from the size, i.e. trunc(UInt32, 8000000000)) as above.
Why is the size attribute here not a UInt64? As far as I can see, that is causing the issue.
Cheers, Felix
Hi @felixhorger,
this is indeed interesting. I won't be able to do any testing myself for another two weeks, but I can say this much.
The size attribute is 32 bits because the HDF5 format spec says so. That will not be changed. However, this should also never really be a problem: a BasicDatatype should be used for basic/simple objects (array elements rather than big things).
Anything on the order of typemax(UInt32) certainly doesn't fit that definition. Since you're hitting this error anyway, we could consider switching to a different encoding for objects larger than some threshold X. (Or maybe it's a different bug altogether.)
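For reference, a quick REPL check (illustrative, not from the original report) shows why the conversion fails: the example array's byte size exceeds typemax(UInt32), so the truncation necessarily throws:

julia> 1000 * 1000000 * sizeof(Float64)  # bytes in the failing array
8000000000

julia> typemax(UInt32)
0xffffffff

julia> trunc(UInt32, 8000000000)
ERROR: InexactError: trunc(UInt32, 8000000000)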
You are right! The error goes all the way back to CodecZlib.jl. It can only take a ~4 GB block at a time. So either the codec is switched out for large objects in JLD2.jl, or the way the transcode function works with the Zlib codec should be corrected in CodecZlib.jl (which I think is the Julian way).
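As a rough illustration of the second option (my sketch, assuming data::Vector{UInt8} holds the payload): going through CodecZlib.jl's stream interface instead of a single transcode call feeds zlib through an internal buffer, so no single zlib call sees a >4 GB block:

using CodecZlib

buf = IOBuffer()
zs = ZlibCompressorStream(buf)   # stream wrapper around the zlib codec
write(zs, data)                  # consumed incrementally via the internal buffer
close(zs)                        # flushes the remaining compressed bytes
compressed = take!(buf)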
That would be an option.
Another option would be to implement what HDF5 calls array chunking. This feature is currently not implemented in JLD2.
It allows storing a large array (N,M,....) in chunks of size (n,m,...) which may themselves be compressed.
This could (in principle) be done automatically for arrays exceeding a certain size.
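A minimal sketch of the chunking idea (illustrative only, not JLD2's API): split a flat byte buffer into fixed-size chunks and compress each one independently, so no codec call ever sees more than one chunk:

using CodecZlib

# Illustrative only: compress a byte buffer in independent chunks,
# as HDF5 chunked storage does for (n, m, ...) blocks of an array.
function compress_chunks(a::Vector{UInt8}, chunksize::Int)
    return [transcode(ZlibCompressor, a[i:min(i + chunksize - 1, end)])
            for i in 1:chunksize:length(a)]
end

chunks = compress_chunks(rand(UInt8, 10_000_000), 2^20)  # 1 MiB chunks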
I think I solved the issue by modifying CodecZlib.jl; see this pull request. I can now jldsave and load arrays larger than typemax(UInt32) bytes.
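For anyone wanting to verify this locally (a sketch, assuming the patched CodecZlib.jl is installed), the failing example from above should now round-trip:

julia> jldsave("test.jld2", true; a=zeros(Float64, 1000, 1000000))

julia> size(load("test.jld2", "a"))
(1000, 1000000)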
I just encountered this issue on v0.4.29.
This may be related: I'm using CodecBzip2.jl.
Can you test this with CodecBzip2.jl directly?
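Something like the following should exercise the codec on its own (a sketch; the buffer would need to be grown toward the failing size):

using CodecBzip2

data = rand(UInt8, 2^20)                       # increase toward the failing size
compressed = transcode(Bzip2Compressor, data)  # bypasses JLD2 entirely
@assert transcode(Bzip2Decompressor, compressed) == data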
Actually, it might be different - the stack trace is quite different. I'll just open a new issue and go from there.
I tried to save some large data and it failed, so I made this MWE. The persistence fails in both cases, with and without compression: