JuliaIO / BSON.jl

loading large dataset is very slow (compared to JLD2) #2

Open tpapp opened 6 years ago

tpapp commented 6 years ago

I tried BSON.jl with a large dataset. For comparison, JLD2:

julia> @time load("/tmp/test.jld2")
  5.885191 seconds (21.85 M allocations: 2.056 GiB, 60.13% gc time)
Dict{String,Any} with 2 entries:
  "levels"      => 2247195×6 DataFrames.DataFrame…
  "transitions" => 3775512×7 DataFrames.DataFrame…

whereas BSON.load("/tmp/test.bson") never terminates (I interrupted it after 5 minutes). The file itself is around 600M (slightly smaller than the JLD2 file, which is around 700M). I can make an MWE if the issue is not already known.

MikeInnes commented 6 years ago

So is it very slow, or non-terminating? An MWE would definitely be useful.

I didn't particularly write the BSON code to be fast, so there's probably a lot of low-hanging fruit for performance. Loading large isbits arrays should certainly have similar performance with any serialiser.
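To illustrate that last point, here is a rough sketch using only Base I/O. It is not BSON.jl's or JLD2's actual code path, and the scratch file and array size are made up for the example. An array of an isbits element type is one contiguous block of bytes, so in principle loading it amounts to a single bulk read into preallocated memory:

```julia
# Sketch only: plain Base I/O, not BSON.jl's or JLD2's implementation.
data = rand(Float64, 1_000_000)    # Float64 is an isbits type
@assert isbitstype(eltype(data))

path = tempname()                  # hypothetical scratch file
write(path, data)                  # writes the array's raw bytes

buf = Vector{Float64}(undef, length(data))
read!(path, buf)                   # one bulk read into the array's memory
@assert buf == data

rm(path)
```

Any serialiser that ends up doing something close to this for large numeric arrays should be limited mostly by I/O rather than by per-element work.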

richiejp commented 5 years ago

I have pretty much the same issue: I can't even load my data set (246M in BSON, 319M in JLD2) because it crashes (with no error; I assume it hits some limit). I am surprised that the BSON file is smaller, considering that you store a type description with each instance of a struct (and my dataset includes a lot of nested structs), but then I have no idea what JLD2 is doing.

Anyway, I made a PR (#23) with a benchmark and a 3-4x speed improvement so far, probably entirely from improving array processing.

richiejp commented 5 years ago

The data set now loads in about 30 seconds. Done independently, parsing takes ~10 seconds and raising to my types takes ~15 seconds.

It turns out the crash was happening during raising and is possibly a bug in Julia:

julia> BSON.raise_recursive(res)

signal (11): Segmentation fault
in expression starting at no file:0
has_free_typevars at /buildworker/worker/package_linux64/build/src/jltypes.c:151
has_free_typevars at /buildworker/worker/package_linux64/build/src/jltypes.c:155 [inlined]
jl_has_free_typevars at /buildworker/worker/package_linux64/build/src/jltypes.c:180
inst_type_w_ at /buildworker/worker/package_linux64/build/src/jltypes.c:1504
jl_instantiate_unionall at /buildworker/worker/package_linux64/build/src/jltypes.c:940
arg_type_tuple at /buildworker/worker/package_linux64/build/src/gf.c:1648
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2151
jl_apply at /buildworker/worker/package_linux64/build/src/julia.h:1537 [inlined]
jl_f__apply at /buildworker/worker/package_linux64/build/src/builtins.c:556
newstruct at /home/rich/.julia/dev/BSON/src/extensions.jl:103
jl_fptr_trampoline at /buildworker/worker/package_linux64/build/src/gf.c:1831
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2184
jl_apply at /buildworker/worker/package_linux64/build/src/julia.h:1537 [inlined]
jl_f__apply at /buildworker/worker/package_linux64/build/src/builtins.c:556
#43 at /home/rich/.julia/dev/BSON/src/extensions.jl:115
jl_fptr_trampoline at /buildworker/worker/package_linux64/build/src/gf.c:1831
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2184
_raise_recursive at /home/rich/.julia/dev/BSON/src/read.jl:79
#45 at /home/rich/.julia/dev/BSON/src/extensions.jl:124
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2184
raise_recursive at /home/rich/.julia/dev/BSON/src/read.jl:88
...

The crash happens when newstruct tries to call newstruct! on extensions.jl:102 (the line number differs in the stack trace because I put initstruct on a separate line). Copying the contents of newstruct! inline avoids the crash.