JuliaIO / BSON.jl

Other
158 stars 39 forks source link

improve performance #38

Open SimonDanisch opened 5 years ago

SimonDanisch commented 5 years ago

After benchmarking BSON to figure out if it's a good replacement for JSON, I noticed that its by far the slowest. This PR fixes the performance partly:

@time BSON.bson("test.bson_master", Dict(:a=>x)) # 3.2s
@time open(io-> JSON.print(io, x), "test.json", "w") # 1.6s
@time open(io-> JSON2.write(io, x), "test.json2", "w") # 1.1s
@time BSON.bson("test.bson_this_pr", Dict(:a=>x)) # 0.26s
@time open(io-> CBOR.encode(io, x), "test.cbor", "w") # 0.16

It's still slower than CBOR, which also isn't doing the fastest it could - so I guess there are still some other low hanging fruits in here ;)

richiejp commented 5 years ago

This appears to give ~10% on my benchmarks, but it would be way higher for more numerical data I guess. Added it to my test branch, see #34

KristofferC commented 5 years ago

This doesn't seem safe in general? What if x goes out of scope before the unsafe_wrap Array.

KristofferC commented 5 years ago

I tried this on VGG19 from Metalhead and got

julia> @time Metalhead.weights("vgg19.bson")
  0.662478 seconds (1.65 M allocations: 631.681 MiB, 15.95% gc time)
Dict{Symbol,Any} with 38 entries:
Please submit a bug report with steps to reproduce this fault, and any error messages that follow (in their entirety). Thanks.
Exception: EXCEPTION_ACCESS_VIOLATION at 0x1f353f3d -- getindex at .\array.jl:730 [inlined]
getindex at .\subarray.jl:228 [inlined]
_show_nonempty at .\arrayshow.jl:386
in expression starting at no file:0
getindex at .\array.jl:730 [inlined]
getindex at .\subarray.jl:228 [inlined]
_show_nonempty at .\arrayshow.jl:386
#383 at .\arrayshow.jl:404
jl_apply_generic at /home/Administrator/buildbot/worker/package_win64/build/src\gf.c:2219
show_nd at .\arrayshow.jl:294
_show_nonempty at .\arrayshow.jl:403 [inlined]
show at .\arrayshow.jl:421
SimonDanisch commented 5 years ago

Guess it's unsafe indeed :P I was assuming, it's only used just before sending it to write, but it's a bit hard to tell where things go. Maybe it'd be better to implement a write that works with an reinterpret array efficiently.

KristofferC commented 5 years ago

Maybe it'd be better to implement a write that works with an reinterpret array efficiently.

Yeah, probably. https://github.com/MikeInnes/BSON.jl/pull/46 is at least a lot better than what we have now.

SimonDanisch commented 5 years ago

Yup, I was thinking to just submit something like #46, but then I got greedy :D

SimonDanisch commented 5 years ago

I'd say just merge #46 and leave this unmerged until we find something safer ;)

KristofferC commented 5 years ago

Interestingly, just defining

function reinterpret_(::Type{T}, x) where T
  reinterpret(T, x)
end

I can load VGG19 pretty damn fast

julia> @time Metalhead.weights("vgg19.bson")
  0.269944 seconds (1.97 k allocations: 548.196 MiB, 37.98% gc time)
Dict{Symbol,Any} with 38 entries:

but it crashes when testing BSON itself.

richiejp commented 5 years ago

@KristofferC probably because reinterpret returns ReinterpretArray which wraps the original array and lazily casts its contents when accessed.