JuliaIO / JLD.jl

Saving and loading julia variables while preserving native types

load of Array{DateTime,1} getting slower inside loop #64

Open abieler opened 8 years ago

abieler commented 8 years ago

jldTimings.zip

When loading an array of DateTimes with the load() function, the loading time grows as the loop progresses. The same does not happen when loading an array of floats.

Attached are a Julia script and two data files to reproduce this behavior. Run the script with julia timeJLD.jl N, where N is the number of iterations.

myDates.jld contains the array of DateTimes, stored as "date".
myArray.jld contains the array of floats, stored as "yy".

I ran with N = 5k to 10k.

using HDF5
using JLD
using PyPlot

function myLoop(N, timings)
    for i = 1:N
      timings[i] = @elapsed tt = load(fileName, "date")
      #timings[i] = @elapsed tt = load(fileName, "yy")
    end
end

N = parse(Int, ARGS[1]) 
fileName = "myDates.jld"
#fileName = "myArray.jld"

timings = Array(Float64, N)

myLoop(N, timings)

figure()
semilogy(timings)
show()

In real life I load the content from different files, of course...

Cheers, Andre

abieler commented 8 years ago

I forgot to mention that I am on Linux and v0.4.5.

Skylion007 commented 8 years ago

Sounds like this is the issue as well. With how long it's been around, it seems they have marked it as "Do not fix" in JLD. Quite a shame, really. Sounds like you'll have to figure out how to use the HDF5 format instead as well.

abieler commented 8 years ago

I now convert my dates to Unix time and save them as h5, then load them and convert them back to dates with Dates.unix2datetime().

I attached timings for two versions of loading the data: the first with h5read(), the second by opening the file for reading with fid = h5open() and then loading the data with read(fid, ...).

Not surprisingly, the last version is the fastest. For the first 1 k loops the timing differences seem almost constant, but after ~10 k iterations the jld version is about 2 orders of magnitude slower. If I get to it, I'll do some profiling.

newTimings.zip
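
For readers who want to try the workaround described above, here is a minimal sketch of both read variants. It is not the script from the attachment. It assumes a recent Julia 1.x with HDF5.jl installed (on the v0.4 mentioned in this thread, Dates lives in Base and the dot-broadcast syntax is not available), and the file name, dataset name, and sample dates are made up for illustration.

```julia
using Dates
using HDF5

# Hypothetical file and dataset names, chosen for illustration only.
path = joinpath(mktempdir(), "myDates.h5")
dates = [DateTime(2016, 1, 1) + Dates.Hour(i) for i in 0:9]

# Store the dates as Float64 seconds since the Unix epoch.
h5write(path, "date", Dates.datetime2unix.(dates))

# Version 1: h5read opens and closes the file on every call.
loaded1 = Dates.unix2datetime.(h5read(path, "date"))

# Version 2: open the file once, read from the handle, then close it
# (the do-block closes the file when it finishes).
loaded2 = h5open(path, "r") do fid
    Dates.unix2datetime.(read(fid, "date"))
end
```

Both variants should return the original DateTime values here, since the sample dates fall on whole seconds. Sub-millisecond information cannot be represented by DateTime in any case.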

abieler commented 8 years ago

So most of the time is spent in h5f_get_obj_ids() in HDF5/src/plain.jl at lines 2182 and 2186, which are ccalls to (:H5Fget_obj_count, libhdf5) and (:H5Fget_obj_ids, libhdf5), respectively.

So I'm not sure anything can be done about this.

cheers andre

timholy commented 8 years ago

Bless you, @abieler, for digging into this! So it's definitely the C library, not any of the julia code.

Try the trick in the last post of that issue, https://github.com/JuliaLang/HDF5.jl/issues/170#issuecomment-209399736?

abieler commented 8 years ago

Not sure it is the same problem. This one is about loading content from a small file many times; the other is about creating a file with lots of entries. I'll try it anyway, of course ;)

timholy commented 8 years ago

Oh, I see (I didn't read carefully enough). You might consider using the "dictionary interface," https://github.com/JuliaLang/JLD.jl/blob/master/doc/jld.md#usage, so it doesn't waste time opening/closing the file frequently.
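
A rough sketch of what that suggestion could look like, assuming a recent Julia 1.x with JLD.jl installed: the file is opened once with jldopen and the variable is read repeatedly from the open handle. The file path, sample dates, and loop count are made up for illustration, and whether this actually avoids the slowdown is not confirmed anywhere in this thread.

```julia
using Dates
using JLD

# Hypothetical test file; the real myDates.jld from the report is not used here.
path = joinpath(mktempdir(), "myDates.jld")
dates = [DateTime(2016, 1, 1) + Dates.Hour(i) for i in 0:9]
save(path, "date", dates)

N = 100                 # number of repeated reads (arbitrary)
timings = zeros(N)

# Open once, read many times; the do-block closes the file at the end.
last_read = jldopen(path, "r") do file
    tt = read(file, "date")
    for i in 1:N
        timings[i] = @elapsed tt = read(file, "date")
    end
    tt
end
```

Plotting timings the same way as in the original script would show whether the per-read cost still grows with the iteration count.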

JeffBezanson commented 8 years ago

Also appears similar to https://github.com/JuliaLang/julia/issues/17554