JuliaIO / HDF5.jl

Save and load data in the HDF5 file format from Julia
https://juliaio.github.io/HDF5.jl
MIT License

Enormous file sizes #30

Closed cmmp closed 9 years ago

cmmp commented 10 years ago

Hi,

I have text files, typically around 100 KB each, containing data such as:

NA NA NA
NA NA NA
-4.11554869953487 NA NA
-4.49517142619306 NA NA
-4.62434879575859 NA NA
-4.85365577849306 NA NA
-4.83319566688069 NA NA
-4.62021998272287 NA NA
-4.38650861894108 NA NA
-4.33796653562191 NA NA
...

using the code

using HDF5
using JLD

x = readdlm("1.dat", ' ')

file = jldopen("teste.jld", "w")
@write file x
close(file)

to write a 117 KB file of that form to teste.jld generates a 13 MB file. Even if compression is not being used, I don't understand the size difference. I have to process 270 files of this kind, which ended up generating a 3.5 GB file.

Am I doing something wrong? If it helps, I can e-mail a sample file for testing.

I'm on OS X using julia master/9c392b7*, HDF5 f27612 and hdf5 installed from homebrew:

Cassios-iMac:~ cassio$ brew info hdf5
hdf5: stable 1.8.11
http://www.hdfgroup.org/HDF5
/usr/local/Cellar/hdf5/1.8.11 (119 files, 9.8M) *
  Built from source
From: https://github.com/mxcl/homebrew/commits/master/Library/Formula/hdf5.rb
==> Dependencies
Required: szip
==> Options
--enable-cxx
    Compile C++ bindings
--enable-fortran
    Compile Fortran bindings
--enable-fortran2003
    Compile Fortran 2003 bindings. Requires enable-fortran.
--enable-parallel
    Compile parallel bindings
--enable-threadsafe
    Trade performance and C++ or Fortran support for thread safety
--universal
    Build a universal binary

Thanks, Cássio

timholy commented 10 years ago

Can you convert those NAs into NaNs? Then you'd be storing a plain array, which should be very efficient.

The other option would be to write a specific method for encoding those NAs. This hasn't come up before, so it's not available yet.
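A minimal sketch of the NaN conversion suggested above, assuming a current Julia (1.x) rather than the version used in this thread, and assuming the array holds Float64 values mixed with "NA" strings, as readdlm would produce for this data. The small literal matrix is a hypothetical stand-in for the contents of 1.dat.

```julia
# Hypothetical stand-in for the Array{Any} that readdlm returns for mixed data.
x = Any[-4.11554869953487 "NA" "NA"; "NA" "NA" "NA"]

# Keep numeric entries, replace everything else with NaN.
# The typed comprehension yields a concrete Matrix{Float64} of the same shape.
y = Float64[v isa Number ? v : NaN for v in x]
```

Note that this loses the distinction between NA and NaN, so it only fits data where that distinction does not matter.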

cmmp commented 10 years ago

Hi Tim, converting to NaN does work around the problem, thanks.

But I think handling the NAs would be a nice feature.

timholy commented 10 years ago

I just tested your example, and now I see the issue: the result of readdlm is an Array{Any}. That means each entry of the array has to be stored as a separate reference variable in JLD. I don't think it's reasonable for JLD to look for this as a special case and do something different, because (1) doing so would slow the performance of every other operation, and (2) what if someone wanted to store an array of strings that happened to include "NA"?

The only answer here is to require that the user put it in some more-specific type before writing it. Instead of converting to Float64 and using NaN (I can imagine you might want to distinguish NA from NaN sometimes?), the better answer would be to use a DataArray from DataFrames. I suspect those will already be stored pretty efficiently, but let me know if that's not the case.
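As an illustration of "more-specific type", here is a small sketch, assuming a current Julia (1.x), of how one might check an array's element type before writing it. The helper name is_well_typed is hypothetical and is not part of JLD or HDF5.jl.

```julia
# Hypothetical helper (not part of JLD): true when an array's element type is
# concrete, e.g. Float64, rather than an abstract type such as Any.
is_well_typed(a::AbstractArray) = isconcretetype(eltype(a))

mixed = Any[-4.1 "NA"; -4.4 "NA"]   # stand-in for what readdlm returned here
plain = [-4.1 NaN; -4.4 NaN]        # same shape, but a Matrix{Float64}

is_well_typed(mixed)  # false
is_well_typed(plain)  # true
```

A check like this could be run before jldopen to catch an Array{Any} early.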

cmmp commented 10 years ago

Hi Tim, thanks for the explanation, now I get what was going on. I tried this route because of issue #29 with DataFrames.

I don't think NA should be a special case, but I think something should be done about Array{Any}: at the very least, a warning in the code that file sizes may be large, or a note in the manual.

NA definitely does not correspond to NaN; I see the conversion only as a workaround.

Replacing the code with:

using DataFrames
x = readtable("1.dat", separator = ' ', header = false)

does the trick, and the file size becomes 79 KB.

I think that, in the future, some compression scheme could be used on Array{Any}. Simply generating binary files 10x bigger than text files just seems unreasonable.

Thanks again, Cássio

cmmp commented 10 years ago

Actually, 100x bigger in my example.

simonster commented 10 years ago

The JLD file contains type information for each element of the array that isn't in your text file, so it shouldn't be surprising that it's larger, although 100X is clearly undesirable. #27 will (hopefully) make the way we store type information more efficient, which may help with this. Generally you want to avoid using an Array{Any} anyway, since computations on it will be many times slower than they'd be on an Array{Float64} or a DataFrame.

timholy commented 9 years ago

The situation should have improved, although JLD will always be better with well-typed arrays. Closing.