Closed cmmp closed 9 years ago
Can you convert those NA's into NaN's? Then you'd be storing as a plain array, which should be very efficient.
The other option would be to write a specific method for encoding those NAs. This hasn't come up before, so it's not available yet.
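For reference, a minimal sketch of the NaN conversion suggested above. It assumes the missing entries arrive as the string "NA" inside an Array{Any}, as with readdlm. The helper name na_to_nan is hypothetical and is not part of JLD or HDF5.

```julia
# Hypothetical helper (not a JLD/HDF5 function): build a plain Array{Float64}
# from a loosely typed array, using NaN for every non-numeric entry such as "NA".
na_to_nan(A) = Float64[isa(v, Number) ? Float64(v) : NaN for v in A]

raw = Any[1.5, "NA", 3, 4.0]   # mixed contents, element type Any
clean = na_to_nan(raw)         # element type Float64, NaN where "NA" was
```

The result is a homogeneous numeric array, so it can be written as a single dataset rather than element by element. Note that this loses the distinction between NA and a genuine NaN.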
Hi Tim, converting to NaN does work around the problem, thanks.
But I think handling the NAs would be a nice feature.
I just tested your example, and now I see the issue: the result of readdlm is an Array{Any}. That means each entry of the array has to be stored as a separate reference variable in JLD. I don't think it's reasonable for JLD to look for this as a special case and do something different, because (1) doing so would slow the performance of every other operation, and (2) what if someone wanted to store an array of strings that happened to include "NA"?
The only answer here is to require that the user put it in some more-specific type before writing it. Instead of converting to Float64 and using NaN (I can imagine you might want to distinguish NA from NaN sometimes?), the better answer would be to use a DataArray from DataFrames. I suspect those will already be stored pretty efficiently, but let me know if that's not the case.
Hi Tim, thanks for the explanation, now I get what was going on. I tried this route because of issue #29 with DataFrames.
I don't think NA should be a special case, but I think something should be done about Array{Any}: at least a runtime warning that file sizes may be large, or a note in the manual.
NA definitely does not correspond to NaN; I see the conversion just as a workaround.
Replacing the code with:
using DataFrames
x = readtable("1.dat", separator = ' ', header = false)
does the trick and the file size becomes 79K.
I think that, in the future, some compression scheme could be used on Array{Any}. Simply generating binary files 10x bigger than text files just seems unreasonable.
Thanks again, Cássio
Actually, 100x bigger with my example.
The JLD file contains type information for each element of the array that isn't in your text file, so it shouldn't be surprising that it's larger, although 100X is clearly undesirable. #27 will (hopefully) make the way we store type information more efficient, which may help with this. Generally you want to avoid using an Array{Any} anyway, since computations on it will be many times slower than they'd be on an Array{Float64} or a DataFrame.
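A quick way to tell whether an array falls into this case before saving is to inspect its element type. The sketch below uses only Base Julia; the variable names are illustrative, and it assumes every entry is numeric (the conversion throws an error otherwise).

```julia
# An array whose declared element type is Any, even though all entries are numbers.
raw = Any[1.0, 2.0, 3.0]

# Convert to a concretely typed vector; this fails if any entry is not numeric.
typed = convert(Vector{Float64}, raw)

needs_conversion = eltype(raw) == Any      # true: each element is stored as a boxed reference
is_well_typed = eltype(typed) == Float64   # true: contiguous block of Float64 values
```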
The situation should have improved, although JLD will always be better with well-typed arrays. Closing.
Hi,
I have text files of generally 100k that have data such as:
using the code
to write a 117K file of that form to teste.jld generates a 13 MB file... Even if compression is not being used, I don't understand the size difference. I have to process 270 files of this kind, which ended up generating a 3.5 GB file. Am I doing something wrong? If it helps, I can e-mail a sample file for testing.
I'm on OS X using julia master/9c392b7*, HDF5 f27612 and hdf5 installed from homebrew:
Thanks, Cássio