`scipy.sparse.save_npz()` might compress the file. It has a parameter `compressed`, and its default value is `True`. You can try turning off the compression and see whether the files end up with similar sizes.
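For instance, a minimal sketch of both variants (the file names are made up for illustration):

```python
import scipy.sparse as ss

mat = ss.random(1, 2_000_000, density=0.015, format="csr")

# compressed=True (the default) writes a zip-compressed archive;
# compressed=False stores the underlying arrays uncompressed.
ss.save_npz("matrix_compressed.npz", mat)
ss.save_npz("matrix_uncompressed.npz", mat, compressed=False)

restored = ss.load_npz("matrix_uncompressed.npz")
```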
`h5py` also supports compression: https://docs.h5py.org/en/stable/high/dataset.html#filter-pipeline

`h5sparse.File.create_dataset()` will pass most arguments through to `h5py.File.create_dataset()`, so you can try this:

```python
f.create_dataset(..., compression="gzip")
```
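Put together, a minimal sketch of writing and reading one compressed matrix could look like this (the file and dataset names are just for illustration):

```python
import scipy.sparse as ss
import h5sparse

mat = ss.random(1, 2_000_000, density=0.015, format="csr")

with h5sparse.File("matrices.h5", "w") as f:
    # h5sparse forwards the compression keyword to h5py,
    # so the underlying data/indices arrays are gzip-compressed
    f.create_dataset("matrix_0", data=mat, compression="gzip")

with h5sparse.File("matrices.h5", "r") as f:
    restored = f["matrix_0"][:]  # returned as a scipy CSR matrix
```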
Hello,
Yes, you are perfectly right! The `.npz` files were indeed compressed by default. When I set the `compressed` flag to `False`, I get a similar size of ~104 MB.
I wasn't aware it was possible to use compression with `h5sparse`. The suggestion you proposed works: with gzip compression I get a size of ~42 MB, as expected. Using multiple datasets (one per matrix) creates a file that is slightly larger than using a single dataset (with `append()`): 44 MB instead of 42 MB. This sounds perfectly reasonable.
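For reference, the single-dataset variant looks roughly like this (names are illustrative; the `chunks`/`maxshape` arguments follow the pattern from the h5sparse README):

```python
import scipy.sparse as ss
import h5sparse

mats = [ss.random(1, 2_000_000, density=0.015, format="csr") for _ in range(3)]

with h5sparse.File("all_matrices.h5", "w") as f:
    # maxshape=(None,) makes the dataset growable so append() can extend it;
    # the gzip compression applies to every chunk that gets written
    f.create_dataset("matrices", data=mats[0], chunks=(10_000,),
                     maxshape=(None,), compression="gzip")
    for m in mats[1:]:
        f["matrices"].append(m)

with h5sparse.File("all_matrices.h5", "r") as f:
    first = f["matrices"][0:1]  # row-slice a single 1x2000000 matrix back out
```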
Thank you very much for your answer.
Hello,
I would like to use hdf5 to store a bunch of sparse matrices. `h5sparse` seems to provide everything I need, but I am surprised by the size on disk, which is around twice as large as what I expected.

In the code below, I create a list of 300 sparse matrices. Each of them is a sparse representation of a 1x2000000 array in which about 1.5% of the elements are non-zero. When I compare the cumulative size of the 300 sparse matrices saved independently (using `scipy.sparse.save_npz()`) with the size of the equivalent hdf5 file generated using `h5sparse`, the hdf5 file is around twice as big as the total of the `.npz` files.
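In outline, the setup and the `.npz` reference point look like this (a simplified sketch; names and exact values are illustrative):

```python
import scipy.sparse as ss

n_matrices = 300
n_cols = 2_000_000

# 300 sparse 1 x 2,000,000 matrices with ~1.5% non-zero entries
matrices = [ss.random(1, n_cols, density=0.015, format="csr")
            for _ in range(n_matrices)]

# Reference point: save each matrix to its own .npz file
for i, m in enumerate(matrices):
    ss.save_npz(f"matrix_{i:03d}.npz", m)
```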
I tried two approaches to store several sparse matrices with `h5sparse`: either create a new dataset for each matrix, or create a single dataset to which the matrices are appended (using `append()`, and then slicing to reload them). In all 3 cases (the separate `.npz` files and the two hdf5 variants) I can save and retrieve the data correctly. But when I compare the sizes on disk, the cumulative size of the `.npz` files (`du -ch *.npz`) is 42 M, while both hdf5 files are around twice as big.

Is there anything wrong with the way I use `h5sparse`? Is there anything I can do to make things better in terms of storage size?