appier / h5sparse

Scipy sparse matrix in HDF5
https://pypi.python.org/pypi/h5sparse
MIT License
45 stars 8 forks

Size on disk larger than expected #18

Closed: faresVS closed this issue 4 years ago

faresVS commented 4 years ago

Hello,

I would like to use hdf5 to store a bunch of sparse matrices. h5sparse seems to provide everything I need, but I am surprised by the size on disk, which is around twice as large as what I expected.

In the code below, I create a list of 300 sparse matrices. Each of them is the sparse representation of a 1x2000000 array in which about 1.5% of the elements are non-zero. When I compare the cumulative size of the 300 sparse matrices saved independently (using scipy.sparse.save_npz()) with the equivalent hdf5 file generated using h5sparse, the hdf5 file is around twice as big as the total of the .npz files. I tried two approaches to store several sparse matrices with h5sparse: either create a new dataset for each matrix, or create a single dataset and append the matrices to it (then use slicing to reload them). In both cases I can retrieve the data correctly, but in both cases the size on disk is around twice as big as expected.

import numpy as np
from scipy.sparse import csr_matrix, load_npz, save_npz
import h5sparse

def get_random_array_int(nb_elem=15, prob=0.1, dtype=np.uint16):
    array_full = np.random.randint(np.iinfo(dtype).min, np.iinfo(dtype).max, nb_elem)
    array_bin = (np.random.rand(nb_elem) < prob)
    return array_full * array_bin

array_nb_elem = 2000000
prob = 0.015

array_dense = get_random_array_int(nb_elem=array_nb_elem, prob=prob, dtype=np.uint16)
nb_non_zero_elem = (array_dense > 0).sum()
print(nb_non_zero_elem)
print(nb_non_zero_elem / array_nb_elem)  # as expected, we get something around 0.015

array_sparse = csr_matrix(array_dense)
assert (array_sparse != 0).sum() == nb_non_zero_elem
print(repr(array_sparse))
assert (array_dense == array_sparse.toarray()).all()

##########  generate a list of 300 arrays
list_dense_arrays = [get_random_array_int(nb_elem=array_nb_elem, prob=prob, dtype=np.uint16) for _ in range(300)]

####### use scipy.sparse save_npz/load_npz
# save the sparse arrays as separate npz files
for idx, dense_array in enumerate(list_dense_arrays):
    sparse_array = csr_matrix(dense_array)
    save_npz("dummy_{}.npz".format(idx), sparse_array)

# check that the array can be reloaded correctly
dense_array_reloaded_from_npz = load_npz("dummy_9.npz").toarray()
assert (dense_array_reloaded_from_npz == list_dense_arrays[9]).all()

####### use h5sparse with multiple datasets
# create the file
with h5sparse.File("multiple_datasets.h5", "w") as f:
    for idx, dense_array in enumerate(list_dense_arrays):
        sparse_array = csr_matrix(dense_array)
        f.create_dataset("my_dataset_{}".format(idx), data=sparse_array)

# check that the array can be reloaded correctly
with h5sparse.File("multiple_datasets.h5", "r") as f:
    dense_array_reloaded_from_h5_multiple_datasets = f["my_dataset_9"][()].toarray()
    assert (dense_array_reloaded_from_h5_multiple_datasets == list_dense_arrays[9]).all()

####### use h5sparse with a single dataset
# create the file
with h5sparse.File("single_dataset.h5", "w") as f:
    for idx, dense_array in enumerate(list_dense_arrays):
        sparse_array = csr_matrix(dense_array)
        if idx == 0:
            f.create_dataset("my_dataset", data=sparse_array, maxshape=(None,))
        else:
            f["my_dataset"].append(sparse_array)

# check that the array can be reloaded correctly
with h5sparse.File("single_dataset.h5", "r") as f:
    dense_array_reloaded_from_h5_single_dataset = f["my_dataset"][9:10].toarray()
    assert (dense_array_reloaded_from_h5_single_dataset == list_dense_arrays[9]).all()

In all 3 cases, I can save and retrieve the data correctly. But when I compare the sizes on disk, the hdf5 files are around twice as large as the combined size of the .npz files (roughly 104 MB versus roughly 42 MB).
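For reference, a minimal sketch of how the size comparison can be reproduced (it only assumes the files written by the script above are in the current directory):

import glob
import os

# total size of the 300 individual .npz files written above
npz_total = sum(os.path.getsize(p) for p in glob.glob("dummy_*.npz"))

# sizes of the two hdf5 files
multi_size = os.path.getsize("multiple_datasets.h5")
single_size = os.path.getsize("single_dataset.h5")

print("npz total:            {:.1f} MB".format(npz_total / 1e6))
print("multiple_datasets.h5: {:.1f} MB".format(multi_size / 1e6))
print("single_dataset.h5:    {:.1f} MB".format(single_size / 1e6))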

Is there anything wrong with the way I use h5sparse? Is there anything I can do to make things better in terms of storage size?

ianlini commented 4 years ago

scipy.sparse.save_npz() compresses the file by default: it has a parameter compressed whose default value is True. You can try turning off the compression and see whether the sizes become similar. h5py also supports compression (https://docs.h5py.org/en/stable/high/dataset.html#filter-pipeline), and h5sparse.File.create_dataset() passes most arguments through to h5py.File.create_dataset(), so you can try this:

f.create_dataset(..., compression="gzip")
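Applied to the multiple-dataset example above, a minimal sketch could look like the following (reusing list_dense_arrays from the original script, and assuming compression and compression_opts are forwarded to h5py as described; compression_opts sets the gzip level):

import h5sparse
from scipy.sparse import csr_matrix

with h5sparse.File("multiple_datasets_gzip.h5", "w") as f:
    for idx, dense_array in enumerate(list_dense_arrays):
        sparse_array = csr_matrix(dense_array)
        # compression-related keyword arguments are forwarded to h5py's create_dataset()
        f.create_dataset("my_dataset_{}".format(idx), data=sparse_array,
                         compression="gzip", compression_opts=4)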
faresVS commented 4 years ago

Hello,

Yes, you are perfectly right! The npz files were indeed compressed by default. When I set the compressed flag to False, I get a similar size of ~104 MB. I wasn't aware it was possible to use compression with h5sparse. The suggestion you proposed works: with gzip compression I get a size of around ~42 MB, as expected. Using multiple datasets (one per matrix) creates a file which is a little larger than using a single dataset (with append()): 44 MB instead of 42 MB. This sounds perfectly reasonable.
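For completeness, a minimal sketch of the uncompressed-npz check (the file name is hypothetical; compressed=False is the only change compared to the original loop):

from scipy.sparse import csr_matrix, save_npz

for idx, dense_array in enumerate(list_dense_arrays):
    sparse_array = csr_matrix(dense_array)
    # compressed=False writes a plain (uncompressed) .npz archive
    save_npz("dummy_uncompressed_{}.npz".format(idx), sparse_array, compressed=False)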

Thank you very much for your answer.