fstpackage / fst

Lightning Fast Serialization of Data Frames for R
http://www.fstpackage.org/fst/
GNU Affero General Public License v3.0

write.fst can write multidimensional matrices to file #19

Open MarcusKlik opened 7 years ago

MarcusKlik commented 7 years ago

And provide fast compression with random access to the matrix. Check if there is a use case for such a feature.

PeteHaitch commented 7 years ago

+1 to this feature request. It would be interesting to see how it compares to the HDF5 format, a popular choice for on-disk storage of multidimensional arrays. FYI, the rhdf5 Bioconductor package is a low-level package for accessing the HDF5 API, while the HDF5Array Bioconductor package is a higher-level package that abstracts away the need to access the HDF5 API directly.

Back when fst was first released, I experimented with creating a fstarray package (an experiment I abandoned). I think an important feature is to allow efficient, arbitrary subsetting of these "fstarrays"; i.e., a version of read.fst() that allows reading multiple, possibly non-contiguous chunks of a .fst file.
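For context, a minimal sketch of that kind of multi-chunk read on top of fst's current `from`/`to` arguments (the file name and the `read_chunks` helper are hypothetical; today each non-contiguous chunk needs its own call):

```r
library(fst)

# Hypothetical helper: read several (possibly non-contiguous) row ranges from one .fst file
read_chunks <- function(path, ranges, columns = NULL) {
  # 'ranges' is a list of c(from, to) pairs, e.g. list(c(1, 1000), c(50001, 51000))
  chunks <- lapply(ranges, function(r) {
    read_fst(path, columns = columns, from = r[1], to = r[2])
  })
  do.call(rbind, chunks)
}

# Example usage against a previously written file:
# write_fst(as.data.frame(matrix(runif(1e6), ncol = 10)), "mat.fst")
# sub <- read_chunks("mat.fst", list(c(1, 1000), c(50001, 51000)))
```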

MarcusKlik commented 7 years ago

Hi @PeteHaitch , thanks for the feature request, and interesting to see that you experimented with a fstarray package to address the storage of matrices! Indeed, random access on structures with many columns is not very efficient for a columnar file format (like fst or feather). I've been thinking about this a bit, and it seems to me that we can solve that problem by storing the data not in columns but in blocks. Let's say, for example, that we need to store a 100000 x 100000 matrix (1e10 elements in total). If we want to read a selection mat[y:(y+1000), x:(x+9999)] using the current format, we need to perform 10000 seek operations to read these 10000 columns (so 1 seek per column). That would incur a large performance hit on an SSD drive and would probably be a disaster with conventional drives. However, if the data were stored in blocks of 4096 x 4096 elements each, the same subset would require only 3 seek operations! To make that happen, we need code that can subset a matrix given these blocks, so a matrix should be registered in the format as a separate type (separate from a table), but we can still use the same compression algorithms and serialization code that are used for tables.
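A quick back-of-the-envelope version of that seek count, assuming one seek per selected column in the current layout versus one seek per 4096 x 4096 block touched (the block size and alignment are assumptions for illustration, not fst internals):

```r
# Selection from the example above: mat[y:(y+1000), x:(x+9999)] out of a 1e5 x 1e5 matrix
sel_rows <- 1001
sel_cols <- 10000

# Column-oriented layout: roughly one seek per selected column
seeks_columnar <- sel_cols                     # 10000

# Block layout with 4096 x 4096 blocks: one seek per block the selection touches
block_dim      <- 4096
blocks_touched <- ceiling(sel_rows / block_dim) * ceiling(sel_cols / block_dim)
blocks_touched                                 # 1 * 3 = 3, assuming the 1001 rows don't straddle a block boundary
```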

Interesting idea to compare this feature to the HDF5 format once it is implemented; I will check out the packages you suggest for that!

MarcusKlik commented 7 years ago

Hi @PeteHaitch , you say multidimensional arrays; would it be interesting to allow storage of 3-dimensional (or higher) arrays? For that, I would need to write data cubes or higher-dimensional data blobs.

PeteHaitch commented 7 years ago

Thanks for your considered reply, Marcus. My initial use case was for long matrices (nrow = 10^6 - 10^9 >> ncol = 10^2 - 10^3) and accessing all columns in row-wise chunks (e.g., rows 1 - 10000, 10001 - 20000, etc.). But I definitely see your point about the number of seek operations scaling with the number of columns.
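That access pattern maps onto fst's existing `from`/`to` arguments roughly like this (a sketch only; the file name, chunk size, and the nrOfRows metadata field are assumptions):

```r
library(fst)

path       <- "long.fst"                       # long matrix stored as a data frame
n_rows     <- metadata_fst(path)$nrOfRows      # total row count from the file's metadata
chunk_size <- 10000

for (from in seq(1, n_rows, by = chunk_size)) {
  to    <- min(from + chunk_size - 1, n_rows)
  chunk <- read_fst(path, from = from, to = to)  # all columns, rows from:to
  # ... process 'chunk' ...
}
```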

The HDF5 format is quite mature, I believe, and a fair bit of effort has gone into chunking blocks of data (https://support.hdfgroup.org/HDF5/doc/Advanced/Chunking/index.html).
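For comparison, a minimal rhdf5 sketch of that chunked layout for the long-matrix case (assuming the Bioconductor rhdf5 package; the file name, dimensions, and chunk sizes are illustrative only):

```r
library(rhdf5)

h5createFile("long_matrix.h5")

# Chunk row-wise so that a row-range read across all columns touches few chunks
h5createDataset("long_matrix.h5", "mat",
                dims = c(1e6, 100), storage.mode = "double",
                chunk = c(10000, 100), level = 6)

# Write the first row chunk, then read it back as a row-range selection
h5write(matrix(runif(10000 * 100), ncol = 100),
        "long_matrix.h5", "mat", index = list(1:10000, 1:100))
block <- h5read("long_matrix.h5", "mat", index = list(1:10000, NULL))

H5close()
```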

I would expect that the vast majority of usage would be for 2-dimensional arrays. I have a use case for 3-dimensional arrays but my guess is that the usage of higher dimensional arrays would decay rapidly with increasing dimensionality.

MarcusKlik commented 7 years ago

Nicely put :-), and the 'seek argument' is also explained well in the link you provided. With an increasing number of dimensions, the number of seeks again grows rapidly. As you say, it's probably best to start with a maximum of 3 dimensions (and work from there later if necessary). Also, for consistency, users should be able to append rows and columns to the stored matrix, and for arrays with more than 3 dimensions that becomes considerably more complicated. Thanks again for the info and the feature request!