I would like to bring up the discussion of chunking again, as something to be addressed in v2.
I see two main strategies here:
Chunk the storage-level arrays
Chunk the sparse arrays
I think chunking the sparse arrays themselves is the way to go, as it enables more flexibility. I think this is also a pretty safe option, as others have gone for it in the past (see: "prior art" at bottom).
I'm going to go through what the current set-up looks like and how these two approaches could work. I haven't thought a ton about how these approaches work with n-dimensional or hypersparse formats.
@eriknw and @jim22k, I would be interested to hear your perspectives here, especially from your experience with metagraph and dask-graphblas.
Above is a representation of the different chunking strategies. Each strategy depicts the storage arrays for a CSR or CSC matrix. Color denotes a "logical chunk": a range of values along the compressed dimension, e.g. a contiguous set of rows for CSR. The solid line illustrates how the arrays are subdivided into chunks.
Fixed-size chunks
First is fixed-size chunks in storage, which is basically what is being done right now. Here, chunk sizes are fixed for each storage array. This means that "logical chunks" have no set correspondence with storage chunks.
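To make this concrete, here is a minimal sketch of writing a CSR matrix with fixed-size storage chunks, assuming a zarr-python 2.x style API; the group layout and attribute names are illustrative, not part of any spec:

```python
# Minimal sketch of the current fixed-size-chunk layout (zarr-python 2.x
# style API). Group layout and attribute names are illustrative only.
import numpy as np
import scipy.sparse as sparse
import zarr

X = sparse.random(100_000, 2_000, density=0.01, format="csr", dtype=np.float32)

g = zarr.group()  # in-memory store, purely for illustration
g.attrs["shape"] = list(X.shape)

# Each storage array gets its own fixed chunk size, chosen independently
# of the row structure, so a "logical chunk" (range of rows) can start and
# end anywhere inside a storage chunk.
g.create_dataset("indptr", data=X.indptr, chunks=(50_000,))
g.create_dataset("indices", data=X.indices, chunks=(500_000,))
g.create_dataset("data", data=X.data, chunks=(500_000,))
```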
Advantages
It already exists, no further work needed
Disadvantages
For any selection of rows, I need to read the pointer array before being able to know which chunks of the indices and data arrays I should retrieve (see the sketch after this list).
For any selection of rows, it is very unlikely the storage chunk boundaries will coincide with logical chunk boundaries.
For distributed processing, this means each node will almost always be retrieving more data than it needs. How much extra data is tunable with chunk size, though this tuning is not straightforward.
For single node iterative processing, this significantly complicates the code. In practice I have needed to manage my own chunk cache. Knowing that each row sits within one chunk would remove that need.
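A rough sketch of the resulting read pattern, assuming the illustrative layout above; note that the slice of indices/data will generally straddle storage chunk boundaries:

```python
import scipy.sparse as sparse

def read_rows(g, start, stop):
    """Read rows [start, stop) from the illustrative CSR group above."""
    # The pointer array has to be read first; without it there is no way
    # to know which chunks of the indices and data arrays are needed.
    indptr = g["indptr"][start : stop + 1]
    lo, hi = int(indptr[0]), int(indptr[-1])

    # This slice will almost never line up with storage chunk boundaries,
    # so whole chunks at either end are fetched and mostly discarded (or
    # have to be cached by the caller).
    indices = g["indices"][lo:hi]
    data = g["data"][lo:hi]
    return sparse.csr_matrix(
        (data, indices, indptr - lo),
        shape=(stop - start, g.attrs["shape"][1]),
    )
```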
Logical chunking of storage arrays
One solution here is to make the storage chunk boundaries match up with the logical boundaries. A major downside is that this requires the storage backend to allow variable sized chunks. This exists for arrow, and for hdf5 via virtual datasets (example). I would like this ability to exist in zarr v3 (see: ZEP 3), but it doesn't yet.
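For hdf5, here is a rough sketch of how virtual datasets could present variable-sized pieces as a single storage array; the file names, dataset names, and chunk lengths are hypothetical:

```python
# Hypothetical sketch: stitch per-chunk "indices" datasets (one per logical
# chunk, with varying lengths) into a single virtual dataset.
import h5py

chunk_lengths = [1_200, 950, 1_430]  # illustrative nnz per logical chunk
total = sum(chunk_lengths)

layout = h5py.VirtualLayout(shape=(total,), dtype="int64")
offset = 0
for i, n in enumerate(chunk_lengths):
    layout[offset : offset + n] = h5py.VirtualSource(
        f"chunk_{i}.h5", "indices", shape=(n,)
    )
    offset += n

with h5py.File("X.h5", "w") as f:
    # The virtual "indices" array behaves like one array whose pieces are
    # exactly the logical chunks, i.e. variable-sized storage chunks.
    f.create_virtual_dataset("indices", layout)
```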
Advantages
Very similar to fixed-size chunk storage; a lot of code that works on one representation should "just work" on the other
This could be kept orthogonal to this specification, though in practice I think you'd want to tell people whether they have simple access to their chunked data.
Disadvantages
Limited by the chunking model of the underlying store. Possible in HDF5 via virtual datasets, not yet possible in zarr.
Keeps the limitations of existing formats (e.g. still optimized for "entire row" or "entire column" selection)
Logical chunking of sparse arrays
The third solution is to copy the chunking approach used for dense arrays. That is, we do not rely on the array store's native chunking mechanisms, and instead manage chunks of sparse arrays ourselves.
In this model, a chunked sparse array is a collection of other sparse arrays which would need to be concatenated to recreate an in-memory, un-chunked representation.
This solution isn't necessarily mutually exclusive with "Logical chunking of storage arrays", and the two could complement each other if each sparse array chunk is large enough to have storage array level chunking.
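As a sketch of what reading such a representation could look like, assuming zarr subgroups named "0", "1", ... that each hold an ordinary CSR group (using the illustrative layout from above), chunked along the row dimension:

```python
import scipy.sparse as sparse

def read_chunked_csr(g):
    """Reassemble a hypothetical chunked sparse array: a zarr group whose
    subgroups ("0", "1", ...) each hold an ordinary CSR matrix covering a
    contiguous range of rows."""
    parts = []
    for key in sorted(g.group_keys(), key=int):
        sub = g[key]
        parts.append(
            sparse.csr_matrix(
                (sub["data"][:], sub["indices"][:], sub["indptr"][:]),
                shape=tuple(sub.attrs["shape"]),
            )
        )
    # The concatenation step is what makes conversion to an in-memory
    # array more expensive than with a single set of storage arrays.
    return sparse.vstack(parts, format="csr")
```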
Advantages
"Free" stacking. I can create a group with appropriate metadata, then symlink (or kerchunk) in the appropriate sparse array groups, and now I have a chunked sparse array, created without needing to duplicate or read from arrays.
Can represent index spaces larger than index dtype.
Not limited to optimizing for row/column selection. E.g. one could use CSR but also chunk along the column dimension, which is useful for sparse arrays with block structure or with a spatial interpretation.
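A rough sketch of the "free" stacking idea for a directory-backed zarr store; the paths and the hand-written group metadata are purely illustrative:

```python
# Build a chunked sparse array by symlinking existing CSR groups in as
# chunks "0", "1", ... No array data is read or copied.
import json
import os

parts = ["/data/run1/X.zarr", "/data/run2/X.zarr"]  # existing CSR groups
out = "/data/combined/X.zarr"
os.makedirs(out, exist_ok=True)

for i, src in enumerate(parts):
    os.symlink(src, os.path.join(out, str(i)))

# Mark the directory as a zarr (v2) group; the combined shape and chunking
# metadata would also be recorded here (omitted for brevity).
with open(os.path.join(out, ".zgroup"), "w") as f:
    json.dump({"zarr_format": 2}, f)
```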
Disadvantages
More complicated implementation
More expensive conversion to in-memory arrays
Enough rope to hang ourselves with
Easy to make it even more complicated (e.g. Hilbert-curve or kd-tree-like layouts), to the point that implementation becomes difficult
Different "chunks" are different kinds of sparse array
Prior art (far from complete)