GraphBLAS / binsparse-specification

A cross-platform binary storage format for sparse data, particularly sparse matrices.
https://graphblas.org/binsparse-specification/
BSD 3-Clause "New" or "Revised" License
16 stars 4 forks source link

Chunked sparse arrays (for v2) #16

Open ivirshup opened 1 year ago

ivirshup commented 1 year ago

I would like to bring up the discussion of chunking again, as something to be addressed in v2.

I see two main strategies here:

I'm think chunking the arrays themselves is the way to go – as it enables more flexibility. I think this is also a pretty safe option, as others have gone for it in the past (see: "prior art" at bottom).

I'm going to go through what the current set-up looks like, and how these two approaches could work. I haven't thought a ton about how these approaches work with n dimensional or hypersparse formats.

@eriknw and @jim22k, would be interested to hear your perspective from experience with metagraph and dask-graphblas especially.

Pasted image 20230302165058

Above is a representation of the different chunking strategies. Each strategy depicts the storage arrays for a CSR or CSC matrix. Color denotes a "logical chunk" : a range of values along the compressed dimension, e.g. contiguous set of rows for CSR. The solid line illustrates how the arrays are subdivided into chunks.

Fixed size chunks

First is fixed size chunks in storage, basically what is being done right now. Here, chunk sizes are fixed for each storage array. This means that "logical chunks" have no set correspondence with storage chunks.

Advantages

Disadvantages

Logical chunking of storage arrays

One solution here is to make the storage chunk boundaries match up with the logical boundaries. A major downside is that this requires the storage backend to allow variable sized chunks. This exists for arrow, and for hdf5 via virtual datasets (example). I would like this ability to exist in zarr v3 (see: ZEP 3), but it doesn't yet.

Advantages

Disadvantages

Logical chunking of sparse arrays

The third solution is to copy the chunking solutions from dense arrays. That is, we do not rely on the array stores native chunking mechanisms, and instead manage chunks of sparse arrays ourselves.

In this model, a chunked sparse array is a collection of other sparse arrays which would need to be concatenated to recreate an in memory un-chunked representation.

This solution isn't necessarily mutually exclusive with "Logical chunking of storage arrays", and the two could complement each other if each sparse array chunk is large enough to have storage array level chunking.

Advantages

Pasted image 20230302181617

Disadvantages

Prior art (far from complete)

ivirshup commented 1 year ago

@eriknw, any chance you took a picture of the white board from our discussion yesterday?