GraphBLAS / binsparse-specification

A cross-platform binary storage format for sparse data, particularly sparse matrices.
https://graphblas.org/binsparse-specification/
BSD 3-Clause "New" or "Revised" License
16 stars 4 forks source link

On (not) storing level names #19

Open eriknw opened 1 year ago

eriknw commented 1 year ago

When going to N-dimensional sparse tensors (#2), our group has many different ways to think about and label the levels. Nevertheless, we more-or-less agree on the underlying arrays of indices and pointers and how they relate. So, the purpose of this issue is to discuss the possibility of not naming the levels and instead rely on the names of the arrays and other metadata.

This issue may also help determine what we call the pointers arrays in version 1 of the binsparse spec.

Consider DCSR. Right now, we call the arrays [indices_0, pointers_0, indices_1]. In Finch.jl (and TACO), the arrays would be [pointers_0, indices_0, pointers_1, indices_1], and we can drop pointers_0 here to end up with [indices_0, pointers_1, indices_1]. Clearly, the pointers array between indices_0 and indices_1 connects the two indices, so one can argue for naming it pointers_0 or pointers_1 depending on ones perspective.

Proposal: let's use more descriptive names for pointers, such as having DCSR be [indices_0, pointers_0_to_1, indices_1]. I'm open to having different names for pointers as long as it captures "from 0", "to 1", or "from 0 and to 1" in some way.

For me, making the pointers array names more explicit makes reasoning about the sparse structure much easier even without having level names. So, what do people think about not naming levels and relying on array names instead? Some examples:

DCSR: [indices_0, pointers_0_to_1, indices_1]
CSR: [pointers_0_to_1, indices_1]
COO: [indices_0, indices_1]

Formally, these are the rules for valid collections of arrays for sparse tensors:

BEGIN        -> indices_0
BEGIN        -> pointers_{i}_to_{i+1}
indices_{i}  -> indices_{i+1}
indices_{i}  -> pointers_{j}_to_{j+1} where i <= j < ndim - 1
pointers_{i} -> indices_{i+1}
indices_{i}  -> END

where 0 <= i < ndim

Hopefully we can explain this clearly in docs.

Another possibility is to group indices together such as [indices_0_1] for COO, or even [indices_0_1, pointers_1_to_2, indices_2] for a data structure that Tim doesn't like.

ELL-like compression complicates this somewhat. For example, if all the rows in CSR or non-empty rows in DCSR had the same number of elements, then pointers_0_to_1 would be [0, k, 2*k, 3*k, ...] and we wouldn't need to save the array if we simply saved k in the metadata. One possibility is to have metadata for ranges (start, stop, step), so it could actually be possible to always have all arrays [indices_0, pointers_0_to_1, indices_1, ..., pointers_{n-1}_to_n, indices_n] as either metadata or stored arrays.

A possible challenge with this approach is to allow custom/extension levels such as DIA or VBL as in Finch.jl.

eriknw commented 1 year ago

@willow-ahrens, can you share an example using VBL? What are the arrays? Thanks!

willow-ahrens commented 1 year ago

I believe that level formats, in general, are useful. I think they make it easier to understand and implement the format, and make the format more extensible. I also think that what is being proposed here can be understood as another, different level format, and might benefit from being understood as such.

When many people are trying to describe something and they all reach for a similar but different concept (in this case, similar but different level abstractions), I think that is evidence that the concept in general was quite useful. I suspect that it will be easier to explain an N-dimensional BinSparse format with a level format abstraction than without.

Even if the levels BinSparse formalizes are different from the ones that Finch uses, I would prefer to convert named BinSparse levels to named Finch levels over mapping combinations of array keys in the hdf5 group to level formats. #20 includes an example Binsparse parser to make discussion of "easy to write" more concrete.

Finally, I believe that the main utility of introducing a level format abstraction is to allow for a formal interface between one level and another. This interface guarantees that if a future user implements some new level format (VBL, for example), we know that it will interoperate with existing BinSparse levels because it conforms to the specified interface at the boundary between levels.

While most of our discussion has been about names of levels, consider the underlying discussion of the interfaces between them. In #20, I have proposed that the interface should be that of arrays-of-arrays. It may be helpful to understand this issue (#19) as a different level format. We can then ask what interface between arrays is being proposed, whether that interface is easy to describe and parse, and whether we believe it will be easy to extend in the future.