Design: What to do with IRV for sparse arrays

e-kayrakli commented 5 years ago

Sparse arrays stores irv and expose it via a proc IRV to keep track of the "Implicitly Replicated Value".

Obviously, 0 is the most common value for this value. This is such a common case that today our LinearAlgebra module just silently assumes that this is the case. So, if you multiply two matrices whose IRV's are non-zero, then you'll get a wrong answer. The particular LinearAlgebra problem can be solved with adding some checks in the library and falling back to naive implementation if necessary, however, this would incur some (probably marginal) overhead.

Furthermore, philosophically speaking, a sparse array with a non-zero IRV is closer to a dense array from many operational perspectives (except for storage) than a sparse array in today's world. So, maybe IRV should be part of the type of a sparse array somehow.

I see multiple possible approaches:

Make IRV a param on sparse arrays. In today's syntax it may not be directly possible to create a sparse array with a non-default IRV, then we can have a factory function for those arrays. I think they are special enough that they can be defined with a different syntax.
Make IRV a param on sparse domains. This can work better with today's syntax: sparse subdomain(parentDom, IRV=5);. However, precludes having sparse arrays sharing the same domain have different IRV values.
Remove IRV altogether. 3a. Add a separate type/module that allows to create arrays with replicated values for storage purposes.
Keep it as is. Overhaul all the sparse math we have to make sure we do the right thing.

I personally don't know any use case for IRV!=0, and would view such arrays as "dense arrays that you can probably store in some smart way if they get huge". So, 3 is where I stand today. But learning some use cases for it can make me switch to 1 or 2 easily. I don't really want 4 but I can live with it if that's the consensus.

e-kayrakli commented 5 years ago

Ping @ben-albrecht and @bradcray as I had discussed this issue with them in the past.

bradcray commented 5 years ago

I thought the linear algebra library was designed to use its own "Sparse Matrix" values that the user had to request through its interface rather than accepting any general sparse Chapel arrays? In which case, doesn't it have full control over what the IRV is? (i.e., is this actually a non-issue?)

Philosophically, I don't buy in the slightest that if a sparse array's IRV is something other than zero, it's not sparse. I might buy that argument for the term "sparse matrix," but definitely not for a sparse array. In my understanding and opinion, sparsity isn't a mathematical notion, it's an implementation approach that can be applied to anything that can benefit from it. If an array of 1,000,000 files has 99.9% of them set to /dev/null or an array of string names has 99.9% of them set to "", it would be a mistake to store that information densely.

Because Chapel doesn't support param values of arbitrary types (yet), I don't think the IRV can be required to be specified as a param.

I definitely don't think we should remove the feature either.

e-kayrakli commented 5 years ago

I thought the linear algebra library was designed to use its own "Sparse Matrix" values that the user had to request through its interface rather than accepting any general sparse Chapel arrays? In which case, doesn't it have full control over what the IRV is? (i.e., is this actually a non-issue?)

There is no "Sparse Matrix" values in the linear algebra module. There are factory functions that create standard sparse arrays and domains. And functions accept any sparse arrays/domains.

Philosophically, I don't buy in the slightest that if a sparse array's IRV is something other than zero, it's not sparse. I might buy that argument for the term "sparse matrix," but definitely not for a sparse array. In my understanding and opinion, sparsity isn't a mathematical notion, it's an implementation approach that can be applied to anything that can benefit from it. If an array of 1,000,000 files has 99.9% of them set to /dev/null or an array of string names has 99.9% of them set to "", it would be a mistake to store that information densely.

Because Chapel doesn't support param values of arbitrary types (yet), I don't think the IRV can be required to be specified as a param.

I definitely don't think we should remove the feature either.

I can see where you stand is different from where I stand philosophically. I don't think as strongly as you are and I am not going to argue further. If anything, IRV is a feature that brings flexibility to sparse arrays. I can learn to live with it.

We can have a separate discussion for making it param, when we can.

I am keeping the issue open just to give chance to people to comment if they feel strongly one way or another. I'll close it in favor of option 4 if nobody objects.

bradcray commented 5 years ago

There is no "Sparse Matrix" values in the linear algebra module. There are factory functions that create standard sparse arrays and domains. And functions accept any sparse arrays/domains.

I'm not an expert in the Linear Algebra module, but my question would be: Does the linear algebra module promise/guarantee that it will operate on arbitrary sparse arrays that are passed in, or does it only promise to work with those that are created from its factory methods? Based on early design discussions, I would've imagined the latter even though we may not be preventing arbitrary sparse (2D) arrays from being passed in today. Specifically, at times, we've talked about having the library represent sparse arrays by wrapping them in a record or other approaches, and I would guess we wouldn't want to preclude changing the library's expected representation of sparse matrices down the road, so make no promises that you can pass your arbitrary sparse array in. But again, I don't know.

ben-albrecht commented 5 years ago

Based on early design discussions, I would've imagined the latter even though we may not be preventing arbitrary sparse (2D) arrays from being passed in today.

Your intuition / recollection is correct. It is a design goal of the Linear Algebra module to be flexible w.r.t. domains of the input arrays, including offset and stride. However, sparse support started from a perspective of: "We support CSR matrices with IRV=0 only". This past release has started extending support to COO matrices, but the library doesn't do anything smart with IRV yet. I don't think it checks IRV anywhere, which I consider and oversight -- we should at least be generating an error when IRV!=0.

Keep it as is. Overhaul all the sparse math we have to make sure we do the right thing.

I think it would be reasonable to not support IRV!=0 sparse matrices for a while, or only include naive implementations, e.g. treat IRV!=0 sparse matrix as dense in the context of matrix multiplication.

I am personally leaning towards option (3a), as I am not aware of any important use-cases that require IRV. Having it as a part of the core sparse array interface puts a strain on libraries such as the linear algebra module that aim to be domain-agnostic. Also, it doesn't look like scipy or matlab support this feature out of the box, which I consider a high standard for linear algebra / sparse array interfaces.

bradcray commented 5 years ago

To be clear, I think it's completely reasonable for the L.A. library to only support IRV == 0 because I think that's what makes sense in the context of sparse matrices. I'm just not convinced that that makes sense for sparse arrays more generally. (Also, note that IRV is a property of the array, not the domain, so you could still make L.A. domain-agnostic while only creating / accepting matrices with IRV == 0). For me, the problem with 3 / 3a is "What is the behavior of myDenseArr = mySparseArr for an arbitrary eltType for the indices in mySparseArr.domain that are not in myDenseArr.domain. Those indices must have some value associated with them, and the IRV is intended to represent that value.

e-kayrakli commented 5 years ago

For me, the problem with 3 / 3a is "What is the behavior of myDenseArr = mySparseArr for an arbitrary eltType for the indices in mySparseArr.domain that are not in myDenseArr.domain. Those indices must have some value associated with them, and the IRV is intended to represent that value.

I see the indices that are in myDenseArr.domain but not in mySparseArr.domain as simply nonexistent and they do not have any value associated with it (in a mathematical context 0 represents those values implicitly). Therefore, it feels intuitive to me that myDenseArr=mySparseArr would result in those indices filled by whatever the default value for eltType is.

ben-albrecht commented 5 years ago

To be clear, I think it's completely reasonable for the L.A. library to only support IRV == 0 because I think that's what makes sense in the context of sparse matrices.

@bradcray - Agreed. Aside from the small performance penalty in adding runtime checks that IRV==0, I think your comment addresses my concern:

Having it as a part of the core sparse array interface puts a strain on libraries such as the linear algebra module that aim to be domain-agnostic.

I would still be interested in understanding the motivating use-case(s) behind supporting IRV in the first place.

bradcray commented 5 years ago

Aside from the small performance penalty in adding runtime checks that IRV==0

The point I was trying to make above was that if the L.A. library is documented to say that it's only allowed to be called with sparse matrices that are created with its factory functions, then there's no need to add the checks because the IRV could only be zero since doing otherwise is a user error (at their own risk). You could add the check anyway to be nice to users who abuse that rule, and then disable them with --no-checks to avoid the performance penalty with --fast. That said, if we're really worried about the time an integer or floating point comparison takes, we've got some really impressive L.A. routines.

I would still be interested in understanding the motivating use-case(s) behind supporting IRV in the first place.

What would you propose the "zero" for a sparse array of string or file or MyRecord or MyClass be? Or would you say that you're not allowed to create sparse arrays of any types that don't have a numerical "0" value?

It sounds like Engin would propose:

whatever the default value for eltType is.

I agree that the default value for eltType is a reasonable default IRV (and in fact, it is), but it seems unnecessarily restrictive to me given the small amount of space the IRV takes to store and the fact that the values in an array don't necessarily bear any relationship to a given type's default value. I.e., that approach would only permit my sparse array of string names to store the empty string as its zero value rather than "John/Jane Doe" and my empty array of files could only store... [whatever the default initialized file value is] rather than a file referring to /dev/null, say. If I had a sparse array of n x n matrices, the default value of the arrays used to represent those matrices would necessarily be "all zeroes" whereas it might be more appropriate in my application to store the IRV matrix as "the n x n identity matrix".

I see the indices that are in myDenseArr.domain but not in mySparseArr.domain as simply nonexistent

I think it's wrong to say that they're nonexistent. They exist, they're just not explicitly represented. The code:

const D = {1..3, 1..3};
var A: [1..3, 1..3] real = 1.0;
var SD: sparse subdomain(D) = [i in 1..3] (i,i);
var SA: [SD] real = 2.0;

A = SA;

ought to change A to store:

2.0 0.0 0.0
0.0 2.0 0.0
0.0 0.0 2.0

not:

2.0 1.0 1.0
1.0 2.0 1.0
1.0 1.0 2.0

Similarly, given two sparse arrays with different sparsity patterns SA and SB, A = SA + SB; ought to change all 9 A elements to store the logical 3x3 result matrix representing the sum of the 3x3 arrays represented by SA and SB; it shouldn't only change the indices of A that represent the intersection or union of SA and SB, say.

I.e., though represented compactly, the sparse array SA still represents a conceptual 3x3 array value that should overwrite all 9 elements of A rather than only touching the 3 diagonal values of A.

e-kayrakli commented 5 years ago

@bradcray

I see the indices that are in myDenseArr.domain but not in mySparseArr.domain as simply nonexistent and they do not have any value associated with it

I should rephrase this. The indices do exist in the sparse domain, but values for them do not exist in the array using those domains.

So for the examples that you have given, I agree about the desired behavior completely and do not see them as contradictory to my view. But my intuition is when we want to read a value form sparse array index that doesn't exist, we simply use the default value for the element type. And thus, the above examples all work as expected.

For particular examples with string sparse arrays as "John Doe" IRV, or path sparse arrays with "/dev/null" IRV: I believe handling such use cases is not necessarily up to the sparse array implementation. Rather, the application should create a type with such default values and use the sparse array of that type.

bradcray commented 5 years ago

Rather, the application should create a type with such default values and use the sparse array of that type.

A downside of this is that my array of names could no longer simply be an array of strings—I have to go through extra work to create a new type when I really just want a string. Another is that I couldn't have two sparse arrays of the same type where one's IRV is "John Doe" and the others is "Jane Doe".

Mostly on this issue, it's not really clear to me what problem we're trying to fix. What aspect of the current behavior is broken or lacking or causing problems?

lydia-duncan commented 1 year ago

We've decided not to stabilize sparse arrays for 2.0, removing the 2.0 label

chapel-lang / chapel

Design: What to do with IRV for sparse arrays #14281