Matrix reduce to Vector with FIRST or SECOND

eriknw commented 3 years ago

Per the spec, one can pass a BinaryOp to GrB_Matrix_reduce_BinaryOp to reduce the rows or columns of a Matrix to a Vector. SuiteSparse 4.0 changed so that only BinaryOps that correspond to known Monoids may be used. FIRST and SECOND do not have Monoids, therefore they cannot be used.

Nevertheless, I expected FIRST and SECOND to work in SuiteSparse. The intent is clear, and I suspect efficient implementations exist.

(Note: per the spec, the outcome of a non-commutative or non-associative binary operator is undefined. FIRST and SECOND are non-commutative--they commute to each other--so their behavior would also be undefined with the current spec. SuiteSparse often implies an ordering when applying BinaryOps and chooses to be well-defined).

DrTimothyAldenDavis commented 3 years ago

The algorithms I'm using for reduce-to-vector would break if the operator is not associative and commutative. FIRST and SECOND wouldn't work at all. Slower methods could work, but then I would have to have two sets of reduce-to-vector methods: one where the operator is well-behaved, and a slower one where it isn't. That's a lot of extra code to support.

eriknw commented 3 years ago

Thanks. I understand this is the case for generic binary operators. If the internal data formats are sorted, I would expect there to be efficient specialized implementations for FIRST and SECOND.

Would you be okay with creating special implementations for these, or do you think this would get too messy? If you're okay with a specialized approach, would this be a good "first issue" to implement for new contributors?

eriknw commented 3 years ago

fwiw, I can't think of a meaningful example when restricted to the current spec where a Monoid can't be used instead of a BinaryOp. So, I think it's fine that you don't fully support reduce with BinaryOp, and perfectly understandable if you don't want to add special code to support a couple edge cases (in this case, FIRST and SECOND).

A better, longer term approach may be to address some of the deficiencies of using BinaryOps (which there are several). In particular, it would be great to have generic machinery in place that can perform a reduction with FIRST or SECOND in parallel. I believe this is doable, but we'd need to determine an API.

For another example, suppose one created a user-define type that is a struct. With generic, parallel machinery in place, I would expect one could perform a reduction to e.g. add all the elements of one of the fields in the struct.

DrTimothyAldenDavis commented 3 years ago

I think the API would be different. There's a similar issue that arises in the CC problem, where an assignment C(I) max= A is done, using the max as the accumulator operator. But in this case, if the index vector has duplicates, they want to apply the accum operator to reduce down to a final result. This is not something the C API for GraphBLAS defines. Using FIRST and SECOND for the accum operator is not unlike the kind of reduction you want to do, if C is a vector and A a matrix. These two cases are different, but they are related in changing what it means to do a reduction.

eriknw commented 3 years ago

That's an interesting case with "assign" with duplicate indices. I see you discuss this (or something very similar) in section 9.9. If I understand correctly, I think what you want here is an explicit "scatter" operation (to a list of indices) that accepts a binaryop/monoid to perform the reduction when values are scattered to the same index. This reduction could happen before the accumulation into the output, so, per normal GraphBLAS operations, accum could be a different binop. Fwiw, I would love to have a functional scatter as well where a function is used to determine which index a value is scattered to.

For interested readers (e.g., my future self): "build" is another place where reduction is kind of special. Could one expect to get better performance if a Monoid were accepted instead of a BinaryOp? SuiteSparse does accept FIRST and SECOND here and is well-defined where the spec is undefined if the binop is not associative or commutative.

DrTimothyAldenDavis commented 3 years ago

That's right ... "build" is special.

I have to accept FIRST and SECOND here, because I also use "build" to sum up my "pending tuples". If there are two assignments, in two calls to GrB_assign, the first gets overwritten by the second. So if the two calls to GrB_assign both leave pending tuples around, I have an "implied SECOND" operator to assemble the duplicates.

here the FIRST and SECOND are defined by which call to GrB_assign came first, so it's very well defined.

If an accum operator is used in the two calls to GrB_assign, then this becomes my "dup" operator for "build", and it cannot be assumed to be associateve / commutative.

On Fri, Feb 26, 2021 at 5:41 PM Erik Welch notifications@github.com wrote:

That's an interesting case with "assign" with duplicate indices. I see you discuss this (or something very similar) in section 9.9. If I understand correctly, I think what you want here is an explicit "scatter" operation (to a list of indices) that accepts a binaryop/monoid to perform the reduction when values are scattered to the same index. This reduction could happen before the accumulation into the output, so, per normal GraphBLAS operations, accum could be a different binop. Fwiw, I would love to have a functional scatter as well where a function is used to determine which index a value is scattered to.

For interested readers (e.g., my future self): "build" is another place where reduction is kind of special. Could one expect to get better performance if a Monoid were accepted instead of a BinaryOp? SuiteSparse does accept FIRST and SECOND here and is well-defined where the spec is undefined if the binop is not associative or commutative.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/DrTimothyAldenDavis/GraphBLAS/issues/28#issuecomment-786951632, or unsubscribe https://github.com/notifications/unsubscribe-auth/AEYIIOLXDL73HUSWJ42RHP3TBAWQ7ANCNFSM4YI6NRFA .

DrTimothyAldenDavis commented 3 years ago

I think what you really need is not a reduce-to-vector with FIRST and SECOND. That is ill-defined for any other binary op. What you really need is a kind of GxB_select that can depend on the entry count in a given row or column. "Give me just the first 4 entries in each row, starting from the lowest column indices". That step appears in CC, but GraphBLAS can't do it so the LAGraph CC exports the matrix and does it itself.

eriknw commented 3 years ago

It's good to bring select into the conversation, but I really would like a better reduction with binaryop. For example, performing a reduction on one field of a user-defined type that is a struct.

To do such a generic reduction well (and able to be done in parallel, so I will be talking about doing reductions in groups), I think at least four things are needed:

Y = unaryop(x): initialize a reduction with the first argument in a group. The output type Y may be different from the input type X.
Y = binaryop(Y, X): perform a step in the reduction within a group. The first argument is the previous result of the reduction within the group, and the second argument is the current value in the matrix.
Y = combine(Y, Y): a binaryop or monoid to combine results from groups.
is_commutative: whether the reduction commutes or not. If it doesn't commutes, then the combine steps must occur in order. Otherwise, groups may be combined in any order. This could probably be inferred/implied by whether the combine operator is a binaryop or monoid.

For example:

group1 = binop(binop(binop(unary(x1), x2), x3), x4)
group2 = binop(binop(binop(unary(x5), x6), x7), x8)
final = combine(group1, group2)

A further enhancement (that I'm not pushing for, but think is important to be aware of) would be to allow the type Y to be malloced, which would first happen in the unaryop. If it were malloced, then one would also need a finalize unaryop to perform the free and convert to a normal (fixed-size) GraphBLAS type. For completeness-sake, if malloc were used, then one could also provide a function to serialize and deserialize the object to send over the wire if necessary. Although complicated, this functionality would also be extremely powerful.

If order is important (i.e., for reductions that don't commute), then there is a natural order to do things: along a column, along a row, or along a vector. For matrix reduce to scalar, we'd need to choose a convention to linearize the elements (for example, like it were C-contiguous or row-wise), and setting the transpose descriptor on the input would choose the other option for linearizing a matrix.

FIRST and SECOND make perfect sense in view of this style of reduction. For these, the unaryop would simply be the identity, and the combine operator would be the same as the reducing binaryop (FIRST or SECOND).

If any of this functionality were available, I would surely use it. I don't need it though, and I think having this conversation is more important atm.

eriknw commented 3 years ago

I like that your select variant has a clean API that is easy to understand. Instead of operating on the absolute index I or J of an element, it operates on, say, ith and jth, which is the count of number of elements previously seen along this row, column, or vector. It's perhaps a little awkward if both ith and jth are needed. Otherwise, the API can match your select or the new select. I don't have a feel for how common or useful such an operation would be.

Regarding reduction on user-defined binaryops again, there are many more operations that could be done with my proposal above when combined with user-defined types:

Get up to the first k elements.
- it may be difficult to stop the iteration once k elements are seen.
Get a random k elements.
Determine if rows are monotonically increasing or decreasing.
Calculate approximate percentiles.
Calculate approximate count distinct.
Compute a checksum.
Compute the sum and count at the same time
- If there were a finalize step at the end, this could then compute the mean.
Similarly, if there were a finalize, one could first compute the count, sum(x), and sum(x**2), and then calculate the standard deviation from these all in a single pass.
- We can take this further and use it to compute a linear regression!
Lots more: power mean, geometric mean, central mean (two passes), log-sum-exp, skewness (two passes), etc.

It is common in modern data analytics system to have user-defined aggregation functions in a form similar to initialize-update-merge-evaluate. Variations exist, and I should point to literature or documentation...

DrTimothyAldenDavis commented 3 years ago

The GxB_select method isn't a good place to do any kind of reduction or aggregation. An algorithm to do that would be a very different one than what GxB_select currently does. The method is point-wise, in the sense that the select op only depends on one value A(i,j) and its position. The result for one A(i,j) can't depend on values nearby, currently.

The idea of a select op that can take as input an integer k where the entry in question A(i,j) is the kth entry in its row, could be very useful but I'm struggling with trying to define it clearly so it can be implemented efficiently. If the select op takes in 2 integers, say k and s for an entry A(i,j) that is the kth entry in its row and sth entry in its column, then this would be very very hard to implement efficiently. I don't have that information in the matrix. To compute it, I would have to transpose the matrix, and then operate on both the matrix and its transpose at the same time. That would be very very slow ... and if the select op is user-defined, I might not know that the selectop doesn't need the value s anyway, so all this extra work is useless. And it is a lot of extra work.

So for now, I don't see how I can add a selectop that depends on both k and s.

Maybe just k ... but what if the matrix is held by column and k is the ordinal position in the row? I'd need a costly transpose. That at least is more feasible, however, than depending on both k and s.

Computing a mean or mode in a given row is really not well suited to GxB_select. The mean can be done with GrB_mxv already, or reduce-to-vector (which is the same thing actually). Even sum(x^2) could be done, to sum up the squares of entries in each row. That could be done with a single call to GrB_mxv with the right semiring. The monoid is plus, and the multiplicative operator would be f(x,y) = x^2. I don't have that operator as a built-in, but it could be added as a user-defined op. Currently, user-defined ops and types are slow ... but my plan is to eventually allow them to be all just as fast as built-in ops, by using a jit.

The "finalize" step is out of scope of GrB_mxv; it can be done quickly as a GrB_eWiseMult with the div operator and a vector holding the degree.

Computing the mode or median of each row is very problematic. It's not expressible as a semiring, nor as a reduce-to-vector operation.

On Fri, Apr 9, 2021 at 12:50 PM Erik Welch @.***> wrote:

I like that your select variant has a clean API that is easy to understand. Instead of operating on the absolute index I or J of an element, it operates on, say, ith and jth, which is the count of number of elements previously seen along this row, column, or vector. It's perhaps a little awkward if both ith and jth are needed. Otherwise, the API can match your select or the new select. I don't have a feel for how common or useful such an operation would be.

Regarding reduction on user-defined binaryops again, there are many more operations that could be done with my proposal above when combined with user-defined types:

Get up to the first k elements.

it may be difficult to stop the iteration once k elements are seen.

Get a random k elements.

Determine if rows are monotonically increasing or decreasing.

Calculate approximate percentiles.

Calculate approximate count distinct.

Compute a checksum.

Compute the sum and count at the same time

If there were a finalize step at the end, this could then compute the mean.

Similarly, if there were a finalize, one could first compute the count, sum(x), and sum(x**2), and then calculate the standard deviation from these all in a single pass.

We can take this further and use it to compute a linear regression!

Lots more: power mean, geometric mean, central mean (two passes), log-sum-exp, skewness (two passes), etc.

It is common in modern data analytics system to have user-defined aggregation functions in a form similar to initialize-update-merge-evaluate. Variations exist, and I should point to literature or documentation...

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/DrTimothyAldenDavis/GraphBLAS/issues/28#issuecomment-816851276, or unsubscribe https://github.com/notifications/unsubscribe-auth/AEYIIOPKAAQ3TZHFAOTLA7DTH446FANCNFSM4YI6NRFA .

eriknw commented 3 years ago

Thanks for taking the time to reply. I think I can do a better job explaining what I consider most important in this issue, so please indulge me as I summarize below. This is a long comment, so please take as much time as you like.

Summary

Reductions with BinaryOps (esp. user-defined ones) as currently in the spec have serious deficiencies.
SuiteSparse:GraphBLAS does a reasonable thing and only supports reductions with Monoids or builtin BinaryOps that have associated Monoids.
Even though I know (2), I was surprised that reducing a Matrix to a Vector with FIRST or SECOND didn't work (we can call this a user error--my bad).
Stronger than (3), I believe reductions with user-defined BinaryOps can be extremely powerful and is likely to be useful for some graph algorithms (and certainly useful for a general sparse linear algebra library).
Hence, I think it's worth the time to try to come up with a better approach to perform user-defined reductions not limited to Monoids.
I proposed an approach that I think fits within GraphBLAS and that I think addresses many of the existing shortcomings.
I would like feedback. Is this approach feasible? Implementable? Mathematically sound? Complete? Too much? Can it be improved? Should I spend the time to link to supporting references?
Is there any interest in me creating new function signatures or PoC implementations of my proposal?
I will be watching for examples where this may come in handy!

The Proposal

The goal:

support reductions with BinaryOps (including user-defined ones)
that can express many operations (more than can be expressed with matrix multiply with different semirings)
and can be parallelized efficiently.

Reduction methods include Matrix to Vector, Matrix to Scalar, Vector to Scalar, and building Matrices and Vectors with dup_op (this last one may require separate attention). Because order may matter during the reduction, I will call the group of elements that is being reduced over a list (if order didn't matter, I would call this a multiset). Since we require the reduction to be parallelizable, we allow the list of elements to be partitioned into sub-lists that can be operated on independently.

Instead of a single BinaryOp, we require three operators:

Initialize with UnaryOp: applied to the first item in a sub-list.
- output domain may be anything (i.e., it is not limited to be coercible to or from the input domain)
Update with BinaryOp: operates on the previous result and the next item in the sub-list.
- output domain should match the output domain of UnaryOp.
Merge with BinaryOp or Monoid: combines results from sub-lists.
- If BinaryOp: sub-list results must be combined in order (i.e., doesn't commute), and updates with BinaryOp must operate in order.
- Input and output domains should match.
- If Monoid: sub-list results may be combined in any order, and updates with BinaryOp may operate in any order.
- Associativity is assumed for BinaryOp and Monoid; e.g., tree reductions are allowed.
- Instead of having variants for both options--BinaryOp and Monoid--another option would be to only support BinaryOp with a boolean flag whether the operation commutes or not (i.e., needs to operate in order or not).

If there are k elements in a list being reduced, the UnaryOp may be called between 1 to k times (i.e., once per sub-list). The expectation is that the list will be partitioned into enough sub-lists to take advantage of parallelism. The UnaryOp will probably be the identity fairly often, but it may also be a relatively expensive operation, so we don't want to always call it on every element.

It's worth noting that there is no notion of identity of the missing elements or of the BinaryOp. Such an identity is strictly for Monoids and doesn't apply here. If the list being reduced is empty, then the result is empty. If it only has one element, then the result is the result of the UnaryOp.

I realize this would be a lot of extra code to support. I'm not asking you to implement it (a JIT would be much more amazing!), but I do think it's worth having the conversation and keeping an eye open for use cases.

Misc.

Yeah, I understand the behavior and pitfalls of the variant of select you brought up. I'll let you know if I find any more use cases for it.

I wasn't asking for mode or median. I was saying that if I were able to perform reductions with user-defined BinaryOps as per my proposal, then I could use it to efficiently compute approximate percentiles. And lots of other things in a single reduction operation.

Going back a few comments, you described a situation that comes up in CC. If I understand correctly, I recognize this operation as a "scatter" with aggregation. I would love it if an astute reader of this issue would give this more thought!

I'm sorry for repeating myself. There's been a lot of good/interesting information in this thread, and I wanted to refocus :)

DrTimothyAldenDavis / GraphBLAS