DrTimothyAldenDavis / GraphBLAS

SuiteSparse:GraphBLAS: graph algorithms in the language of linear algebra. For production: (default) STABLE branch. Code development: ask me for the right branch before submitting a PR. video intro: https://youtu.be/Tj5y6d7FegI .
http://faculty.cse.tamu.edu/davis/GraphBLAS.html

Advocating for Aggregator objects to perform reductions #33

Open eriknw opened 3 years ago

eriknw commented 3 years ago

I think there is a strong case to be made to add Aggregator objects to GraphBLAS to perform reductions. In software, it's powerful to give names to common operations, or, as is the case for aggregations, to reuse names users are already familiar with. This hides implementation details behind a cleaner API that better captures what the user wants to do instead of how to do it. I expect this would result in cleaner, more portable, and more performant code.

Such a change to a GraphBLAS library (or the spec) should, of course, be done with care. Generic user-defined aggregations are complex, and systems that implement them are often too simple and lack functionality, performance, or both. Expressive, performant user-defined aggregations are necessarily complicated to support (imho), and we shouldn't lose sight of this. Having Aggregator objects allows a library to incrementally support more and more advanced aggregations (if desired), the details of which are hidden behind a nice API and opaque types. But one shouldn't start advanced. Let's begin simple.

I think SuiteSparse:GraphBLAS can actually get really far by reusing its existing machinery. Most aggregations people care about do use monoids during a reduction step, which can be done with your O(1) dense vectors and suitable semirings. Many aggregations should be "simple" recipes for SuiteSparse (I quote "simple" so as to not make light of it--SuiteSparse is anything but simple!).

To help myself think through this, I have assembled a spreadsheet that describes how some common aggregations (and a few uncommon ones) may be performed in GraphBLAS. It's incomplete, but I think it's fairly illustrative:

https://docs.google.com/spreadsheets/d/1dQNrxLBVq8Y3JOdvNnIXRPZsPTDaf7lydztiG7Gbx8E/edit?usp=sharing

I split the aggregations into six groups of increasing complexity or likely difficulty of implementation. The 0th group (not included) comprises the aggregations that are already well-handled by reductions with monoids, such as SUM, PROD, ANY, ALL, MIN, MAX, etc. These could, of course, be rolled into Aggregators.

Group 1

// Group 1 (blue)
struct Aggregator {
    // May define init_value, init_unaryop, or neither
    // If none are defined, then an O(1) dense vector with irrelevant data can be used in a matrix-vector multiply.
    TYPE *init_value,  // Optional; e.g., the values of an O(1) dense vector.
    UnaryOp *init_unaryop,  // Optional; e.g., is applied to the input matrix at the beginning.
    BinaryOp *update_binaryop,  // Optional; e.g., the BinaryOp in a matrix-vector multiply.
                                // If not provided, then `first` may be assumed such as `plus_first(A @ dense)`
    Monoid merge_monoid  // Required; e.g., the Monoid in a matrix-vector multiply.
                         // If only this is provided, then it is equivalent to existing reduce with Monoid.
};

Group 2

// Group 2 (green)
// Add finalize.
struct Aggregator {
    TYPE *init_value,
    UnaryOp *init_unaryop,
    BinaryOp *update_binaryop,
    Monoid merge_monoid,
    UnaryOp *finalize  // Optional; is applied to the output at the very end.
};

Group 3

// Group 3 (yellow)
// Same fields as group 2, but requires tuple types or composite aggregations.

Group 4

// Group 4 (orange)
// Add init_indexunaryop.
struct Aggregator {
    // May define init_value, init_unaryop, or init_indexunaryop with init_value, or none of these.
    TYPE *init_value,
    UnaryOp *init_unaryop,
    IndexUnaryOp *init_indexunaryop,  // Optional; for indices; uses `init_value` as thunk value.
    BinaryOp *update_binaryop,
    Monoid merge_monoid,
    UnaryOp *finalize
};

Group 5

// Group 5 (red)
// Add merge_binaryop and terminate_unaryop, and don't use update_binaryop.
// Exactly one of merge_binaryop or merge_monoid must be defined.
struct Aggregator {
    TYPE *init value,
    UnaryOp *init_unaryop,
    IndexUnaryOp *init_indexunaryop,
    BinaryOp *update_binaryop,
    BinaryOp *merge_binaryop,  // elements must be processed in order,
                               // but can still be decomposed into groups and done in parallel
    Monoid *merge_monoid,  // this is now optional
    UnaryOp *terminate_unaryop,  // call on each aggregation value to check whether we can terminate
    UnaryOp *finalize
};

Group 6

// Group 6 (brown)
// Add update_indexunaryop, and allow use of update_binaryop.
struct Aggregator {
    // init options
    TYPE *init_value,
    UnaryOp *init_unaryop,
    IndexUnaryOp *init_indexunaryop,
    // update options
    BinaryOp *update_binaryop,
    IndexUnaryOp *update_indexunaryop,  // use previous agg value as thunk value
    // merge options
    BinaryOp *merge_binaryop,
    Monoid *merge_monoid,
    // final options
    UnaryOp *terminate_unaryop,
    UnaryOp *finalize
};

In group 5, we used init_unaryop or init_indexunaryop with merge_binaryop, but didn't define any update options. In this case, the init operation is applied to every element, and the elements are merged together (while maintaining order) with merge_binaryop.

In group 6, we use an init operation, an update operation, and merge_binaryop. This means if we split the elements being aggregated into sublists (for parallel processing), then the init operation is only called on the first element in each sublist, and the update operation is used for all the rest.

More behaviors and fields could be added. For example, if an update operation is specified without specifying a merge operation, then this could mean the elements must be processed in order, serially (i.e., no within-group parallelism!). For another example, terminate_value could be added to serve the same purpose as terminate_unaryop.

Known shortcomings:

I'll be honest, the thing I like most about this proposal is COUNT. I really want a canonical, portable way to calculate row degrees and column degrees 😃 .

Please let me know if anything needs more explanation (or correction). I intend this issue to be as much for the GraphBLAS committee and community as for SuiteSparse.

Cheers! 🍻

DrTimothyAldenDavis commented 3 years ago

Some of the things you're looking for are already in the pipeline and will be available soon. In particular the COUNT will be done far faster in v5.x, but not using an aggregator.

The COUNT in Group 1 can already be done in the current GraphBLAS using the PLUS_PAIR semirings and GrB_mxv. If d is the resulting vector, it takes O(n + nvals(d)) time and memory in GraphBLAS v4.0.3, and will take O(nvals(d)) time and memory in v5.x once I add uniform-valued matrices and vectors. The other methods in Group 1 could be handled with different operators, either added as built-ins or done as user-defined operators. COUNT_NONZEROS in each row of A, for example, would be a PLUS_FIRSTISNONZ semiring with d=A*y, where y is a dense vector and

FIRSTISNONZ_type(x,y) = (type) ((x != 0) ? 1:0)

If A m-by-n is held by row and you want to reduce it to a vector d of length m, by 'summing' up each row, then it becomes a dot-product based GrB_mxv, which is O(nvals(d)) time for COUNT if done by GrB_reduce, and will be the same time in GrB_mxv in v5.x once I add uniform-valued matrices and vectors (so the dense vector y takes O(1) time and memory to build). If you want to reduce the columns, it becomes GrB_vxm, which is a saxpy-based method, and I use atomics to do the monoid, so it's still fast, efficient, and parallel, at O(nvals(d)) time. (when I say "time" I mean "work", not "time", to be precise; it would be O(nvals(d)/p) time on p threads, assuming no significant hash collisions).

SUM_OF_SQUARES is the PLUS_POW semiring with a dense vector y whose entries are all equal to 2. I already have the POW binary operator so PLUS_POW could easily be added as a built-in method in my Source/Generated folder so it would be very fast.

For COUNT_NONZERO, the time would be O(nvals(A)), as opposed to O(nvals(d)) for COUNT; both are asymptotically as fast as theoretically possible, and this will appear in v5.0.0 or v5.1.0. Uniform-valued matrices/vectors are 2nd on my TODO list, after I fix the MATLAB interface for R2021a (the MATLAB interface to v4.x and v5.x dies when using MATLAB R2021a, since R2021a includes its own copy of GraphBLAS v3.3.3 to do the built-in C=A*B, which conflicts with v4.x+).

None of the aggregators you suggest, except the basic ones that are already monoids, can be done with atomic updates; they are by design sequential, with each aggregator owned by a single thread. They would be quite slow if you have a CSR matrix A and you want to aggregate down the columns of A (I'm not sure how my hash+atomics method for the hypersparse case would extend to the aggregators; it would be quite difficult to do). If you had a vector of n aggregators, it would take O(n+nvals(d)) time and memory, making for a very slow aggregator, whereas a method based on GrB_vxm would be very fast (taking O(nvals(d)) time). If n is huge, this is a big difference. The implementation would need a hash table of dynamically-created aggregators ... Then I'd need to give each of these its own spin-lock/critical section, or enforce sequential access somehow. It's a very daunting prospect to code.

I see that more expressive reductions could be useful, but I hesitate to consider methods that impose a fundamentally sequential computation on GraphBLAS, like updating a single aggregator. I think it would be better to start simple, and see how far we can take the existing monoid/semiring idea (such as COUNT and all other methods in Group 1, which I can almost already do asymptotically fast, and will do so in v5.x), and go from there.

eriknw commented 3 years ago

I think it would be better to start simple, and see how far we can take the existing monoid/semiring idea

I agree with this sentiment. I think there is value in having Aggregator objects, so I can begin by adding aggregators to grblas that will translate to the appropriate calls to SuiteSparse:GraphBLAS. I can do the work in anticipation of O(1) dense vectors. This should be informative.

I updated SUM_OF_SQUARES and HYPOT to use POW as you suggested (my bad!).

jim22k commented 3 years ago

@eriknw I really like this proposal. Abstracting the complexity away from the user so they can treat aggregations as easily as monoids for reduction is a win for readability. It also allows GraphBLAS to be a more complete solution for general-purpose sparse linear algebra, where these kinds of aggregators would be expected to exist (especially simple ones like mean, std, count, etc).

DrTimothyAldenDavis commented 3 years ago

The complexity of the aggregators in a parallel algorithm inside GraphBLAS is daunting, particularly when you consider reducing in the "wrong" direction (say aggregating the columns when the matrix is stored by row).

I can already do the basic ones, with GrB and sometimes GxB methods. mean is just sum(A)./degree(A), and degree(A) is just d = Ax or d=A'x with x = ones(n,1) and the PLUS_PAIR semirings. So then harmonic mean, std, variance, kurtosis, skew, and so on can easily be done too, at least if multiple calls to GrB are done. I call your count(A) the degree(A), so that's easy. The only GxB that is required is the PAIR operator, to compute the count or degree.

argmin and argmax can be done with the GxB* positional operators, and the FIRST and LAST can be done that way too. You write:

ARGMIN and ARGMAX don't naturally generalize to Matrix to Scalar reduction.

but in fact they do. It just takes several calls to existing methods, plus my GxB* positional binary ops. Something like this works for argmax applied to the columns of the matrix A:

function [s,i] = argmax (A)
% argmax, using GraphBLAS
% does not handle matrices with nan's very well
% if column j has no entries, MATLAB returns s(j)=0 and i(j)=1, but
% this method returns s(j) and i(j) as empty entries.

% find the max in each column
s = max (A) ;

% locate where each max occurs in each column (there might be duplicates)
S = diag (s) ;
L = GrB.mxm (A, 'any.eq.double', S) ;

% drop the zeros in L
L = GrB.prune (L) ;

% find the position of the max entry in each column
[m,n] = size (A) ;
x = ones (1,n) ; % x = GrB.ones (1,n) in the next release
i = GrB.mxm (x, 'any.secondi1.int64', L) ;

with the test code:

rng ('default') ;
A = GrB.random (10, 10, 0.2)
[s,i] = argmax (A)
[s2,i2] = max (double (A))

The only reduction/aggregator in that list that I don't know how to do with the current GrB and GxB methods is "ismonotonically*". Also the mode would be difficult to do. These kind of methods do not lend themselves to implementation with a monoid.

What are ptp and logaddexp?


DrTimothyAldenDavis commented 3 years ago

Here is the code with better format:


function [s,i] = argmax (A)
% argmax, using GraphBLAS
% does not handle matrices with nan's very well
% if column j has no entries, MATLAB returns s(j)=0 and i(j)=1, but
% this method returns s(j) and i(j) as empty entries.

% find the max in each column
s = max (A) ;

% locate where each max occurs in each column (there might be duplicates)
S = diag (s) ;
L = GrB.mxm (A, 'any.eq.double', S) ;

% drop the zeros in L
L = GrB.prune (L) ;

% find the position of the max entry in each column
[m,n] = size (A) ;
x = ones (1,n) ; % x = GrB.ones (1,n) in the next release
i = GrB.mxm (x, 'any.secondi1.int64', L)  ;

with the test code:

rng ('default') ;
A = GrB.random (10, 10, 0.2)
[s,i] = argmax (A)
[s2,i2] = max (double (A))

I'm using the secondi1 operator since MATLAB expects that indices start with 1, not zero. So for GraphBLAS proper, use secondi, not secondi1.

DrTimothyAldenDavis commented 3 years ago

The vector x is an "iso-valued" dense vector of all ones, and will take O(1) time and memory to create in v5.1, so this will work nicely for hypersparse matrices. It already works in my draft code. I can do:

n = 2^60
H (1:10,1:10) = A
[s,i] = argmax (H)

and get the same result, except that s and i are vectors of dimension 2^60 instead of 10, in this case with just 8 entries each.

DrTimothyAldenDavis commented 3 years ago

Assuming I change the statement x = ones(1,n) to x = GrB.ones(1,n), that is.