Open DrTimothyAldenDavis opened 2 years ago
I'm coming across more and more graph algorithms that need to sort a vector, and some need to sort a matrix. We have some LAGraph_Sort methods but they are not sufficient.
Sorting a vector or matrix comes up in these methods:
1. using a Fiedler vector to partition a graph (I have a student working on this, in python, and the method is fast _except_ when it comes to the sort, where it is slower than it should be)
Agreed, the python sort function, while state of the art, is inherently single threaded. Also due to the Python Global Interpreter Lock (GIL) there's really no way to do good parallel sorts in pure python. A parallel sort in in the API would be very useful!
2. CDLP (community detection with label propagation): sorts each row of a matrix to compute the mode of each row.
I have a hunch the Louvain community detection method could benefit from this as well by first permuting the graph by out-degree order. I suspect having larger initial wave fronts would group larger changes per iteration, but I haven't explored this further.
I'm sure there are more uses for sorting a matrix or vector.
Maybe Cuthill McKee?
https://en.wikipedia.org/wiki/Cuthill%E2%80%93McKee_algorithm
I think we need a GrB_Vector_sort (&P, X, op, descriptor) that sorts the vector X, in place, using the GrB_BinaryOp op to compare entries (GrB_LT_FP64 would sort a vector X in ascending order, for example). The output vector X would have the same size as on input. If X had e entries on input, then on output X[0...e-1] would hold all the entries in ascending order. The output vector P would be of type GrB_UINT64 and would hold the row indices of the entries in X, in P[0..e-1]. If &P is NULL it would not be returned.
+1
A GrB_Matrix_sort would do the same, except it would sort each row of the matrix X. The output matrix P would have the same dimensions as X, and P(i, 0...e-1) would hold the row indices of the sorted X(i, 0...e-1). If GrB_DESC_T0 is used, it would sort each column of the matrix.
If op is NULL, then no sort would be done. The GrB_Matrix_sort (&P, X, NULL, NULL) would return a matrix X with all entries in X(i,:) moved to the "left", in X(i,0..e-1) if row i of X had e entries. This would allow Jeremy's query: "Give me the 4th entry in each row of a matrix". All it would require is GrB_extract to get X(:,3) and P(:,3).
Yeah I think finding the ordinal position of entries like that would be quite useful!
Could NULL
be passed in for either of the output arguments X
and P
(you mentioned it could for P
, but what about X
?)? This is traditionally two different operations, such as sort
and argsort
, and the user may only want one of them.
The "move everything to the left" operation can be useful--I use it to do parallel prefix sum. But, this seems like a completely different operation. Why shoehorn it into sort
?
We could do easily do parallel sort in Python using numba
where each row can be sorted independently.
Yes, I think either P and/or X could be NULL. Probably they should be created first and passed in as P and X, not &P or &X. And X should not always be sorted in-place, but an output matrix could be specified, say C, then if X is passed in as C, it would be done in-place.
The "move everything to the left" would be a byproduct of the sort. It would just be the sorted data itself.
The "move everything to the left" would be a byproduct of the sort. It would just be the sorted data itself.
Yeah, thanks, I didn't catch that upon first reading.
If the output could be smaller, then this could also calculate nsmallest
or nlargest
directly and more efficiently, because it wouldn't need to do the full sort.
Right. The same thing is true if you want to answer the query "In each row A(i,:), give me a matrix C where C(i,:) has just the first 5 entries in A(i,:), looking left to right". That computation is needed in the LAGraph_ConnectedComponents method. That would be best solved with a kind of GrB_select but with a different kind of operator (one that has "k" as an input, if A(i,j) is the kth entry in its row i, or in its column j, depending on the descriptor, say).
So the GrB_*sort methods wouldn't be used everywhere, but they'd be very useful and fast when a complete sort is actually needed.
Maybe Cuthill McKee? https://en.wikipedia.org/wiki/Cuthill%E2%80%93McKee_algorithm
Yes, that's a perfect example. Cuthill-McKee is a revised BFS where each level needs to be sorted by degree. This could be done one frontier at a time, in the inner loop of the LAGraph BFS. Or perhaps it could be done at the end, if the sort metric is, say, level * n + degree.
Either way, it needs a sort.
Here is a possible specification. The "implemenation-specific" comments would not be part of a spec; just what I would do in SuiteSparse:GraphBLAS. Feedback welcome:
GxB_Matrix_sort (C, P, op_lt, op_eq, A, descriptor) ; // ascending sort
GxB_Matrix_sort (C, P, op_gt, op_eq, A, descriptor) ; // descending sort
GxB_Matrix_sort: sorts each row, or each column of a GrB_Matrix A
descriptor.direction = GxB_BY_ROW or GxB_BY_COL
C or P can be NULL (but not both), which means they are not computed
P must have type GrB_INT64 or GrB_UINT64
C has any type, including user-defined, must match the type of A
(no typecasting performed)
C, P, and A must have the same dimensions
original contents of C and P are deleted
op_lt / op_gt, and op_eq are GrB_BinaryOps, and must return bool
op_lt (x,y) is x < y, and is used to indicate ascending sort
op_gt (x,y) is x > y, and is used to indicate descending sort
op_eq (x,y) is x == y
if A has a user-defined type, the user application can create
user-defined ops that return bool.
if by row, and suppose op_lt is used:
C (i, 0:k-1) contains the entries in A(i,:), in ascending order,
where k is the # of entries in A(i,:)
P (i, 0:k-1) contains the column indices of the same entries
in A(i,:)
ties in value are broken by column index
implementation-specific: C and P are returned as CSR or HyperCSR
if op_gt is used: same, except sort in descending order
if by col:
C (0:k-1, j) contains the entries in A(:,j), in ascending order,
where k is the # of entries in A(:,j)
P (0:k-1, j) contains the column indices of the same entries
in A(:,j)
ties in value are broken by row index
implementation-specific: C and P are returned as CSC or HyperCSC
if op_gt is used: same, except sort in descending order
GxB_Vector_sort (w, p, op_lt or op_gt, op_eq, u, descriptor)
same as GxB_Matrix_sort, except descriptor.direction is ignored.
w, p, and u are all vectors of the same size.
The matrix or vector could be sorted in place if called this way:
GxB_Matrix_sort (C, P, op_lt, op_eq, C, descriptor)
MATLAB has a similar feature, where the outputs B and I are like my usage of C and P for GxB_Matrix_sort.
>> help sort
SORT Sort in ascending or descending order.
B = SORT(A) sorts in ascending order.
The sorted output B has the same type and size as A:
- For vectors, SORT(A) sorts the elements of A in ascending order.
- For matrices, SORT(A) sorts each column of A in ascending order.
- For N-D arrays, SORT(A) sorts along the first non-singleton dimension.
B = SORT(A,DIM) also specifies a dimension DIM to sort along.
B = SORT(A,DIRECTION) and B = SORT(A,DIM,DIRECTION) also specify the
sort direction. DIRECTION must be:
'ascend' - (default) Sorts in ascending order.
'descend' - Sorts in descending order.
B = SORT(A,...,'MissingPlacement',M) also specifies where to place the
missing elements (NaN/NaT/<undefined>/<missing>) of A. M must be:
'auto' - (default) Places missing elements last for ascending sort
and first for descending sort.
'first' - Places missing elements first.
'last' - Places missing elements last.
B = SORT(A,...,'ComparisonMethod',C) specifies how to sort complex
numbers. The comparison method C must be:
'auto' - (default) Sorts real numbers according to 'real', and
complex numbers according to 'abs'.
'real' - Sorts according to REAL(A). Elements with equal real parts
are then sorted by IMAG(A).
'abs' - Sorts according to ABS(A). Elements with equal magnitudes
are then sorted by ANGLE(A).
[B,I] = SORT(A,...) also returns a sort index I which specifies how the
elements of A were rearranged to obtain the sorted output B:
- If A is a vector, then B = A(I).
- If A is an m-by-n matrix and DIM = 1, then
for j = 1:n, B(:,j) = A(I(:,j),j); end
The sort ordering is stable. Namely, when more than one element has the
same value, the order of the equal elements is preserved in the sorted
output B and the indices I relating to equal elements are ascending.
I've drafted two methods. It turns out the op_eq is not needed, just op_lt. https://github.com/DrTimothyAldenDavis/GraphBLAS/blob/a10087756ba302aa9e777516bf3e0efa027cdc71/Include/GraphBLAS.h#L11172
The signature GxB_Matrix_sort (C, P, op, A, descriptor)
returns 2 matrices C and P (either one may be NULL, to indicate the result is not needed). The output C and P are described above. op is a GrB_BinaryOp, and to sort double values in ascending order, GrB_LT_FP64 can be used. The 2 inputs to the op must have the same type, and the output of op must be boolean. You can sort with typecasting, so that GrB_LT_INT64, for example, when applied to a matrix A of type GrB_FP64, would first cast the entries to INT64. The comparison would be
bool lessthan = (int64_t) a < (int64_t) b
where a and be are double, say, and are entries from A. I've drafted and partially tested GxB_Matrix_sort and GxB_Vector_sort. The Matrix sort can sort either rows or columns of A. A can also be sorted in place, with GxB_Matrix_sort (A, ... op, A, ...).
GxB_Matrix_sort and GxB_Vector_sort are now fully documented and tested. They will appear in the pre-release v5.2.0.alpha14 and v6.0.0.alph14 shortly.
I'm coming across more and more graph algorithms that need to sort a vector, and some need to sort a matrix. We have some LAGraph_Sort methods but they are not sufficient.
Sorting a vector or matrix comes up in these methods:
I'm sure there are more uses for sorting a matrix or vector.
I think we need a GrB_Vector_sort (&P, X, op, descriptor) that sorts the vector X, in place, using the GrB_BinaryOp op to compare entries (GrB_LT_FP64 would sort a vector X in ascending order, for example). The output vector X would have the same size as on input. If X had e entries on input, then on output X[0...e-1] would hold all the entries in ascending order. The output vector P would be of type GrB_UINT64 and would hold the row indices of the entries in X, in P[0..e-1]. If &P is NULL it would not be returned.
A GrB_Matrix_sort would do the same, except it would sort each row of the matrix X. The output matrix P would have the same dimensions as X, and P(i, 0...e-1) would hold the row indices of the sorted X(i, 0...e-1). If GrB_DESC_T0 is used, it would sort each column of the matrix.
I don't think this method could take a mask or accumulator, since it has two outputs P and X.
If op is NULL, then no sort would be done. The GrB_Matrix_sort (&P, X, NULL, NULL) would return a matrix X with all entries in X(i,:) moved to the "left", in X(i,0..e-1) if row i of X had e entries. This would allow Jeremy's query: "Give me the 4th entry in each row of a matrix". All it would require is GrB_extract to get X(:,3) and P(:,3).
Computing the median or mode of each row of a matrix after a sort would be simple, but this might need yet another GrB_* method. Perhaps GrB_Matrix_statistic (&C, P, X, method, descriptor) would take in the output P and X of a GrB_Matrix_sort, and return a vector C with the mode, median, etc (chosen by the method parameter). This needs some more thought, but it's clear to me that some kind of sorting mechanism is needed to sort the rows/cols of a GrB_Matrix, or the entries of a single GrB_Vector.