Closed pcostanza closed 1 year ago
Hi Pascal, thanks for opening this PR!
One thing that pops out is that your code is that it is mixing bool/int/fp types. I initially made the same change. Then I went through the LCC codebase with Tim Davis in a meeting, where he suggested to use FP64 values everywhere in this algorithm – otherwise, we may get generic
SuiteSparse:GraphBLAS operations which use lots of typecasting. His recommendation was to set burble mode on (GxB_set (GxB_BURBLE, true) ;
), read the output and ensure that there are no generic operations.
Hi Gabor. Thanks a lot for the feedback. I changed the code to use FP64 values everywhere, as you suggested, and indeed, this leads to be significant performance improvement. Thanks a lot. (It's added to the PR.)
There are still generic calls because I'm testing against SuiteSparse:GraphBLAS 7. LAGraph seems to have issues with SuiteSparse:GraphBLAS 8 due to the removal of GrB_SelectOp. Is somebody already adapting this?
The dev branch has a fix for LAGraph that removes the GxB_SelectOp and uses a GrB_IndexUnaryOp instead.
I have tried the algorithm in different configurations. I noticed the following variations in the different versions of the algorithm that I have seen:
Sanitize: In the original algorithm, there is an attempt to keep the original matrix as long as possible, and to only transform it to binary form if really necessary (“sanitize”). However, that doesn’t work in all cases (as is stated in a comment and confirmed by experiments), so I also created a version that always performs the transformation unconditionally. (It still only removes the self edges if there are any.)
Bin method: The method to perform the actual transformation into binary form is either by an apply with a oneb operator, or by assigning 1 while using the input matrix as a structural mask.
Add method: When the matrix is not symmetric, it has to be added to its own transpose to create a symmetric matrix. This is either by an element-wise union with a true scalar, or by an element-wise addition with a oneb operator.
Reduce method: The number of elements per row needs to determined. This is either done by a matrix-vector multiply using a plus_one operator, or by a reduction using a plus operator.
Mul method: The matrix-matrix multiplication towards the end uses a plus_times_semiring by default, but can also use a plus_one semiring when the input matrix is symmetric.
I have run the benchmark for the road and the web data sets from the GAP benchmark suite, with all combinations (2^5), three times each. I don’t see significant variations in performance between the different combinations.
The strongest impact seems to be the sanitize variation: Unconditionally performing the transformation into binary form generally seems to yield better performance than trying to avoid it. (I don’t know if the performance win comes from this particular step, or if the impact is from later steps in the algorithm.) Since this also produces a version of the algorithm that is correct in all cases, it seems strongly advisable to use that one.
The other variations don’t seem to matter much. For the bin method, using the apply rather than the assign operation seems to be slightly better. For the remaining variations (add, reduce, mul), the simpler forms have a very slight tendency to perform better, but this might be because of other random effects. Nevertheless, it might then be better to prefer the simpler code (avoiding the creation of the true scalar; avoiding the zeros vector; and avoiding the distinction between different semirings for the matrix-matrix multiplication).
I have modified my pull request accordingly.
For the # of entries per row of a matrix, see LAGraph_Cached_OutDegree. That's the fastest method. It uses matrix-vector-multiply with a plus_one semiring and a dense vector. The time taken by SSGrB is O (# of nonempty rows of A) because this case triggers a special case. A dense vector x means that for v=A*x, the value of v(i) is just the # of entries in A(i,:). But I know that already so I don't have to traverse A at all.
The construction of x vector in LAGraph_Cached_OutDegree takes O(1) time and space, as well.
I revised the computation of the vector W to use this faster method. I also fixed some compiler errors in the JIT (there were syntax errors in the strings). The result is pushed to the dev branch. It works with my latest v8.0.0.draft8.
I have made some improvements to the algorithm for computing local clustering coefficients in LAGraph. Specifically:
In some simple tests, I don’t see a performance improvement, nor a degradation. But hopefully this helps us to perform further experiments with this algorithm.
Pascal