Improvements for local clustering coefficients

pcostanza commented 1 year ago

I have made some improvements to the algorithm for computing local clustering coefficients in LAGraph. Specifically:

I added a simple implementation, to allow the test code to check against a baseline for all inputs, not only for west0067.mtx. The code is based on the description at https://en.wikipedia.org/wiki/Clustering_coefficient
I changed the algorithm itself to accept LAGraph_Graph input instead of GrB_Matrix input, and the algorithm now uses LAGraph properties to check whether the graph is symmetric, or has self edges.
I made some minor modifications to the algorithm itself to try to look at the structure of the input rather than the values.

In some simple tests, I don’t see a performance improvement, nor a degradation. But hopefully this helps us to perform further experiments with this algorithm.

Pascal

szarnyasg commented 1 year ago

Hi Pascal, thanks for opening this PR! One thing that pops out is that your code is that it is mixing bool/int/fp types. I initially made the same change. Then I went through the LCC codebase with Tim Davis in a meeting, where he suggested to use FP64 values everywhere in this algorithm – otherwise, we may get generic SuiteSparse:GraphBLAS operations which use lots of typecasting. His recommendation was to set burble mode on (GxB_set (GxB_BURBLE, true) ;), read the output and ensure that there are no generic operations.

pcostanza commented 1 year ago

Hi Gabor. Thanks a lot for the feedback. I changed the code to use FP64 values everywhere, as you suggested, and indeed, this leads to be significant performance improvement. Thanks a lot. (It's added to the PR.)

There are still generic calls because I'm testing against SuiteSparse:GraphBLAS 7. LAGraph seems to have issues with SuiteSparse:GraphBLAS 8 due to the removal of GrB_SelectOp. Is somebody already adapting this?

DrTimothyAldenDavis commented 1 year ago

The dev branch has a fix for LAGraph that removes the GxB_SelectOp and uses a GrB_IndexUnaryOp instead.

pcostanza commented 1 year ago

I have tried the algorithm in different configurations. I noticed the following variations in the different versions of the algorithm that I have seen:

Sanitize: In the original algorithm, there is an attempt to keep the original matrix as long as possible, and to only transform it to binary form if really necessary (“sanitize”). However, that doesn’t work in all cases (as is stated in a comment and confirmed by experiments), so I also created a version that always performs the transformation unconditionally. (It still only removes the self edges if there are any.)
Bin method: The method to perform the actual transformation into binary form is either by an apply with a oneb operator, or by assigning 1 while using the input matrix as a structural mask.
Add method: When the matrix is not symmetric, it has to be added to its own transpose to create a symmetric matrix. This is either by an element-wise union with a true scalar, or by an element-wise addition with a oneb operator.
Reduce method: The number of elements per row needs to determined. This is either done by a matrix-vector multiply using a plus_one operator, or by a reduction using a plus operator.
Mul method: The matrix-matrix multiplication towards the end uses a plus_times_semiring by default, but can also use a plus_one semiring when the input matrix is symmetric.

I have run the benchmark for the road and the web data sets from the GAP benchmark suite, with all combinations (2^5), three times each. I don’t see significant variations in performance between the different combinations.

The strongest impact seems to be the sanitize variation: Unconditionally performing the transformation into binary form generally seems to yield better performance than trying to avoid it. (I don’t know if the performance win comes from this particular step, or if the impact is from later steps in the algorithm.) Since this also produces a version of the algorithm that is correct in all cases, it seems strongly advisable to use that one.

The other variations don’t seem to matter much. For the bin method, using the apply rather than the assign operation seems to be slightly better. For the remaining variations (add, reduce, mul), the simpler forms have a very slight tendency to perform better, but this might be because of other random effects. Nevertheless, it might then be better to prefer the simpler code (avoiding the creation of the true scalar; avoiding the zeros vector; and avoiding the distinction between different semirings for the matrix-matrix multiplication).

I have modified my pull request accordingly.

DrTimothyAldenDavis commented 1 year ago

For the # of entries per row of a matrix, see LAGraph_Cached_OutDegree. That's the fastest method. It uses matrix-vector-multiply with a plus_one semiring and a dense vector. The time taken by SSGrB is O (# of nonempty rows of A) because this case triggers a special case. A dense vector x means that for v=A*x, the value of v(i) is just the # of entries in A(i,:). But I know that already so I don't have to traverse A at all.

DrTimothyAldenDavis commented 1 year ago

The construction of x vector in LAGraph_Cached_OutDegree takes O(1) time and space, as well.

DrTimothyAldenDavis commented 1 year ago

I revised the computation of the vector W to use this faster method. I also fixed some compiler errors in the JIT (there were syntax errors in the strings). The result is pushed to the dev branch. It works with my latest v8.0.0.draft8.

GraphBLAS / LAGraph

Improvements for local clustering coefficients #185