DrTimothyAldenDavis / SuiteSparse

The official SuiteSparse library: a suite of sparse matrix algorithms authored or co-authored by Tim Davis, Texas A&M University.
https://people.engr.tamu.edu/davis/suitesparse.html

Concurrent write support for matrices/vectors #300

Closed: alexandergunnarson closed this issue 1 year ago

alexandergunnarson commented 1 year ago

**Is your feature request related to a problem? Please describe.**
Right now, SuiteSparse:GraphBLAS matrices and vectors support concurrent reads, but writes are single-threaded only and block reads. This necessitates workarounds for the following real-world scenarios that we're encountering:

**Describe the solution you'd like**
I'd like to find a better alternative to the workarounds above. I realize this is an enormous undertaking, akin to saying "I'd like the performance of an array (or nearly), but I'd like it concurrency-friendly, please." Millions of person-hours have been spent solving just that problem. However, along the way we've collectively come up with thousands of data structures and approaches that might be worth exploring. Some include:

**Describe alternatives you've considered**
See the problem description above.

**Additional context**
See above.

As always, I'm continually impressed by SuiteSparse:GraphBLAS and appreciate all the work you've put into it.

DrTimothyAldenDavis commented 1 year ago

I don't think this is possible in general. First, it breaks the GraphBLAS C API specification for two user threads to write to the same matrix at the same time. Second, it would destroy any kind of complex parallel algorithm that computes an output matrix C if, at the same time, some other algorithm is writing to C. Sparse matrix methods typically require a symbolic analysis phase first, to determine the sparsity structure of the result and to create the parallel tasks that compute it. It would not be possible for another thread to interrupt that process to write into the same matrix. In other words, trying to do:

```c
C = A*B ;       // on one thread
C(2,3) = 42 ;   // on another thread, in parallel
```

would not work. The only place this would work would be scalar methods that operate on the same matrix but which just enter a single entry, like C(3,4)=42 working at the same time as C(4,4) = 99. It might also work in some uses of GrB_assign that could augment the matrix with "pending tuples". I do this in parallel internally (see the 2-phase methods in Source/GB_subassign*.c for example). Two user threads could cooperate, perhaps, and each append their own set of pending tuples.

Still, any extension like this would break the C API specification; it's outside the scope of GraphBLAS. As a result, I've designed the entire library around the concept that any GrB* function "owns" its output matrix, and its input matrices are all immutable (sort of).

Actually, input matrices are not immutable unless GrB_wait has been called on them first. That's also in the spec. Currently, GrB_mxm (C, M, ... A, B,...) can modify all 4 matrices. If GrB_wait is called on M, A, and B first, then they become read-only and only C is modified.
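As a sketch of that pattern, assuming the GraphBLAS C API v2.0 signatures (`GrB_wait` with a wait mode, and the predefined `GrB_PLUS_TIMES_SEMIRING_FP64` semiring; this fragment is not from the library itself):

```c
#include "GraphBLAS.h"

GrB_Info mxm_with_readonly_inputs
(
    GrB_Matrix C, GrB_Matrix M, GrB_Matrix A, GrB_Matrix B
)
{
    // finish all pending work on M, A, and B; after this they are
    // effectively read-only, so other threads may safely read them
    GrB_wait (M, GrB_MATERIALIZE) ;
    GrB_wait (A, GrB_MATERIALIZE) ;
    GrB_wait (B, GrB_MATERIALIZE) ;

    // now only C is modified by this call
    return (GrB_mxm (C, M, NULL, GrB_PLUS_TIMES_SEMIRING_FP64, A, B, NULL)) ;
}
```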

What kind of scenario do you see where it's necessary to write to a matrix in parallel?

DrTimothyAldenDavis commented 1 year ago

The other problem with this idea is that it breaks any kind of future fusion. The non-blocking mode allows me to rearrange calls to GrB* methods, delay them, fuse them, and so on. I don't do that yet, but I hope to in the future.

My exploitation of non-blocking mode is currently limited to lazy deletions (zombies), lazy inserts (pending tuples), lazy sorts (where the rows/columns contain indices that are out of order), and lazy construction of the hyperhash for hypersparse matrices. But in the future, I could merge entire calls to GrB methods and perform them in a fused manner. If concurrent writes were happening instead, it would break any hope of fusion.

DrTimothyAldenDavis commented 1 year ago

Another limitation is the GPU kernels. Synchronizing between two thread blocks in CUDA is very difficult, and it's not how the GPU should be used in general. As a result, I think a CUDA implementation of concurrent writes to a matrix would be extremely difficult.