DrTimothyAldenDavis / SuiteSparse

The official SuiteSparse library: a suite of sparse matrix algorithms authored or co-authored by Tim Davis, Texas A&M University.
https://people.engr.tamu.edu/davis/suitesparse.html

Concurrent write support for matrices/vectors #300

Closed: alexandergunnarson closed this issue 1 year ago

alexandergunnarson commented 1 year ago

**Is your feature request related to a problem? Please describe.**
Right now, SuiteSparse:GraphBLAS matrices and vectors support concurrent reads, but writes are single-threaded only and block reads. This necessitates workarounds for the following real-world scenarios that we're encountering:

**Describe the solution you'd like**
I'd like to find a better alternative to the workarounds above. I realize this is an enormous undertaking, akin to saying "I'd like the performance of an array (or nearly), but I'd like it concurrency-friendly, please." Millions of person-hours have been spent solving just that problem. However, along the way we've collectively come up with thousands of data structures and approaches that might be worth exploring. Some include:

**Describe alternatives you've considered**
See the problem description above.

**Additional context**
See above.

As always, I'm continually impressed by SuiteSparse:GraphBLAS and appreciate all the work you've put into it.

DrTimothyAldenDavis commented 1 year ago

I don't think this is possible in general. First, it breaks the GraphBLAS C API specification for two user threads to write to the same matrix at the same time. Second, it would destroy any kind of complex parallel algorithm that computes an output matrix C if, at the same time, some other algorithm is writing to C. Sparse matrix methods typically require a symbolic analysis phase first, to determine the sparsity structure of the result and to create the parallel tasks that compute it. It would not be possible for another thread to interrupt that process to write into the same matrix. In other words, trying to do:

```c
C = A*B ;       // on one thread
C(2,3) = 42 ;   // on another thread, in parallel
```

would not work. The only place this would work would be scalar methods that operate on the same matrix but which just enter a single entry, like C(3,4)=42 working at the same time as C(4,4) = 99. It might also work in some uses of GrB_assign that could augment the matrix with "pending tuples". I do this in parallel internally (see the 2-phase methods in Source/GB_subassign*.c for example). Two user threads could cooperate, perhaps, and each append their own set of pending tuples.

Still, any extension like this would break the C API specification; it's outside the scope of GraphBLAS. As a result, I've designed the entire library around the concept that any GrB* function "owns" its output matrix, and its input matrices are all immutable (sort of).

Actually, input matrices are not immutable unless GrB_wait has been called on them first. That's also in the spec. Currently, GrB_mxm (C, M, ... A, B,...) can modify all 4 matrices. If GrB_wait is called on M, A, and B first, then they become read-only and only C is modified.
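As a sketch of that pattern, assuming the GraphBLAS C API v2.0 signatures (`GrB_wait` with a wait mode, and the predefined `GrB_PLUS_TIMES_SEMIRING_FP64` semiring; this fragment is not from the library itself):

```c
#include "GraphBLAS.h"

GrB_Info mxm_with_readonly_inputs
(
    GrB_Matrix C, GrB_Matrix M, GrB_Matrix A, GrB_Matrix B
)
{
    // finish all pending work on M, A, and B; after this they are
    // effectively read-only, so other threads may safely read them
    GrB_wait (M, GrB_MATERIALIZE) ;
    GrB_wait (A, GrB_MATERIALIZE) ;
    GrB_wait (B, GrB_MATERIALIZE) ;

    // now only C is modified by this call
    return (GrB_mxm (C, M, NULL, GrB_PLUS_TIMES_SEMIRING_FP64, A, B, NULL)) ;
}
```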

What kind of scenario do you see where it's necessary to write to a matrix in parallel?

DrTimothyAldenDavis commented 1 year ago

The other problem with this idea is that it breaks any kind of future fusion. The non-blocking mode allows me to rearrange calls to GrB* methods, delay them, fuse them, and so on. I don't do that yet, but I hope to in the future.

My exploitation of non-blocking mode is currently limited to lazy deletions (zombies), lazy inserts (pending tuples), lazy sorts (where the rows/columns contain indices that are out of order), and lazy construction of the hyperhash for hypersparse matrices. But in the future, I could merge entire calls to GrB methods and perform them in a fused manner. If concurrent writes were happening instead, it would break any hope of fusion.

DrTimothyAldenDavis commented 1 year ago

Another limitation is the GPU kernels. Synchronizing between two thread blocks in CUDA is very difficult, and it's not how the GPU should be used in general. As a result, I think a CUDA implementation of concurrent writes to a matrix would be extremely difficult.