GraphBLAS / graphblas-api-cpp

GraphBLAS C++ API Specification.
https://graphblas.org/graphblas-api-cpp/

Document the parallel systems we need to address and define a supporting platform model that works for GraphBLAS #14

Open tgmattso opened 1 year ago

tgmattso commented 1 year ago

We need to document the parallel systems we must be able to support with GraphBLAS. This would include:

We need a platform model that appropriately abstracts systems composed of the above. It must deal with the complexity of the various memory spaces and support arbitrary, dynamic partitions of the above.

Finally, we need a way to deal with nonblocking GraphBLAS operations as part of a larger execution context that supports asynchronous execution. I will add a separate issue for this topic.

DrTimothyAldenDavis commented 1 year ago

I plan on starting a simple extension to address this, with a GxB_Context object. Its use is optional and will not affect results, just performance.

I will have methods as follows:

There could be a default global context that all user threads share, say GxB_CONTEXT_WORLD, analogous to MPI_COMM_WORLD. I'm unsure whether it should be modifiable. If it is, modifying it would act like a call to omp_set_num_threads(...).
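A sketch of what a modifiable world context might look like (the names are from the proposal above; the exact setter form is an assumption):

    // sketch only: a modifiable world context, shared by all user threads
    // that have not engaged a context of their own
    GxB_Context_set (GxB_CONTEXT_WORLD, GxB_NTHREADS, 4) ;

    // GrB calls made without an engaged per-thread context would now use at
    // most 4 threads, much like a global omp_set_num_threads (4)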

To use the context object:

In the future, if every GrB method had a descriptor, then perhaps that could be used instead of GxB_engage/disengage. The context could be engaged/disengaged to one or more descriptors. Workable, but tedious if you have an LAGraph method with many different descriptors, and you want one user thread to call the LAGraph method to do some work in a single context. The LAGraph method would be ugly if it had to do this, and ugly means it would be a bad design in my view.

The context should be simple: (1) create and set a Context, (2) use it implicitly in all GrB calls, then (3) disengage and free it. Attaching a context to a descriptor seems awkward and a shoe-horn to me. So at least at first, I will not connect the GrB_Descriptor to the GxB_Context.

I currently allow the user application to set the # of threads globally, or in a GrB_Descriptor. I will still keep that as an undocumented feature ("historical"), and deprecate it in favor of the GxB_Context.
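For reference, those historical settings look roughly like this (a sketch of the GxB_set forms; they are not part of the new proposal):

    // global setting: all subsequent GrB calls use at most 8 threads
    GxB_set (GxB_NTHREADS, 8) ;

    // per-call setting: any GrB operation passed this descriptor uses at
    // most 4 threads, overriding the global setting
    GrB_Descriptor desc = NULL ;
    GrB_Descriptor_new (&desc) ;
    GxB_set (desc, GxB_NTHREADS, 4) ;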

I'm very close to starting work on the topics above, so any feedback now will be very useful. What I work out might be useful in a future spec as well. I must have this GxB_Context object if there will be any hope of an OpenMP parallel user application making good use of OpenMP parallelism within GraphBLAS, or making use of one or more GPUs.

One thing Tim M. didn't include in his list is how these resources will be used when aggressively exploiting non-blocking mode. I think that the context object will simplify this, but this is in the future (for me).

Say there are lots of user threads, and I'm trying to keep track of a computational DAG of GrB calls that I have not yet computed. I traverse the DAG, looking for things to optimize, execute, and so on. That is not trivial on its own, but imagine if this traversal also had to be thread-safe, with many user threads modifying the DAG at the same time. That's a nightmare; I would need a parallel asynchronous data structure and algorithms for it.

Instead, I would insist that any such optimization be done by a user thread with its own context object, and I would state that any matrices modified by that user thread go into a DAG within that specific context. If multiple user threads modified matrices within the same context at the same time, results would be undefined. Then I don't have to create a parallel data structure, and it will be a lot easier. Engaging a context would start a new, empty DAG of pending computations; disengaging it would imply a block, where all pending computations done while engaged must be finished.
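A sketch of that usage rule, using the draft method names (a fragment that assumes matrices A, B, C, and D already exist; the predefined semiring is just a placeholder, and the pending-DAG behavior in the comments is future work, not current behavior):

    // one user thread, one context: its GrB calls form an ordered list that
    // could be kept as a pending DAG inside this context (future behavior)
    GxB_Context Context = NULL ;
    GxB_Context_new (&Context) ;
    GxB_Context_engage (Context) ;

    GrB_mxm (C, NULL, NULL, GrB_PLUS_TIMES_SEMIRING_FP64, A, B, NULL) ;
    GrB_mxm (D, NULL, NULL, GrB_PLUS_TIMES_SEMIRING_FP64, C, B, NULL) ;

    // disengaging implies a block: all pending work in this context's DAG
    // must be finished here
    GxB_Context_disengage (Context) ;
    GxB_Context_free (&Context) ;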

I don't see the need to assign matrices / vectors / scalars to a particular context. GrB_wait is fine for synchronizing between multiple user threads.

I'm a long way from creating a DAG of GrB calls for doing kernel fusion, but I want to think ahead. With this context object, I can effectively treat any set of GrB calls as a single ordered list of calls, from a single user thread. Then I can rearrange them and fuse them more easily, with no worries about other user threads making changes while I'm analyzing the set of calls to GrB that I have so far.

DrTimothyAldenDavis commented 1 year ago

I got my draft GxB_Context object working and the results are great ... except that an older gcc compiler (v9.4.0, not very old) struggles with nested parallelism. Both gcc 12.2.0 and icx 2022 work great. See this discussion: https://github.com/GraphBLAS/graphblas-api-c/issues/74

and the results I've posted in my tagged version of SuiteSparse:GraphBLAS (v8.0.0.draft1):

https://github.com/DrTimothyAldenDavis/GraphBLAS/blob/v8.0.0.draft.1/Demo/Program/context_demo.c
https://github.com/DrTimothyAldenDavis/GraphBLAS/blob/v8.0.0.draft.1/Demo/context_demo.out

Using this GxB_Context object to get nested parallelism is very easy. It would be harder to do the same thing with descriptors, even if all GrB methods and operations took a descriptor.

Here's what the demo looks like, simplified a bit. The code builds nmat matrices with GrB_Matrix_build, from the same I,J,X (useless, of course, since each of the constructed matrices has the same content, but it's a simple test).

    #pragma omp parallel for num_threads (nouter) schedule (dynamic, 1)
    for (int k = 0 ; k < nmat ; k++)
    {
        // each user thread constructs its own context
        GxB_Context Context = NULL ;
        GxB_Context_new (&Context) ;
        GxB_Context_set (Context, GxB_NTHREADS, ninner) ;
        GxB_Context_engage (Context) ;

        // kth user thread builds kth matrix with ninner threads
        GrB_Matrix A = NULL ;
        GrB_Matrix_new (&A, GrB_FP64, n, n) ;
        GrB_Matrix_build (A, I, J, X, nvals, GrB_PLUS_FP64) ;

        // free the matrix just built
        GrB_Matrix_free (&A) ;

        // each user thread frees its own context
        GxB_Context_disengage (Context) ;
        GxB_Context_free (&Context) ;
    }

I can explain the notion of "engage" and "disengage" for the GxB_Context. Briefly, I keep a threadprivate object, called GB_CONTEXT_THREAD, here:

https://github.com/DrTimothyAldenDavis/GraphBLAS/blob/7752ae2d3d415c9f15b9d5c9d952b84282defe60/Source/GB_Context.c#L26

The "engage" operation simply does

GB_CONTEXT_THREAD = Context ;

and "disengage" does

GB_CONTEXT_THREAD = NULL ;

The user cannot access this threadprivate variable. There is a built-in world context, GxB_CONTEXT_WORLD, that is user visible and always non-NULL. Setting its nthreads controls the number of threads a GrB function uses when GB_CONTEXT_THREAD is NULL. If GB_CONTEXT_THREAD is not NULL, then its settings are used inside GrB instead.

SuiteSparse:GraphBLAS doesn't use nested parallelism itself, so all I have to do is use the nthreads setting from GxB_CONTEXT_WORLD or GB_CONTEXT_THREAD to control all my parallel regions, which all have a num_threads(...) clause.
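Put together, the internal pattern is roughly this (a sketch: GB_example_kernel is a made-up name, and the GxB_Context_get call stands in for however the internal code actually reads the setting):

    // the threadprivate variable, visible only inside the library
    static GxB_Context GB_CONTEXT_THREAD = NULL ;
    #pragma omp threadprivate (GB_CONTEXT_THREAD)

    static void GB_example_kernel (int64_t nwork)
    {
        // use the engaged context if there is one, else the world context
        GxB_Context Context =
            (GB_CONTEXT_THREAD != NULL) ? GB_CONTEXT_THREAD : GxB_CONTEXT_WORLD ;
        int32_t nthreads = 1 ;
        GxB_Context_get (Context, GxB_NTHREADS, &nthreads) ;

        // every internal parallel region has a num_threads clause like this
        #pragma omp parallel for num_threads (nthreads)
        for (int64_t k = 0 ; k < nwork ; k++)
        {
            // ... do the kth independent task ...
        }
    }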

DrTimothyAldenDavis commented 1 year ago

The GxB_Context also contains information on which GPU to use, so this same code could be used to compute on multiple GPUs, very easily (once we have a set of GPU kernels for GrB_Matrix_build, of course).