GraphBLAS / graphblas-api-cpp

GraphBLAS C++ API Specification.
https://graphblas.org/graphblas-api-cpp/

Document the parallel systems we need to address and define a supporting platform model that works for GraphBLAS #14

Open tgmattso opened 1 year ago

tgmattso commented 1 year ago

We need to document the parallel systems we must be able to support with GraphBLAS. This would include:

We need a platform model that appropriately abstracts systems composed of the above. It must deal with the complexity of the various memory spaces and support arbitrary, dynamic partitions of the above.

Finally, we need a way to deal with nonblocking GraphBLAS operations as part of a larger execution context that supports asynchronous execution. I will add a separate issue for this topic.

DrTimothyAldenDavis commented 1 year ago

I plan on starting a simple extension to address this, with a GxB_Context object. Its use is optional and will not affect results, just performance.

I will have methods as follows:

There could be a default global context that all user threads share, say GxB_CONTEXT_WORLD, analogous to MPI_COMM_WORLD. I'm unsure whether it should be modifiable. If it is, modifying it would act like a call to omp_set_num_threads(...).
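A sketch of what a modifiable world context might look like (the names are from the proposal above; the exact setter form is an assumption):

    // sketch only: a modifiable world context, shared by all user threads
    // that have not engaged a context of their own
    GxB_Context_set (GxB_CONTEXT_WORLD, GxB_NTHREADS, 4) ;

    // GrB calls made without an engaged per-thread context would now use at
    // most 4 threads, much like a global omp_set_num_threads (4)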

To use the context object:

In the future, if every GrB method had a descriptor, then perhaps that could be used instead of GxB_engage/disengage. The context could be engaged/disengaged to one or more descriptors. Workable, but tedious if you have an LAGraph method with many different descriptors, and you want one user thread to call the LAGraph method to do some work in a single context. The LAGraph method would be ugly if it had to do this, and ugly means it would be a bad design in my view.

The context should be simple: (1) create and set a Context, (2) use it implicitly in all GrB calls, then (3) disengage and free it. Attaching a context to a descriptor seems awkward and a shoe-horn to me. So at least at first, I will not connect the GrB_Descriptor to the GxB_Context.

I currently allow the user application to set the # of threads globally, or in a GrB_Descriptor. I will still keep that as an undocumented feature ("historical"), and deprecate it in favor of the GxB_Context.
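For reference, those historical settings look roughly like this (a sketch of the GxB_set forms; they are not part of the new proposal):

    // global setting: all subsequent GrB calls use at most 8 threads
    GxB_set (GxB_NTHREADS, 8) ;

    // per-call setting: any GrB operation passed this descriptor uses at
    // most 4 threads, overriding the global setting
    GrB_Descriptor desc = NULL ;
    GrB_Descriptor_new (&desc) ;
    GxB_set (desc, GxB_NTHREADS, 4) ;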

I'm very close to starting work on the topics above, so any feedback now will be very useful. What I work out might be useful in a future spec as well. I must have this GxB_Context object if there will be any hope of an OpenMP parallel user application making good use of OpenMP parallelism within GraphBLAS, or making use of one or more GPUs.

One thing Tim M. didn't include in his list is how these resources will be used when aggressively exploiting non-blocking mode. I think that the context object will simplify this, but this is in the future (for me).

Say there are lots of user threads, and I'm trying to keep track of a computational DAG of GrB calls that I have not yet computed. I traverse the DAG, looking for things to optimize, execute, and so on. That is not trivial on its own, but imagine if this traversal also had to be thread-safe, with many user threads modifying the DAG at the same time. That's a nightmare; I would need a parallel asynchronous data structure and algorithms for it.

Instead, I would insist that any such optimization be done by a user thread with its own context object, and I would state that any matrices modified by that user thread go into a DAG within that specific context. If multiple user threads modified matrices within the same context at the same time, results would be undefined. Then I don't have to create a parallel data structure, and it will be a lot easier. Engaging a context would start a new, empty DAG of pending computations; disengaging it would imply a block, where all pending computations done while engaged must be finished.
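A sketch of that usage rule, using the draft method names (a fragment that assumes matrices A, B, C, and D already exist; the predefined semiring is just a placeholder, and the pending-DAG behavior in the comments is future work, not current behavior):

    // one user thread, one context: its GrB calls form an ordered list that
    // could be kept as a pending DAG inside this context (future behavior)
    GxB_Context Context = NULL ;
    GxB_Context_new (&Context) ;
    GxB_Context_engage (Context) ;

    GrB_mxm (C, NULL, NULL, GrB_PLUS_TIMES_SEMIRING_FP64, A, B, NULL) ;
    GrB_mxm (D, NULL, NULL, GrB_PLUS_TIMES_SEMIRING_FP64, C, B, NULL) ;

    // disengaging implies a block: all pending work in this context's DAG
    // must be finished here
    GxB_Context_disengage (Context) ;
    GxB_Context_free (&Context) ;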

I don't see the need to assign matrices / vectors / scalars to a particular context. GrB_wait is fine for synchronizing between multiple user threads.

I'm a long way from creating a DAG of GrB calls for doing kernel fusion, but I want to think ahead. With this context object, I can effectively treat any set of GrB calls as a single ordered list of calls, from a single user thread. Then I can rearrange them and fuse them more easily, with no worries about other user threads making changes while I'm analyzing the set of calls to GrB that I have so far.

DrTimothyAldenDavis commented 1 year ago

I got my draft GxB_Context object working and the results are great ... except that an older gcc compiler (v9.4.0, not very old) struggles with nested parallelism. Both gcc 12.2.0 and icx 2022 work great. See this discussion: https://github.com/GraphBLAS/graphblas-api-c/issues/74

and the results I've posted in my tagged version of SuiteSparse:GraphBLAS (v8.0.0.draft1):

https://github.com/DrTimothyAldenDavis/GraphBLAS/blob/v8.0.0.draft.1/Demo/Program/context_demo.c
https://github.com/DrTimothyAldenDavis/GraphBLAS/blob/v8.0.0.draft.1/Demo/context_demo.out

Using this GxB_Context object to get nested parallelism is very easy. It would be harder to do the same thing with descriptors, even if all GrB methods and operations took a descriptor.

Here's what the demo looks like, simplified a bit. The code builds nmat matrices with GrB_Matrix_build, from the same I,J,X (useless, of course, since each of the constructed matrices has the same content, but it's a simple test).

    #pragma omp parallel for num_threads (nouter) schedule (dynamic, 1)
    for (int k = 0 ; k < nmat ; k++)
    {
        // each user thread constructs its own context
        GxB_Context Context = NULL ;
        GxB_Context_new (&Context) ;
        GxB_Context_set (Context, GxB_NTHREADS, ninner) ;
        GxB_Context_engage (Context) ;

        // kth user thread builds kth matrix with ninner threads
        GrB_Matrix A = NULL ;
        GrB_Matrix_new (&A, GrB_FP64, n, n) ;
        GrB_Matrix_build (A, I, J, X, nvals, GrB_PLUS_FP64) ;

        // free the matrix just built
        GrB_Matrix_free (&A) ;

        // each user thread frees its own context
        GxB_Context_disengage (Context) ;
        GxB_Context_free (&Context) ;
    }

I can explain the notion of "engage" and "disengage" for the GxB_Context. Briefly, I keep a threadprivate object, called GB_CONTEXT_THREAD, here:

https://github.com/DrTimothyAldenDavis/GraphBLAS/blob/7752ae2d3d415c9f15b9d5c9d952b84282defe60/Source/GB_Context.c#L26

The "engage" operation simply does

GB_CONTEXT_THREAD = Context ;

and "disengage" does

GB_CONTEXT_THREAD = NULL ;

The user cannot access this threadprivate variable. There is a built-in world context, GxB_CONTEXT_WORLD, that is user visible and always non-NULL. Setting its nthreads controls the number of threads a GrB function uses when GB_CONTEXT_THREAD is NULL. If GB_CONTEXT_THREAD is not NULL, then its settings are used inside GrB instead.

SuiteSparse:GraphBLAS doesn't use nested parallelism itself, so all I have to do is use the nthreads setting from GxB_CONTEXT_WORLD or GB_CONTEXT_THREAD to control all my parallel regions, which all have a num_threads(...) clause.
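Put together, the internal pattern is roughly this (a sketch: GB_example_kernel is a made-up name, and the GxB_Context_get call stands in for however the internal code actually reads the setting):

    // the threadprivate variable, visible only inside the library
    static GxB_Context GB_CONTEXT_THREAD = NULL ;
    #pragma omp threadprivate (GB_CONTEXT_THREAD)

    static void GB_example_kernel (int64_t nwork)
    {
        // use the engaged context if there is one, else the world context
        GxB_Context Context =
            (GB_CONTEXT_THREAD != NULL) ? GB_CONTEXT_THREAD : GxB_CONTEXT_WORLD ;
        int32_t nthreads = 1 ;
        GxB_Context_get (Context, GxB_NTHREADS, &nthreads) ;

        // every internal parallel region has a num_threads clause like this
        #pragma omp parallel for num_threads (nthreads)
        for (int64_t k = 0 ; k < nwork ; k++)
        {
            // ... do the kth independent task ...
        }
    }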

DrTimothyAldenDavis commented 1 year ago

The GxB_Context also contains information on which GPU to use, so this same code could be used to compute on multiple GPUs, very easily (once we have a set of GPU kernels for GrB_Matrix_build, of course).