GraphBLAS / graphblas-api-c


GxB_Context is essential: telling each GrB* method/operation which/how many OpenMP threads and GPU(s) to use #74

Open DrTimothyAldenDavis opened 1 year ago

DrTimothyAldenDavis commented 1 year ago

We're making progress on the CUDA kernels for SuiteSparse:GraphBLAS.

The change to the C API will be very slight. See the new "cuda" branch of LAGraph, which is now synced with the latest LAGraph dev branch. It has a few changes to LAGr_TriangleCount, but those are just for testing.

A new mode:

GrB_init (GxB_NONBLOCKING_GPU) ;
GrB_init (GxB_BLOCKING_GPU) ;

For now, when the GPU is in use, SuiteSparse:GraphBLAS will ignore the malloc/calloc/realloc/free pointers given to it by GxB_init. Instead, it will always use rmm_wrap_malloc, rmm_wrap_calloc, etc., in the SuiteSparse/GraphBLAS/rmm_wrap folder. Those are C-callable wrappers for the RAPIDS Memory Manager. LAGr_Init will need to be given the rmm_wrap_* methods, however.

That's it, so far. But I need more.

We need more control over when the GPU is used, and which GPU is used. In particular, I have 4 GPUs on my system. If a user application spawns 4 user threads, each could use its own GPU. (For now, our CUDA kernels exploit only a single GPU).

So how do I tell GrB_mxm to use, say, "cuda device 3"? I can't use the descriptor, since not all methods have a descriptor: GrB_wait, GrB_Matrix_build, GrB_Matrix_dup, GrB_Matrix_nvals, and so on, must all be told which GPU to use. GrB_Matrix_nvals needs to know this because it may need to do GrB_wait, which may do a lot of work.

For GrB_wait, GrB_Matrix_dup, and GrB_Matrix_nvals, we could try to keep track of which GPU goes with a particular matrix. We may want to do that anyway, but it's awkward in general, since it doesn't extend well to GrB_mxm.

Enter the GxB_Context object, which I think solves this problem. Ideally this object should be passed to ALL GraphBLAS calls, for example:

GrB_mxm (context, C, M, accum, semiring, A, B, descriptor)

Something like that would be ideal, but adding a new parameter to each and every GrB and GxB call would be very disruptive. My solution is to add a GxB_Context object, and to place it in the user's thread-local-storage (threadprivate), just like what we did when GrB_error and GrB_wait had no input parameters in v1.0 of the C API.

// constructs a new context, also placing it into threadprivate storage
GxB_Context_new (&context) ; 

// To make all subsequent calls by this user thread use cuda device 3:
GxB_Context_set (context, GxB_GPU_DEVICE, 3) ; 

// To make all subsequent calls by this user thread use 4 openmp threads
GxB_Context_set (context, GxB_NTHREADS, 4) ;

// To free the user's threadprivate context object, either:
GxB_Context_free (&context) ;
GrB_free (&context) ;
// Freeing a context also sets the user's threadprivate context to NULL

And so on. I would still allow GxB_Global_Option_set to control the global number of OpenMP threads. So the precedence, for a particular call to a GrB or GxB method or operation, would be:

(1) if a GrB_Descriptor for a particular call has a non-default setting for the number of OpenMP threads, or for which (or how many) GPUs to use, then those settings are used. This is not applicable to GrB_wait, GrB_Matrix_dup, GrB_Matrix_build, etc., which sadly have no descriptor. If the GrB_Descriptor is present and has non-default settings, then the context (2) and the global settings (3) are ignored.

(2) if the threadprivate context object exists (is not NULL) and has non-default settings for the number of OpenMP threads, which or how many GPUs to use, etc., then those settings are used, and the global settings (3) are ignored.

(3) otherwise, use the global settings, which apply to all calls to GrB and GxB from all user threads. This defaults to omp_get_max_threads for OpenMP, and (perhaps) "cuda device zero" or "no cuda device will be used" as a default.

This GxB_Context would make a small change to the API, and it would not break backward compatibility with the v2.0 C API.

I may also want to add hints to tell a matrix or vector where to live, as in:

GxB_set (A, GxB_GPU_DEVICE, 3) ;

that would give GraphBLAS a hint that the GrB_Matrix A would like to live on GPU device 3. It would be just a hint, and I could ignore it if I like. I'm not sure how this works for GrB_mxm, where all 4 matrices might live on different GPUs. Perhaps GxB_set (A, GxB_GPU_DEVICE, 3) would just tell me to tell the RAPIDS memory manager where the data for this matrix should migrate to. I would not have to do anything else; just let RMM handle the rest.

I haven't implemented this GxB_Context object yet but I'm going to start work on it soon. It is absolutely essential to let GraphBLAS use the GPU, and it will greatly enhance the parallel-library composability of GraphBLAS when using OpenMP.

In the future, for a v3.0 C API, we could add the context object as the first (or last, as you like) parameter to ALL GrB functions, even the seemingly trivial ones like GrB_Type_new that don't seem like they need it. It would be odd to have it in GrB_Matrix_nvals but not in other GrB methods.

See also https://github.com/GraphBLAS/graphblas-api-c/issues/48 which is closely related to this issue (another way to solve it). I think issue #48 is not the best way to solve this problem, however. The descriptor should be specific to an individual call to GrB_*. Pre-defined descriptors are read-only and handy to use. Trying to fit the context into the descriptor makes this a little awkward.

In a future distributed-memory API, the context could contain things like an MPI communicator, but I haven't thought through how that would work.

DrTimothyAldenDavis commented 1 year ago

The C++ API committee may want to consider this as well. I'd like to hear their feedback.

rayegun commented 1 year ago

Sorry I deleted a previous comment. Can this not be a GrB_Global_set?

Is there an architectural reason we couldn't have user-thread-local fields in the global object?

DrTimothyAldenDavis commented 1 year ago

No, it can’t be done with a global setting. User thread 0 may want to use GPU 0, user thread 1 uses GPU 3, user thread 2 uses 4 omp threads, user thread 3 uses 2 omp threads, etc.

Can’t be done with a descriptor. GrB Methods do not have a descriptor but they must have this Context.

Must be thread local.


rayegun commented 1 year ago

We can add a Descriptor parameter to each method in 3.0, and that is, I believe, vaguely planned for that release. Predefined descriptors are indeed something you lose, but if you are manually specifying the number of threads GraphBLAS should use in a particular call, I think you can expect some added verbosity.

DrTimothyAldenDavis commented 1 year ago

Relying solely on the descriptor would be awkward, and it would make composing libraries difficult. If an LAGraph method uses 4 different descriptors, for example, each one would have to be revised to add 'use GPU 3' (or whatever).

BenBrock commented 1 year ago

For C++, we will likely want to integrate with the senders and receivers API, assuming that eventually makes its way into the standard. That would formally introduce the concept of a scheduler into C++, which would abstract a hardware resource like a GPU, threadpool, etc. The scheduler would essentially work just like the context in your proposed API (you could create a custom GraphBLAS context that functions as a scheduler, but you could also potentially interact with other independent schedulers).

DrTimothyAldenDavis commented 1 year ago

I have a rough draft of my GxB_Context object working. I've gotten some good performance results with gcc 12.2.0 and icx 2022.0.1, using the GxB_Context object to support nested parallelism. With gcc 9.4.0, nested parallelism is very slow. See this demo:

https://github.com/DrTimothyAldenDavis/GraphBLAS/blob/v8.0.0.draft.1/Demo/Program/context_demo.c

and the results and discussion here:

https://github.com/DrTimothyAldenDavis/GraphBLAS/blob/v8.0.0.draft1/Demo/context_demo.out

What this GxB_Context allows is a simple way to do nested parallelism, where the user can do:

#pragma omp parallel for num_threads(nouter)
for ( ... )
{
        // create a GxB_Context and set its # of threads to ninner
        // call lots of GrB methods
}

and what you get is something like this, and you get good performance with simple code:

#pragma omp parallel for num_threads(nouter)
for ( ... )
{
        // inside GraphBLAS, you get loops like this
        #pragma omp parallel for num_threads (ninner)
        for ( ... ) { /* do stuff */ }
        #pragma omp parallel for num_threads (ninner)
        for ( ... ) { /* do more stuff */ }
        ...
}

For this demo, using the descriptor would be impossible since GrB_Matrix_build has no descriptor. If there were lots of calls to many GrB methods, with lots of descriptors, then each descriptor would have to be linked to the GxB_Context of the user thread. That would be awkward. In context_demo.c, I could easily call LAGraph unmodified (except I would have to be careful to ensure the LAGraph algorithms do not try to call LAGraph_set_nthreads themselves...).

DrTimothyAldenDavis commented 1 year ago

See also https://github.com/GraphBLAS/graphblas-api-cpp/issues/14