GraphBLAS / graphblas-api-c

Other
7 stars 3 forks source link

Consider adding descriptor to all methods #48

Closed DrTimothyAldenDavis closed 1 year ago

DrTimothyAldenDavis commented 2 years ago

I would like to suggest adding the descriptor parameter to all methods, even GrB_Matrix_nvals. I use the descriptor to control the # of threads the method can use, with a GxB_NTHREADS setting. The default is the use the global setting, which I can also modify, and the default for that is the result from omp_get_max_threads. This control is essential for composing parallel libraries together.

There is no such per-call thread control in the BLAS, and this is an incredible headache when trying to call the parallel BLAS from a parallel application. If two user threads both call GraphBLAS at the same time, it is essential to let those user threads tell me how many threads I can use inside GraphBLAS. All my OpenMP #pragmas use the num_threads(nthreads) parameter.

If a method has no descriptor, I can only use the global setting, but two different user threads might want to tell me to use different # of threads in each call.

The GxB_NTHREADS setting tells me, per-call, the max # of threads I can use. I sometimes use less, if the problem is small.

DrTimothyAldenDavis commented 2 years ago

Will a future "Context" mechanism handle this problem?

tgmattso commented 2 years ago

Frankly, I like the idea of an expanded descriptor to handle this situation. A context mechanism could handle this as well.

This problem is really ugly. It is larger than GraphBLAS and goes to the core of parallel programming systems in general. For years the MKL group at Intel called omp_in_parallel() to test if their library function was being called from a parallel region in OpenMP. If so, they set the number of threads to 1 just to avoid the sort of problem you are talking about. That's an ugly solution.

You can at the application level define nested parallel regions and control the number of threads at each level in the nest. That probably won't help you, however, in writing a library.

michelp commented 2 years ago

When this issue was created it struck me how similar adding a descriptor to every operation was essentially a context.

I like this approach because it's very easy for language binders like @eriknw , @Wimmerer and I can adapt our existing descriptor patterns pretty easily to cover all operations without having to create a whole new Context type and introduce it.

michelp commented 2 years ago

In addition if we add a descriptor to every operation, this would be an entirely b/w compatible change for Python. Not for C but it would be the minimum change for C vs adding a new Context. Given Julia also has optional keyword arguments I bet it would be completely b/w compatible for @Wimmerer as well.

DrTimothyAldenDavis commented 2 years ago

I have two GxB extensions to control threads in the descriptor, and also a global setting.

GxB_set and GxB_get are _Generic methods that can set options in the global state, in a matrix/vector, or in a descriptor.

GxB_set (GxB_NTHEADS, 5) ; 
GxB_set (GxB_CHUNK, 100000) ; 

The above settings tell all graphblas methods/operations (even those without a descriptor) to use at most 5 threads, or if "work" is the amount of work to do, then use no more than max(1, floor (work/1000000)) threads.

These settings can be done in a descriptor, so the global state (used by all user threads) is not changed:

GxB_set (desc, GxB_NTHREADS, 5)
GxB_set (desc, GxB_CHUNK, 1000000)

for example. The descriptor level setting is what is needed if two user threads each call GraphBLAS in parallel. Say the total core count is 8, then one user thread could ask GraphBLAS to use 3 threads and the other 5 threads, and if nested parallelism is enabled in OpenMP, no over-subscription happens (which kills performance). The global setting cannot be used for this.

rayegun commented 2 years ago

This is indeed backwards compatible for me. Although I try to minimize the use of the descriptor (preferring things like lazy transpose operators), there is a kwarg for it, and this change would be non-breaking.

DrTimothyAldenDavis commented 2 years ago

Tim M. writes:

_For years the MKL group at Intel called omp_inparallel() to test if their library function was being called from a parallel region in OpenMP. If so, they set the number of threads to 1 just to avoid the sort of problem you are talking about. That's an ugly solution.

Yes, this is very ugly. They are forced to do this because the BLAS and LAPACK do not provide any API for controlling the number of threads.

It's not just ugly, it has a horrific impact on performance of my codes. I write multifrontal sparse solvers (LU, Cholesky, QR) which create a tree of frontral matrices. The tree can be very irregular. Each frontal matrix is a dense rectangular or square matrix that requires partial factorization (say its m by n and the first k < min(m,n) pivots are factorized, for some specific k, resulting a Schur complement of size (m-k)-by-(n-k).

The fronts start teeny at the bottom of the tree, and get bigger as you go up. In the bottom, I need single-threaded BLAS calls on each front, and parallelism across the fronts. In the middle, I need both tree parallelism and parallelism inside the BLAS; I need both. At the top, I have no tree parallelism but just need all threads to be used inside the BLAS.

This is a mess, because in the middle of the tree I cannot tell the stinking DGEMM, "Hey, use just 2 threads here; my other threads are busy with the rest of the tree". I can't do this because DGEMM has no parameter to control the # of threads used.

That is a historical mistake, which should not be repeated in GraphBLAS.

I work around this in GraphBLAS by allowing the # of threads to be controlled by the descriptor.

... except that not every GraphBLAS method has a descriptor. This very discouraging. We should not repeat the historical mistakes of the BLAS and LAPACK. I was assuming the GrB_Context would fix this in v2.0 but this hasn't happened. So my proposed solution is that all GrB* methods need a descriptor, or nearly all of them. Even GrB_Matrix_nvals, since it may need to call GrB_wait on the matrix first, and GrB_wait is a parallel operation.

Some methods could skip the descriptor, such as GrB_Matrix_new, GrB_Typenew, GrB*free, and so on, although it might be simplest just to allow a descriptor for nearly all GraphBLAS functions. GrB_init and GrB_finalize wouldn't be able to use a descriptor, and they don't need one anyway.

Yes, it adds, ", NULL" at the end of most calls to GrB_whatever (to use the default # of threads). The cost of leaving this descriptor out is to continue the same mangled hack that the Intel MKL team uses for the BLAS, which is worse.

DrTimothyAldenDavis commented 1 year ago

I think a better solution to this problem is the GxB_Context (see #74). I'd like to close this issue.

tgmattso commented 1 year ago

Yes close this. I need to write up a detailed proposal for the context object and create a new issue for it. I need to do this soon.

--tim

From: Tim Davis @.> Reply-To: GraphBLAS/graphblas-api-c @.> Date: Sunday, April 23, 2023 at 5:22 AM To: GraphBLAS/graphblas-api-c @.> Cc: Tim Mattson @.>, Comment @.***> Subject: Re: [GraphBLAS/graphblas-api-c] Consider adding descriptor to all methods (#48)

I think a better solution to this problem is the GxB_Context (see #74https://github.com/GraphBLAS/graphblas-api-c/issues/74). I'd like to close this issue.

— Reply to this email directly, view it on GitHubhttps://github.com/GraphBLAS/graphblas-api-c/issues/48#issuecomment-1519054822, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AATVME3GNOJJWS5P7IRIUNLXCUNIZANCNFSM5FC2C2FA. You are receiving this because you commented.Message ID: @.***>