Multithreaded OpenBLAS heuristics

PallHaraldsson commented 1 year ago

In the docs:

When you start Julia in multithreaded mode, i.e. julia -tX or JULIA_NUM_THREADS=X, it is generally recommended to set OPENBLAS_NUM_THREADS=1 or, equivalently, BLAS.set_num_threads(1).

That's a heuristic, likely not always right, but I suggest either:

If this is true, then do it automatically for the user, and have a way to get the old number of threads that the current implementation would have given.
It's plausible it works better, because the other threads are simply busy doing other non-BLAS work. Is there a good middle-ground then, use half as many threads?
It's also plausible that the recommendation simply makes things slower, the other threads could be doing no work even, then we have a dilemma. I'm not sure but are calls to OpenBLAS serialized? It seems they should be and would that also help, and then not have to change any settings, or adjust less, for multi-threaded?

ctkelley commented 1 year ago

I happily use the max number of threads for NUM_THREADS and set BLAS_NUM_THREADS to max as well. I pay attention and avoid doing crazy stuff like putting BLAS calls in a loop I've threaded with floops for example.

I have never been punished for this, but certainly could be.

elextr commented 1 year ago

It should depend on the application.

If lots of compute threads make lots of blas calls then they will often be blocked on blas and their hardware will be available to blas, so in that case it works well for the number of compute threads to be similar to the number of blas threads to be similar to the number of hardware threads. Manually dividing the hardware is likely to starve both compute and blas.

But if only a few compute threads are making blas calls and the rest are crunching on other things then their hardware won't be available to blas, so if the OS scheduler does a poor job of resource sharing then manually setting compute threads and blas threads to half the hardware threads might help.

But picking the most common situation to make it the default is likely a :bike: :house:

PallHaraldsson commented 1 year ago

I'm thinking what should be the sensible defaults for most users, that don't want to, or at least don't, think about performance tuning. I was at first just suggesting doing automatically what's already suggested in the docs. BUT note, I feel like there should be a way to undo it in the hands of the programmer.

After thinking more about this, could the default be to split the difference?

Let's say you have 16 cores like I do, then I think I get 16 for BLAS, and only 1 thread (the default), or 16 threads with -t auto, and and 1 for BLAS (with documented setting that could become the default). Both are 16 1 = 1 16 = 16. But so is 4*4.

Should maybe both be set to the ceil(sqrt(nr_cores))? What are the bad effects of that? You get a a 4th of the threads and assuming linearally scalable you only a 4th of the performance, but that's not always a good assumption anyway so more, and with those threads using BLAS, all cores are used.

[Also 2 for BLAS and 8 for threads would seemingly be good.]

If lots of compute threads make lots of blas calls then they will often be blocked on blas

I'm not sure if you're confirming, but do you mean if many threads what to do BLAS, then all but one gets blocked, i.e. they will be serialised by a queue? Or just that you would try to oversubscribe try to use 16*16=256 threads on only 16 cores, and that's not very effective (for cache reasons)?

elextr commented 1 year ago

but do you mean if many threads what to do BLAS, then all but one gets blocked, i.e. they will be serialised by a queue

I didn't say they are explicitly serialised, or that only one runs in blas, thats Openblas internals that I didn't find in my quick dig.

I was taking a more helicopter view, if many threads make blas calls and blas calls are generally slow then those calls demand a throughput that blas can't deliver, so some of the threads will be delayed while their blas operation happens. And whilst they are delayed the hardware that was executing the calling thread is available to execute something else, like a blas threadpool thread.

PallHaraldsson commented 1 year ago

I think, while not sure, that BLAS calls should be serialized, at least for larger matrices (and maybe only for matrix multiply, I think it's the main operation we care about, I'm not sure for what else the below argument would apply for):

Matrix multiply is O(n^3) in number of operations, but it's O(n^2) in number of memory traffic, if and only if the matrix fits in [L3] cache. So even one other BLAS, or any concurrent code could ruin that.

This is also, of course, an argument to make BLAS finish as quickly as possible, i.e. the most, or optional number of threads, allocated for it. Would it be plausible, and a good idea, to when BLAS is run for a large matrix, that you suspend half your threads, and give them to BLAS while it runs?

elextr commented 1 year ago

What I should have made explicit at the end of my last post was "... like a blas threadpool thread, but only if the blas threadpool size is big enough to use it". And then add, "but not so high that it steals all hardware from other work the Julia threads need to do".

And Julia of course makes it very easy to have multiple threads calling blas, inside @threads for just do a calculation that buried in a library calls blas. The user may not even be aware they are using blas. The best allocation of hardware is going to depend on the ratio of blas vs non-blas work in the loop body, if there is significant non-blas work but blas hogs all the hardware the julia threads are prevented from calculating the data for the next blas call in the loop and the benefits of the parallelism is lost.

Then again the default in Julia is single threaded, so blas calls are naturally serialised, and so as you say blas has maximum hardware available to it, so long as the number of blas threads is high enough to use it.

Also you make a good point that memory cache effects can have an impact, oversubscribing, more threads that want to run concurrently than hardware available, means at least some threads will get kicked off the hardware and potentially have to reload the cache from memory when they return. That can be terribly expensive.

I'm sorry I don't have a simple solution, even after several decades of programming multi-threaded systems, the answer always come down to "it depends". For sophisticated users looking for maximum performance "benchmark it" is the answer, but the best solution for the default might be political not technical (support the most common case).

JuliaLang / LinearAlgebra.jl

Multithreaded OpenBLAS heuristics #1028