chapel-lang / chapel

a Productive Parallel Programming Language
https://chapel-lang.org

Reduce interference between Qthreads and OpenMP #9882

Open · ronawho opened this issue 6 years ago

ronawho commented 6 years ago

Qthreads and most OpenMP implementations pin their pthreads to specific cores for better affinity and performance. When both runtimes are active in the same process, their pinned threads end up competing for the same cores, causing enough interference that both OpenMP and Qthreads performance suffer.

Typically we run into this when calling out to BLAS/LAPACK or some other external library that uses OpenMP under the covers. We usually recommend disabling pinning and reducing Qthreads spin-waiting by setting:

export QT_AFFINITY=no
export QT_SPINCOUNT=300

but this is far from ideal, and it's not clear that it's always a good idea. I'm also not really sure what the right solution is here, so this issue is mostly to serve as a place to collate information and start investigating.
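For concreteness, here is a minimal sketch of the kind of code where this comes up, assuming Chapel's BLAS package module is wired to an OpenMP-backed BLAS such as OpenBLAS (the sizes and scalars are just for illustration):

use BLAS;

config const n = 2048;
var A, B, C: [1..n, 1..n] real;

// gemm hands the multiplication off to the external BLAS library, whose
// pinned OpenMP threads then contend with Qthreads' pinned workers
gemm(A, B, C, 1.0, 0.0);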

Related:

ronawho commented 5 years ago

Prototype code to enable dynamic unpinning/pinning of qthread threads is at https://github.com/ronawho/chapel/tree/qthread-dynamic-pinning. With that you can do something like:

// unpin the Qthreads workers so the external library's OpenMP threads
// aren't forced onto the same cores
chpl_disableAffinity();
external_multithreaded_c_code();
// re-pin the Qthreads workers once the external call has finished
chpl_enableAffinity();
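In a real program the external call would be declared through Chapel's C interoperability, e.g. (a sketch; chpl_disableAffinity()/chpl_enableAffinity() exist only on that prototype branch, and the header/library names here are hypothetical):

// hypothetical header and link flag for the OpenMP-parallel C library
require "omp_heavy_lib.h", "-lomp_heavy_lib";

// hypothetical OpenMP-parallel C routine called from Chapel
extern proc external_multithreaded_c_code();
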
bradcray commented 3 years ago

@ronawho: I was talking to a user today about this issue and was unable to explain very well why the two models interfered with one another in situations where they aren't actually running concurrently. Thinking about it more now that I'm not on the spot, is it simply because Chapel tends to busy-wait too aggressively? Following some of the links in the OP to this comment makes me think yes?

[edit: though you also mention the pinning... how is the pinning problematic if there was no busy-waiting at all?]

ronawho commented 3 years ago

Last time I looked at this was for https://github.com/npadmana/DistributedFFT/issues/3. My memory is hazy, but I'm pretty sure just limiting spin-waiting wasn't enough: something about Chapel and OpenMP both being pinned hurt performance even when our threads were asleep waiting on a condition variable. My guess at the time was that pinning affects the kernel's scheduling priority, but I never got to the bottom of it.

Assuming it is pinning-related, and not just that I missed some spin-waiting somewhere, I think we could update qthreads to unpin threads automatically when they go to sleep waiting for more work.

bradcray commented 2 years ago

@ronawho: I wanted to check before making a potentially incorrect change. Stumbling across https://github.com/chapel-lang/chapel/issues/11392 and looking at the source code for SPINCOUNT, I'm thinking we should update the OP here to suggest using CHPL_RT_OVERSUBSCRIBED instead. Does that sound right?

MarjanAsgari commented 2 years ago

Hi Everyone,

I want to share my experience with the QT_AFFINITY and QT_SPINCOUNT variables here. In my Chapel program, I had to run a complex, multi-threaded hydrological model (one simulation per locale) using spawnShell() and shell commands. The problem I was observing was the huge run time of each simulation: a single simulation on my own computer, without parallelization, was taking 8 minutes, while on each locale in Chapel it was taking 60 minutes. This was a big problem for parallelizing the simulations of such complex external models using Chapel, since, for example, my model needs at least 500 simulations to be calibrated.

I learned that when Chapel runs, by default it takes all of the computer's cores under its control to increase performance. This is wonderful, since it speeds up Chapel's internal tasks, but only if there is no need to run external models inside the program. If Chapel has taken all the cores for its own tasks, then when we open a subprocess inside it to run the hydro model, the computer has no choice but to run the external model on the cores Chapel currently occupies (because there are no idle cores). That was the reason the running time was so huge.
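A minimal sketch of that pattern, using spawnShell() from Chapel's Subprocess module (named Spawn in older releases; the model's command line here is made up for illustration):

use Subprocess;

coforall loc in Locales do on loc {
  // launch the external hydrological model as a shell subprocess; its own
  // threads end up sharing cores with Chapel's pinned Qthreads workers
  var model = spawnShell("./run_hydro_model --config sim.cfg");
  model.wait();  // block until this locale's simulation completes
}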

Then I contacted the Chapel research team, since after researching a lot I had found that CHPL_RT_NUM_THREADS_PER_LOCALE could force Chapel to leave some cores free for the OS. So I thought that, on computers with 8 cores, I could force Chapel to use 3 and leave 5 free; then, when the hydro model runs, it would use those 5 cores and performance would increase. But that alone was not working for my model: based on my communication with the Chapel team, since my model is multi-threaded I needed to set two other environment variables as well. So I had to set three variables:

export CHPL_RT_NUM_THREADS_PER_LOCALE=3
export QT_AFFINITY=no
export QT_SPINCOUNT=300

With those variables set, the running time of the hydrological model decreased from around 64 minutes to 13 minutes, and I saw no performance degradation in my own Chapel program. Note: in my case, I could remove QT_SPINCOUNT and still get the result I wanted.

In addition, I tried the following variables instead of the QT_ ones to see whether they would have the same impact, and they did! So the following two commands can be used instead of the variables above as well:

export CHPL_RT_NUM_THREADS_PER_LOCALE=3
export CHPL_RT_OVERSUBSCRIBED=yes
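As a quick sanity check (a small sketch, not specific to my model), here.maxTaskPar reports the amount of parallelism the Chapel runtime will actually use on a locale, so with the settings above it should drop to 3:

// prints the number of tasks Chapel will run in parallel on this locale
writeln("usable parallelism on ", here.name, ": ", here.maxTaskPar);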

Many thanks to Chapel support team!

ronawho commented 2 years ago

> @ronawho: I wanted to check before making a potentially incorrect change. Stumbling across https://github.com/chapel-lang/chapel/issues/11392 and looking at the source code for SPINCOUNT, I'm thinking we should update the OP here to suggest using CHPL_RT_OVERSUBSCRIBED instead. Does that sound right?

Hmm, it's true that today CHPL_RT_OVERSUBSCRIBED largely just lowers the qthreads spincount and disables pinning, but it may do other things in the future that could hurt performance in other ways, so I'd probably just leave the current suggestion in place for now. That said, given that the current suggestion uses env vars that aren't meant to be user-facing, maybe we'd want Chapel-level replacements like:

CHPL_RT_PROC_BIND
CHPL_RT_SPINCOUNT

or something like that (these names are based off the OMP equivalents, i.e., OMP_PROC_BIND and, for the spincount, GOMP_SPINCOUNT).