OpenMathLib / OpenBLAS

OpenBLAS is an optimized BLAS library based on GotoBLAS2 1.13 BSD version.
http://www.openblas.net
BSD 3-Clause "New" or "Revised" License

OpenMP thread placement and affinity #1653

Open brianborchers opened 6 years ago

brianborchers commented 6 years ago

In my testing, on a 4-core two-way hyperthreaded Xeon-W Skylake machine, I've found that the following environment variable settings produce consistently high performance:

OMP_NUM_THREADS=4
OMP_PLACES="{0,1,2,3}"
OMP_PROC_BIND=spread

This tells the OpenMP library to allow up to 4 threads, restricts it to starting threads on cores 0-3 (so it can't use the hyperthreaded siblings 4-7), and spreads the threads out over the cores as they're started. I believe it also implies thread affinity, so that threads won't move between cores.

I find that if I don't set these environment variables, the performance is generally worse and can also be much more variable. For example, on a simple test of matrix multiplication with only OMP_NUM_THREADS=4 set, the run times varied from 6.23 to 8.94 seconds over four runs. After also setting OMP_PROC_BIND and OMP_PLACES, the run times varied from 5.44 to 5.49 seconds over four runs.
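For reference, a minimal timing harness along these lines might look like the following sketch; the matrix size n = 8000 and the use of cblas_dgemm are illustrative assumptions rather than the exact test that produced the numbers above.

```c
/* Minimal dgemm timing sketch (n and the specific call are assumed).
 * Build e.g. with: gcc -O2 -fopenmp dgemm_time.c -lopenblas */
#include <stdio.h>
#include <stdlib.h>
#include <cblas.h>   /* CBLAS interface shipped with OpenBLAS */
#include <omp.h>

int main(void) {
    const size_t n = 8000;                  /* assumed problem size */
    double *a = malloc(n * n * sizeof(double));
    double *b = malloc(n * n * sizeof(double));
    double *c = malloc(n * n * sizeof(double));
    if (!a || !b || !c) return 1;
    for (size_t i = 0; i < n * n; i++) { a[i] = 1.0; b[i] = 2.0; c[i] = 0.0; }

    double t0 = omp_get_wtime();
    /* C = 1.0 * A * B + 0.0 * C, all matrices n x n, row-major */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                (int)n, (int)n, (int)n, 1.0, a, (int)n, b, (int)n, 0.0, c, (int)n);
    double t1 = omp_get_wtime();
    printf("dgemm wall time: %.2f s\n", t1 - t0);

    free(a); free(b); free(c);
    return 0;
}
```

Running the same binary under the different OMP_* settings makes the run-to-run variation easy to compare.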

Is there any more general advice on how to control thread placement and affinity for the best performance with OpenBLAS? What about systems with more cores and multiple sockets? Could information about this be added to the documentation?

brada4 commented 6 years ago

There is no one-size-fits-all guide... In effect, you did the equivalent of disabling hyperthreading using environment variables.

martin-frbg commented 6 years ago

@brada4 can you then suggest a better solution for this case? E.g. would OMP_NUM_THREADS=8 work with the given definition of OMP_PLACES to add one hyperthread on each core, or are things not that simple? I agree that documentation on this (in the github wiki or elsewhere) would be helpful.

brianborchers commented 6 years ago

I believe that if I set OMP_NUM_THREADS=8 with OMP_PLACES="{0,1,2,3}", it would run 2 threads on each core with no hyperthreading, since CPUs 4-7 (the hyperthreaded siblings of cores 0-3) are excluded from the place list.

brianborchers commented 6 years ago

I'll add that MKL seems to get this right without any intervention from the user.

martin-frbg commented 6 years ago

I'll comment that it would be kind of sad if it did not, with a team of paid professionals behind it.

martin-frbg commented 6 years ago

Going by http://forum.openmp.org/forum/viewtopic.php?f=3&t=1731, you could try setting (only) OMP_DISPLAY_ENV=TRUE to see what the libgomp default behaviour is, and OMP_PLACES=cores to get (probably) the same behaviour as from your explicit list of cores. (And does MKL make use of hyperthreading at all on your system?)
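One way to see what a given OMP_PLACES setting actually resolves to is to dump the place list at runtime. The following is a small sketch using the standard OpenMP 4.0 place-query routines; nothing OpenBLAS-specific is assumed here.

```c
/* Dump the OpenMP place list, e.g. to compare OMP_PLACES=cores
 * against an explicit list such as OMP_PLACES="{0,1,2,3}".
 * Build with: gcc -fopenmp places.c */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(void) {
    int nplaces = omp_get_num_places();
    printf("number of places: %d\n", nplaces);
    for (int p = 0; p < nplaces; p++) {
        int nprocs = omp_get_place_num_procs(p);
        int *ids = malloc(nprocs * sizeof(int));
        if (!ids) continue;
        omp_get_place_proc_ids(p, ids);   /* fills ids with the CPU numbers in place p */
        printf("place %d:", p);
        for (int i = 0; i < nprocs; i++)
            printf(" %d", ids[i]);
        printf("\n");
        free(ids);
    }
    return 0;
}
```

On a 4-core/8-thread machine this shows whether "cores" ends up as four places containing two hardware threads each or as eight single-CPU places.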

brianborchers commented 6 years ago

Interestingly, I don't see MKL using more than 4 threads on this system, even on fairly large tasks and even with OMP_PLACES left unset and OMP_NUM_THREADS set to 8.

The OMP default environment is:

OPENMP DISPLAY ENVIRONMENT BEGIN
  _OPENMP = '201511'
  OMP_DYNAMIC = 'FALSE'
  OMP_NESTED = 'FALSE'
  OMP_NUM_THREADS = '8'
  OMP_SCHEDULE = 'DYNAMIC'
  OMP_PROC_BIND = 'FALSE'
  OMP_PLACES = ''
  OMP_STACKSIZE = '0'
  OMP_WAIT_POLICY = 'PASSIVE'
  OMP_THREAD_LIMIT = '4294967295'
  OMP_MAX_ACTIVE_LEVELS = '2147483647'
  OMP_CANCELLATION = 'FALSE'
  OMP_DEFAULT_DEVICE = '0'
  OMP_MAX_TASK_PRIORITY = '0'
OPENMP DISPLAY ENVIRONMENT END

I tried OMP_PLACES=cores, but with OMP_NUM_THREADS unset, it used 8 threads and performed poorly; it appears that "cores" includes all 8 of the virtual cores that GOMP sees.

I also tried OMP_PLACES=cores with OMP_NUM_THREADS=4. This ran at about the same speed as specifying OMP_PLACES="{0,1,2,3}", but htop showed that it was shifting work from (e.g.) core 1 to core 5 (core 1's hyperthreaded sibling) and back.

I also tried using OMP_PLACES="{0,1,2,3}" and OMP_NUM_THREADS=8. With this configuration the performance was poor and htop showed that only cores 0-3 were in use even though there were 8 threads. Thus it wasn't using hyperthreading.

I conclude that:

- OMP_PLACES="{0,1,2,3}" is effective at stopping the system from using hyperthreading.
- OMP_PLACES=cores doesn't stop hyperthreading.
- OMP_NUM_THREADS=4 is consistently better than 8.
- OMP_PROC_BIND=spread seems to keep the threads pinned to the cores they started on (see the sketch after this list).
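A small check along these lines, assuming Linux/glibc for sched_getcpu(), is one way to watch where each thread actually runs without staring at htop; it is only a sketch, not part of OpenBLAS.

```c
/* Per-thread placement check (assumes Linux/glibc for sched_getcpu()).
 * Each OpenMP thread reports its place and the CPU it is currently on a few
 * times, so migration between hyperthread siblings becomes visible.
 * Build with: gcc -fopenmp where.c */
#define _GNU_SOURCE
#include <sched.h>    /* sched_getcpu() */
#include <stdio.h>
#include <unistd.h>   /* sleep() */
#include <omp.h>

int main(void) {
    #pragma omp parallel
    {
        for (int iter = 0; iter < 3; iter++) {
            printf("thread %d: place %d, cpu %d\n",
                   omp_get_thread_num(), omp_get_place_num(), sched_getcpu());
            sleep(1);   /* give the scheduler a chance to move the thread */
        }
    }
    return 0;
}
```

Run it under the same OMP_NUM_THREADS / OMP_PLACES / OMP_PROC_BIND combinations as above; with the explicit single-CPU place list each thread should keep reporting the same CPU, while with OMP_PLACES=cores it may flip between the two siblings in its place.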

brada4 commented 6 years ago

> it was shifting work from (e.g.) core 1 to core 5 (core 1's hyperthreaded sibling) and back.

Is there any performance impact? In principle the cache is shared, so it should be close to none...

Can you modify the Intel performance/energy bias register and confirm that, in general, one hyperthread out of two gives the same or better numerical performance as using both? It is a documented limitation of the old hyperthreaded Atom, for example.

brianborchers commented 6 years ago

There was no apparent performance impact from the switching between sibling hyperthreaded virtual cores, so that probably isn't hurting; I agree that there's no theoretical reason it should hurt much, since the cache is shared. However, the OS does have to do some bookkeeping to move a thread between cores, even if the move is just to a sibling hyperthreaded virtual core.

I don't know what the "intel perf bias register" is.

brada4 commented 6 years ago

This one: man 8 x86_energy_perf_policy. It is specific to Intel processors; it reprograms the processor for speed versus power efficiency, but it also levels the resources available to hyperthreaded cores.

The accounting done for such a process move is minimal, because all of the memory context is "hot" in the shared L3 cache; it is not really much more work than the normal context switches for timer/statistics interrupts.