OpenMathLib / OpenBLAS

OpenBLAS is an optimized BLAS library based on GotoBLAS2 1.13 BSD version.
http://www.openblas.net
BSD 3-Clause "New" or "Revised" License

Performance issue with many cores #1881

Closed jeremiedbb closed 7 months ago

jeremiedbb commented 5 years ago

Running the following code on a machine with many cores gives worse performance than running with a limited number of threads.

import numpy as np
X = np.random.random_sample((2048, 2048))
%time np.linalg.svd(X)

I built numpy against OpenBLAS master (I observe the same thing with the OpenBLAS shipped with numpy from conda-forge).

Here are the timings I get for the code above. The machine has 88 logical cores (44 physical + hyperthreading).

Below is a comparison with MKL.

This brings up another issue: on CPUs with hyperthreading, OpenBLAS uses the maximum possible number of threads, which is twice the number of physical cores. It should use only as many threads as there are physical cores, because BLAS operations don't really benefit from hyperthreading (hyperthreading is meant to interleave tasks of a different nature, which is the opposite of a BLAS workload).

martin-frbg commented 5 years ago

Will need to see what BLAS calls np.linalg.svd makes... Without promising any improvement, could you please try the OpenBLAS "develop" branch? ("master" is currently stuck at 0.2.20, so it may actually be older than what ships with numpy; all subsequent releases have been made from the 0.3 branch.) Whether hyperthreading helps or not probably depends on the individual hardware and problem size - when I brought up the topic in another issue a while ago, a contributor from Intel argued against leaving the HT pseudocores unused on reasonably current hardware. The actual issue here could be that OpenBLAS creates too many threads for some subtask with a small matrix size - there have been a few changes in recent versions that address this in individual BLAS functions.

jeremiedbb commented 5 years ago

could you please try the OpenBLAS "develop" branch?

Sorry, I said master, but it was actually against the develop branch. I didn't notice the default branch wasn't named master :)

a contributor from Intel argued against leaving the HT pseudocores unused on reasonably current hardware

It's strange because MKL never uses those :)

jeremiedbb commented 5 years ago

It seems that svd calls xGESVD or xGESDD, with x = s or d.
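
For reference, this is what a direct call to the same routine looks like from C via LAPACKE (a minimal sketch, assuming OpenBLAS was built with its bundled LAPACKE interface; the matrix contents are arbitrary):

/* Minimal sketch: call DGESDD directly via LAPACKE, the same LAPACK
 * routine numpy's svd ends up in. Assumes OpenBLAS built with LAPACKE. */
#include <stdlib.h>
#include <lapacke.h>

int main(void) {
    const lapack_int n = 2048;
    double *a  = malloc(sizeof(double) * n * n);   /* input matrix */
    double *s  = malloc(sizeof(double) * n);       /* singular values */
    double *u  = malloc(sizeof(double) * n * n);
    double *vt = malloc(sizeof(double) * n * n);
    for (lapack_int i = 0; i < n * n; i++)
        a[i] = rand() / (double)RAND_MAX;
    /* 'A': compute all columns of U and all rows of V^T */
    lapack_int info = LAPACKE_dgesdd(LAPACK_ROW_MAJOR, 'A', n, n,
                                     a, n, s, u, n, vt, n);
    free(a); free(s); free(u); free(vt);
    return (int)info;
}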

martin-frbg commented 5 years ago

Thanks. Strictly speaking this is LAPACK (which for OpenBLAS means the non-parallelized, non-optimized "netlib" reference implementation); the task now is to follow its call tree and see which BLAS functions get involved.

martin-frbg commented 5 years ago

From http://www.netlib.org/lapack/explore-html/d1/d7e/group__double_g_esing_ga84fdf22a62b12ff364621e4713ce02f2.html this seems to be mostly GEMM/GEMV, SWAP and TRMM/TRMV, which I think had already been checked for unnecessary thread creation. (That work was mostly about avoiding senseless multithreading for small inputs, though - most if not all functions still take an all-or-nothing approach, while MKL has been claimed to limit the number of active threads according to problem size.)

jeremiedbb commented 5 years ago

Thanks. I guess #678 addresses the same concern, but it still seems to be at the design-discussion stage :(

About hyperthreading: I'd strongly argue against using the pseudocores. On every machine I tested, there was no improvement from using them; limiting the number of threads to the physical cores only was always better, or at worst equal. And for the same problems, MKL never used the pseudocores.

Obviously this requires benchmarks, but I think it could be worth it.

brada4 commented 5 years ago

Before L1TF one could tune x86 performance settings to get more out of a single pseudocore while the other sits idle; now hyperthreads are best left disabled altogether, for both security and performance.

brada4 commented 5 years ago

I will quickly recheck the threading. @jeremiedbb what is your exact CPU (or CPUs)? Does it help to limit threads to those in one NUMA node, as shown by numactl -H?

brada4 commented 5 years ago

(0.3.3 pthread) There is the lock race around malloc, taking 0/0.2/2/6% of the time at 1/2/4/8(HT) threads (which should be gone in the develop branch after 0.3.3); otherwise no visible spills, though it should indeed grow much worse with lots of cores. EDIT: indeed, up to 200x more time is consumed in syscalls compared with 4 processors. EDIT2: that is in the initial post. Could you try whether #1785 gives an improvement in the meantime?

jeremiedbb commented 5 years ago

@brada4 Thanks for looking into it!

The CPU is an Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz. There are 2 sockets with 22 physical cores each, and hyperthreading is enabled.

There are 2 NUMA nodes. Limiting threads to a single NUMA node definitely helps, but not as much as limiting threads to physical cores. EDIT: they perform equally.

If I limit to 1 NUMA node, OpenBLAS uses 44 threads on the 22 physical cores of 1 NUMA node. If I limit to physical cores, OpenBLAS uses 44 threads on 44 physical cores across 2 NUMA nodes.

The second case runs twice as fast in wall time and makes half as many syscalls. However, both run slower than with only 4 cores, and both trigger far more syscalls.

Could you try whether #1785 gives an improvement in the meantime?

I think I did try it, since it was merged over a month ago and I cloned the repo 2 days ago for this test.

jeremiedbb commented 5 years ago

I edited my previous comment: at first I limited the cores the code could use, but I didn't limit the number of threads for OpenBLAS (I only used taskset). Apparently, OpenBLAS spawns as many threads as the machine has cores, even in a program restricted to a specific subset of cores by taskset.

brada4 commented 5 years ago

Another approach is to set OPENBLAS_NUM_THREADS=22 (or 11 if the Cluster-on-Die (CoD) configuration is enabled in the BIOS; check with numactl -H whether you have 2 or 4 NUMA nodes), then rely on the OS to perform proper NUMA bindings. It could quite likely happen that the 32MB input is dragged across the NUMA link at pessimal speed to CPUs in other NUMA nodes. EDIT: I actually do this kind of configuration on multi-user systems for a living, and for other HPC libraries too.

jeremiedbb commented 5 years ago

numactl is not installed and I don't have root access. However, lscpu is enough here; it shows 2 NUMA nodes (with their assigned core ids).

Setting OPENBLAS_NUM_THREADS=44 does not make the OS group all threads on the same NUMA node, and neither does setting OPENBLAS_NUM_THREADS=22. The latter does keep OpenBLAS off the hyperthreads, however, and gives a huge improvement in performance.

With OPENBLAS_NUM_THREADS=44 and the affinity manually set with taskset to a single NUMA node, performance is basically the same as when only setting OPENBLAS_NUM_THREADS. Setting OPENBLAS_NUM_THREADS=22 with the same affinity performs slightly better.

It could quite likely happen that the 32MB input is dragged across the NUMA link at pessimal speed to CPUs in other NUMA nodes.

I think you're right. With bigger inputs, performance is more like what I'd expect.

In summary:

Meanwhile, a simple solution is to set OPENBLAS_NUM_THREADS according to the input size and environment conditions.
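
For callers that would rather do this programmatically than via the environment, OpenBLAS also exports a runtime setter; a minimal sketch (the thread count 4 is just an example value, to be chosen from the problem size and topology):

/* Sketch: cap the OpenBLAS thread pool at runtime instead of via the
 * OPENBLAS_NUM_THREADS environment variable. openblas_set_num_threads
 * is exported by OpenBLAS (declared in its cblas.h). */
void openblas_set_num_threads(int num_threads);

void limit_blas_threads_for_small_input(void) {
    /* Cap the pool before the first BLAS call; overrides
     * OPENBLAS_NUM_THREADS for this process. */
    openblas_set_num_threads(4);
}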

Thanks for your answers @brada4 @martin-frbg !

martin-frbg commented 5 years ago

When OpenBLAS is called from a program whose affinity has been set manually (e.g. using taskset), openblas_get_num_threads will still report all existing cores on the machine, spawning too many threads. I don't know if this is a known issue, or if this is expected.

I thought this was solved already (#1155) and init.c should only enumerate the cpus in the current taskset, but I never tried my supposed fix on systems with multiple nodes. (Also, the added code has various exit points for older libc or other restrictions on cpu enumeration - maybe I messed up there.)
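
For reference, a minimal sketch (Linux/glibc only) of the enumeration that fix is supposed to perform - counting just the CPUs in the current taskset rather than every CPU on the machine:

/* Sketch: count only the CPUs in the current taskset, which is what
 * the #1155 fix in init.c is meant to do. Linux/glibc specific. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void) {
    cpu_set_t set;
    if (sched_getaffinity(0, sizeof(set), &set) == 0)
        printf("usable cpus: %d\n", CPU_COUNT(&set));
    return 0;
}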

brada4 commented 5 years ago

Syscalls will be wildly reduced in 0.3.4 - check #1785. There is nothing specific to NUMA nodes in that particular code, though applying common sense there were n! potential lock interactions, which got measurably worse with dozens of threads.

MKL has only done that since 2018; older versions take all hyperthreads. Do you have any authoritative documentation that hyperthreads are a waste? It is common sense, and a security problem for certain, but there is no clear statement that they should be left out for performance.

jeremiedbb commented 5 years ago

Do you have any authoritative documentation that hyperthreads are waste?

Absolutely not. As I said, it's just the result of my experiments. It should not be taken for granted, but I think it at least opens the discussion.

brada4 commented 5 years ago

There is no statement beyond "go and try this trick", e.g. https://software.intel.com/en-us/mkl-linux-developer-guide-managing-multi-core-performance

martin-frbg commented 5 years ago

Might be worthwhile to experiment with OMP_PROC_BIND and OMP_PLACES as well: http://pages.tacc.utexas.edu/~eijkhout/pcse/html/omp-affinity.html (And I also note that setting cpu affinity is disabled in the default configuration of OpenBLAS, as it would affect other code in situations where the library is dlopened by an interpreter prior to loading other stuff.)

brada4 commented 5 years ago

Actually there are no timings with regard to system time after the lock patch. Strange that an nproc-fold syscall reduction went unnoticed with 88 cores detected.

fenrus75 commented 5 years ago

Also, 22 is... unfortunate.

https://github.com/xianyi/OpenBLAS/pull/1846 which got merged as https://github.com/xianyi/OpenBLAS/commit/f1c02273cb9206a43968a0d813d1e9506612343c

is what I did on skylake to avoid some really bad behavior; one of these days I need to port a set of these fixes to haswell/broadwell.

Basically, what happens is that without the patch OpenBLAS just divides, so

2048 / 22 = 93

and 93 is a really awful number from a SIMD perspective... The patch in the commit gives much better behavior.
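
The rounding idea is roughly the following (a sketch with hypothetical names, not the exact code from the commit):

/* Sketch: round each thread's chunk up to a multiple of the preferred
 * width so SIMD kernels see friendly sizes, instead of plain division
 * (2048/22 = 93). Names are hypothetical. */
static int chunk_width(int total, int nthreads, int preferred) {
    int w = (total + nthreads - 1) / nthreads;          /* ceiling division */
    w = ((w + preferred - 1) / preferred) * preferred;  /* round up */
    return w;
}
/* chunk_width(2048, 22, 32) == 96, a multiple of the SIMD width;
 * the last thread takes whatever remains. */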

brada4 commented 5 years ago

Not only SIMD; I spotted the same with level-1 BLAS, e.g. messing up the basic (page, or at least cache-line) alignment for subsequent threads.

martin-frbg commented 5 years ago

Rereading #1846, it would seem sufficient to just copy the #define GEMM_PREFERED_SIZE 32 line from the SKYLAKEX section of param.h to the corresponding HASWELL block (around line 1511 of the current develop/0.3.4 version) and rerun the numpy linalg.svd test.
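
That is, the experiment amounts to something like this in param.h (a sketch; the exact surrounding lines and conditionals vary between versions):

#ifdef HASWELL
/* ... existing Haswell parameters ... */
#define GEMM_PREFERED_SIZE 32   /* copied from the SKYLAKEX section */
#endif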

fenrus75 commented 5 years ago

Maybe. One might need to do the equivalent of https://github.com/xianyi/OpenBLAS/pull/1846/commits/dcc5d6291e7b02761acfb6161c04ba1f8f25b502 to avoid the nasty "we did not think a work chunk could be 0-sized" behavior.

martin-frbg commented 5 years ago

Good question whether blas_quickdivide can legitimately return zero (unless the work size is zero, in which case I wonder why we would reach that code at all).

fenrus75 commented 5 years ago

Before the rounding to preferred sizes it wouldn't. But let's say the width is 64, with 3 CPUs and a preferred size of 32.

Before, you get 21 21 22 as the chunk sizes; with the preferred size you get 32 32 0.

(21 is one of those really nasty sizes for performance, so this case is a LOT faster with the rounding up.)

But due to the rounding up you can now get a 0-sized work chunk.
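
A toy loop makes the failure mode concrete (hypothetical code, just to illustrate the arithmetic):

/* Toy illustration of the 0-sized chunk: width 64 split across 3 CPUs
 * with preferred size 32. Plain division gives 21 21 22; rounding each
 * chunk up to 32 gives 32 32 0, so the last thread receives no work. */
#include <stdio.h>

int main(void) {
    int remaining = 64, nthreads = 3, preferred = 32;
    for (int t = 0; t < nthreads; t++) {
        int w = (remaining + (nthreads - t) - 1) / (nthreads - t); /* ceiling */
        w = ((w + preferred - 1) / preferred) * preferred;         /* round up */
        if (w > remaining) w = remaining;
        printf("thread %d: width %d\n", t, w);                     /* 32, 32, 0 */
        remaining -= w;
    }
    return 0;
}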

martin-frbg commented 5 years ago

Not sure I get this; as the (rounded) width gets deducted from the initial m (or n), shouldn't the divider loop exit after assigning two workloads of 32 each?

brada4 commented 5 years ago

I will check my alignment theory tomorrow. @fenrus75 there is no alignment/chunk-preference signaling in most cases; say, a level-1 BLAS thread only knows the number of elements in its vector, but not whether they are S/C/D/Z, i.e. their size.

brada4 commented 5 years ago

When the input size is small, OpenBLAS still uses all available cores, resulting in too many syscalls

@jeremiedbb could you try this in common.h:

#ifndef YIELDING
// #define YIELDING     sched_yield()
#define YIELDING {};
#endif

(strace -o strace.out benchmark/dgemm.goto 1024 1024 1024)

Modern thread libraries de-schedule idle threads implicitly; there is no need to force it. That sched_yield is the syscall that was made 20000 times in a 5s DGEMM run.

@martin-frbg I think some small sleep should go here - it is effectively polling the thread status. The yielding syscall consumes CPU cycles non-productively, where a trivial sleep would just set a timer for the future. From blas_server.c:

      while (tsqq) {
        YIELDING; /* <-- the 20000 syscalls */
        pthread_mutex_lock(&thread_status[queue->assigned].lock);
        tsqq = thread_status[queue->assigned].queue;
        pthread_mutex_unlock(&thread_status[queue->assigned].lock);
      }

martin-frbg commented 5 years ago

@brada4 remember this was tried before and the results were inconclusive at best.

brada4 commented 5 years ago

I think it is around the places where the now-retired sched.compat_yield sysctl used to operate; now we are stuck in the world of the non-compat one. I think pthread_cond_wait is the pthread equivalent that avoids the busy-wait syscalls.
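
For illustration, the condition-variable pattern would look roughly like this (a simplified sketch, not OpenBLAS's actual queue and status types):

/* Sketch of the condition-variable alternative to the YIELDING busy-wait.
 * The waiter sleeps in the kernel until signalled instead of spinning on
 * sched_yield. Simplified; not OpenBLAS's real data structures. */
#include <pthread.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static void *queue;                       /* stands in for thread_status[].queue */

void wait_for_queue_empty(void) {
    pthread_mutex_lock(&lock);
    while (queue != NULL)                 /* guard against spurious wakeups */
        pthread_cond_wait(&cond, &lock);  /* atomically unlocks and sleeps */
    pthread_mutex_unlock(&lock);
}

void mark_queue_done(void) {
    pthread_mutex_lock(&lock);
    queue = NULL;
    pthread_cond_signal(&cond);           /* wake the waiter */
    pthread_mutex_unlock(&lock);
}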

treadalus commented 5 years ago

I just scanned this thread. In my experience on AWS, hyperthreading causes performance issues; I use scripts to turn it off. Btw, I wrote a wrapper for Octave to control the number of threads that would fit nicely under OpenBLAS/benchmark/scripts/OCTAVE/, but I'm not sure how best to submit it.

fenrus75 commented 5 years ago

I've measured, and for sgemm/sgemm it was a small gain, but not a large loss or anything.

fenrus75 commented 5 years ago

I will retract my statement based on (new) data.

I have now encountered cases where hyperthreading with OpenBLAS is a severe hindrance.

HT is fine if both threads are doing basic math operations... both do productive work, and while the performance increase is limited, they don't get in each other's way.

HOWEVER

OpenBLAS at times puts a thread into a polling mode doing yield() in a tight loop. This tight loop makes a system call (with all the nasty TLB and cache effects that entails), takes many kernel locks along the way, etc. The result is that this polling thread utterly destroys the math performance of the other thread on the same core... sometimes in the 2x range.

martin-frbg commented 5 years ago

That brings us back to #1560 and #900 - how does the same scenario look with the YIELDING macro from common.h set to a couple of nops (as is done on ARM, POWER and a somewhat random selection of AMD cpus)?

fenrus75 commented 5 years ago

on x86 (both Intel and AMD) one would use "pause", not "nop" (pause is also a hyperthreading / virtualization / power-management hint).
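
So the experiment would be something like this in common.h (a sketch; the real header has per-architecture conditionals around it):

#ifndef YIELDING
/* "pause" keeps the spin-wait in user space and hints the core that its
 * sibling hyperthread may run; no sched_yield() syscall involved. */
#define YIELDING __asm__ __volatile__ ("pause")
#endif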

I'll test. My initial fix was to count cores, not hyperthreads.

There is also a root cause behind this that I am not sure of yet; the symptom is lots of time spent in yield(), but the cause is likely some other bug. I've observed in this specific scenario that a lot of "M = 0" dgemm calls get made by the level 3 syrk code... maybe that causes some threads to finish early and just spin.

fenrus75 commented 5 years ago

The nop (pause) does not help much at all (within measurement noise).

The HT fix is still worth almost 2x.

fenrus75 commented 5 years ago

The next experiment was to make blas_thread_server not yield at all, but always use condition variables. Small gain both with and without the HT fix...

(but I have a modern glibc; I suspect it may be spinning internally before blocking)

brada4 commented 5 years ago

The visible problem is that sched_yield makes a syscall while a nop does nothing. Some small delay like 0.0001s could do the same without grinding CPU cycles out of turbo mode.

fenrus75 commented 5 years ago

That was in the tree very briefly... but got reverted since it hurt as well.

In the case that seems to matter the most, just using the condition variable, rather than polling first, works just as well as polling.

There's also a somewhat different problem: if I use taskset to reduce the number of cpus, performance goes up. (I'll do a sweep tomorrow to see if there's a sweet spot, and then whether we can compute where the sweet spot is based on matrix size or something.)

fenrus75 commented 5 years ago

On a 10-core/20-thread system, data below. All numbers are cost relative to the single-threaded (1.000) case; lower is better.

| Size | 1 core | 2 cores | 3 cores | 4 cores | 5 cores | 6 cores | 7 cores | 8 cores | 9 cores | 10 cores | 11 threads | 12 threads | 13 threads | 14 threads | 15 threads | 16 threads | 17 threads | 18 threads | 19 threads | 20 threads |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 96x96x96 | 1.000 | 1.181 | 1.517 | 1.087 | 1.146 | 1.496 | 1.540 | 1.065 | 1.135 | 1.144 | 1.152 | 1.365 | 1.494 | 1.515 | 1.495 | 1.068 | 1.159 | 1.159 | 1.141 | 1.078 |
| 128x128x128 | 1.000 | 0.781 | 1.062 | 0.657 | 0.628 | 0.912 | 0.911 | 0.675 | 0.763 | 0.632 | 0.655 | 0.841 | 0.899 | 0.896 | 0.914 | 0.675 | 0.755 | 0.760 | 0.768 | 0.610 |
| 192x192x192 | 1.000 | 0.719 | 0.669 | 0.648 | 0.684 | 0.440 | 0.406 | 0.535 | 0.515 | 0.664 | 0.670 | 0.441 | 0.440 | 0.401 | 0.403 | 0.539 | 0.539 | 0.539 | 0.541 | 0.656 |
| 256x256x256 | 1.000 | 0.687 | 0.612 | 0.415 | 0.455 | 0.472 | 0.478 | 0.331 | 0.295 | 0.362 | 0.359 | 0.432 | 0.444 | 0.495 | 0.487 | 0.333 | 0.331 | 0.295 | 0.293 | 0.366 |
| 384x384x384 | 1.000 | 0.606 | 0.429 | 0.350 | 0.360 | 0.256 | 0.258 | 0.265 | 0.277 | 0.278 | 0.288 | 0.291 | 0.259 | 0.262 | 0.269 | 0.281 | 0.271 | 0.290 | 0.285 | 0.296 |
| 512x512x512 | 1.000 | 0.588 | 0.440 | 0.323 | 0.326 | 0.267 | 0.262 | 0.206 | 0.205 | 0.210 | 0.261 | 0.284 | 0.292 | 0.295 | 0.293 | 0.216 | 0.194 | 0.192 | 0.193 | 0.197 |
| 768x768x768 | 1.000 | 0.259 | 0.180 | 0.141 | 0.126 | 0.102 | 0.101 | 0.082 | 0.084 | 0.086 | 0.133 | 0.100 | 0.101 | 0.091 | 0.091 | 0.090 | 0.088 | 0.089 | 0.089 | 0.088 |
| 1024x1024x1024 | 1.000 | 0.563 | 0.398 | 0.298 | 0.276 | 0.235 | 0.204 | 0.169 | 0.157 | 0.162 | 0.212 | 0.216 | 0.216 | 0.218 | 0.217 | 0.162 | 0.161 | 0.161 | 0.162 | 0.163 |
| 1536x1536x1536 | 1.000 | 0.560 | 0.192 | 0.146 | 0.127 | 0.103 | 0.094 | 0.084 | 0.083 | 0.068 | 0.118 | 0.100 | 0.094 | 0.093 | 0.094 | 0.080 | 0.080 | 0.080 | 0.074 | 0.073 |
| 2048x2048x2048 | 1.000 | 0.366 | 0.373 | 0.202 | 0.167 | 0.141 | 0.125 | 0.110 | 0.105 | 0.101 | 0.145 | 0.147 | 0.133 | 0.128 | 0.127 | 0.110 | 0.106 | 0.104 | 0.101 | 0.102 |
thijssteel commented 4 years ago

I think that besides pure performance, dynamic selection of the number of threads has other advantages:

I also think it would be interesting to look at performance for rectangular matrices. The QZ algorithm contains many (k x k x n) multiplications with k << n, and I've found those to be particularly problematic when the threads aren't limited.