LeelaChessZero / lc0

The rewritten engine, originally for TensorFlow. Now all other backends have been ported here.
GNU General Public License v3.0

Poor scalability on Multi-Core CPU systems (OpenBLAS to blame) #652

Closed Technologov closed 4 years ago

Technologov commented 5 years ago

Looking at Leela again, I can tell you that its basic algorithm is scalable, since it can run on a GPU. A non-parallel algorithm could not.

But its implementation on a CPU is a disaster -- the OpenBLAS backend is to blame.

I am calling for a replacement for OpenBLAS on multi-core CPU systems.

Take a look: a brand-new dual-Xeon Linux server.

BigFenix: Dual Intel(R) Xeon(R) CPU E5-2697A v4 @ 2.60GHz (32 cores; 64 threads), Ubuntu 18.04 Linux x64 (Broadwell), OpenBLAS backend
Engine: Leela 0.20.0 + Neural Network ID 32400 (20/256)
Command: ~/leela-0.20/lc0 benchmark -t 32

thread- 1: Nodes/second : 42
thread- 4: Nodes/second : 142
thread- 8: Nodes/second : 220
thread-16: Nodes/second : 270 (!)
thread-32: Nodes/second : 222
thread-64: Nodes/second : 115 (WOOT !!! OMG !!! Such a slowdown on a hyper-threaded CPU !)

For the sake of comparison, here is what a well-parallelized algorithm looks like on a CPU (stockfish-9):

BigFenix: Dual Intel(R) Xeon(R) CPU E5-2697A v4 @ 2.60GHz (32 cores; 64 threads), Linux x64
Command: stockfish-9-linux/Linux/stockfish-9-bmi2 bench 16 64

thread- 1: Nodes/second : 2204312
thread- 2: Nodes/second : 4216857
thread- 4: Nodes/second : 8221277
thread- 8: Nodes/second : 16288761
thread-16: Nodes/second : 32518175
thread-32: Nodes/second : 56162013 (25.5x faster ! Yay !)
thread-64: Nodes/second : 71519468 (32.4x faster ! Yay !)

stockfish-9 scales very well on servers and workstations, up to 64 processors.

Another example of a poorly written multi-core algorithm: Stockfish again ! Version 6 (stockfish-6).

BigFenix: Dual Intel(R) Xeon(R) CPU E5-2697A v4 @ 2.60GHz (32 cores; 64 threads), Linux x64
Command: stockfish-6-linux/Linux/stockfish_6_x64_bmi2 bench 16 $CPU_CORES

thread- 1: Nodes/second : 2342722
thread- 2: Nodes/second : 3738296
thread- 4: Nodes/second : 4828941
thread- 8: Nodes/second : 3747961 (Regression !)
thread-16: Nodes/second : 2706391
thread-32: Nodes/second : 1671791 (Disaster ! Performance falls below the 1-core result !)
thread-64: Nodes/second : 770190 (Terrible.)

Basically, stockfish-6 cannot scale above 4 or 6 cores, and its performance drops like a stone.

gcp commented 5 years ago

OpenBLAS threading is a huge nuisance, but I'm not sure you have evidence here that it is the problem. Is lc0 tuned for more than 32 threads?

Upstream Leela has Eigen as an alternate backend, but it's a bit more annoying for release packages because it only supports the CPU architecture it's built on. It's not a dynamic lib like OpenBLAS.

frpays commented 5 years ago

I am not sure OpenBLAS is to blame. You can build the BLAS backend with MKL or Apple Accelerate, and all of these libraries give comparable results. Note that BLAS threading is switched off by default (it's a backend parameter), except for Apple Accelerate, where it is not in our hands.

The scalability issue is not a BLAS threading issue. It is about distributing node computations over single-threaded backends through the appropriate hubs (mux, redux). There are also parameters that will help you find optimal performance (max-prefetch). My impression is that this is an ongoing issue; see https://github.com/LeelaChessZero/lc0/issues/35.
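To make the "hub" idea concrete, here is a minimal sketch of the general pattern, not lc0's actual code (the EvalHub/EvalRequest names are mine): search threads enqueue evaluation requests, and a fixed pool of workers, each owning a single-threaded backend, drains the queue. All parallelism then lives in the engine, never inside the BLAS calls.

```cpp
// Sketch of a "mux"-style hub: search threads enqueue NN evaluation requests,
// a fixed pool of single-threaded backend workers drains the queue.
// EvalRequest/EvalHub are illustrative names, not lc0 classes.
#include <condition_variable>
#include <future>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

struct EvalRequest {
  std::vector<float> planes;   // encoded position
  std::promise<float> value;   // result handed back to the search thread
};

class EvalHub {
 public:
  explicit EvalHub(int workers) {
    for (int i = 0; i < workers; ++i) workers_.emplace_back([this] { Worker(); });
  }
  ~EvalHub() {
    { std::lock_guard<std::mutex> lk(mu_); done_ = true; }
    cv_.notify_all();
    for (auto& t : workers_) t.join();
  }
  std::future<float> Enqueue(EvalRequest req) {
    auto fut = req.value.get_future();
    { std::lock_guard<std::mutex> lk(mu_); queue_.push(std::move(req)); }
    cv_.notify_one();
    return fut;
  }

 private:
  void Worker() {
    for (;;) {
      EvalRequest req;
      {
        std::unique_lock<std::mutex> lk(mu_);
        cv_.wait(lk, [this] { return done_ || !queue_.empty(); });
        if (done_ && queue_.empty()) return;
        req = std::move(queue_.front());
        queue_.pop();
      }
      // Each worker owns one single-threaded BLAS/Eigen context, so the GEMMs
      // it issues never fan out into their own thread pools.
      req.value.set_value(0.0f);  // placeholder for the real network evaluation
    }
  }
  std::mutex mu_;
  std::condition_variable cv_;
  std::queue<EvalRequest> queue_;
  std::vector<std::thread> workers_;
  bool done_ = false;
};
```

As I understand frpays' comment, lc0's mux/demux hubs and the max-prefetch parameter tune how work is batched and distributed over workers of this kind.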

mratsim commented 5 years ago

Hi there,

Chiming in as I was looking into threading strategies for BLAS. I know nothing about the LeelaChessZero architecture, but I have in-depth knowledge of BLAS optimizations (see https://github.com/pytorch/glow/issues/1749#issuecomment-459265067).

1. Matrix size

The matrix sizes in LeelaChessZero may be too small to partition. Below 128x128 there is no gain from parallelization; instead you are splitting a matrix that already fits in the L1/L2 cache across multiple cores, and the cost of the copies is not amortized over a long enough computation.

Solution: parallelize at a higher level. Also, a naive matrix multiplication might be faster due to its much lower overhead (see the sketch below). Alternatively you can use libxsmm, but it is x86-only.

Note that a chess board is 8x8; splitting an 8x8 computation across 32 cores makes no sense.
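As a rough illustration of the overhead argument (my sketch, not lc0 code): for matrices well below 128x128, a plain triple loop already runs out of cache, and any thread fan-out costs more than the arithmetic it would save.

```cpp
// Naive single-threaded GEMM, C = A * B, for small square matrices.
// At n = 64 the three matrices total ~48 KB and fit in L2 cache, so splitting
// the work across 32 cores mostly buys synchronization and cache traffic.
#include <vector>

void naive_gemm(int n, const float* A, const float* B, float* C) {
  for (int i = 0; i < n; ++i) {
    for (int j = 0; j < n; ++j) {
      float acc = 0.0f;
      for (int k = 0; k < n; ++k) acc += A[i * n + k] * B[k * n + j];
      C[i * n + j] = acc;
    }
  }
}

int main() {
  const int n = 64;  // well below the ~128x128 threshold mentioned above
  std::vector<float> A(n * n, 1.0f), B(n * n, 1.0f), C(n * n, 0.0f);
  naive_gemm(n, A.data(), B.data(), C.data());
  return static_cast<int>(C[0]) == n ? 0 : 1;  // each entry of C should be 64
}
```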

2. NUMA

You have 2 sockets, which is usually quite tricky to parallelize: the library needs to make sure that threads are not migrated between sockets and cores, and it is likely that the threading library does not allow enough control over thread placement. Also, the BLIS paper "Anatomy of High-Performance Many-Threaded Matrix Multiplication" recommends partitioning across sockets when one of the dimensions exceeds 4000 (the parameter called NC in the paper).
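A hedged sketch of the thread-placement point, assuming Linux/glibc and that socket 0 exposes cores 0-15 (verify the real numbering with lscpu): pinning each worker to one core of one socket prevents the scheduler from migrating it across the NUMA boundary.

```cpp
// Pin worker threads to the cores of a single socket so they are never
// migrated across the NUMA boundary. Core numbering is an assumption here;
// check it with `lscpu -e` or hwloc before hard-coding ranges.
#include <pthread.h>
#include <sched.h>
#include <thread>
#include <vector>

void pin_to_core(std::thread& t, int core) {
  cpu_set_t set;
  CPU_ZERO(&set);
  CPU_SET(core, &set);
  pthread_setaffinity_np(t.native_handle(), sizeof(set), &set);
}

int main() {
  const int kCoresPerSocket = 16;  // E5-2697A v4 has 16 physical cores per socket
  std::vector<std::thread> workers;
  for (int core = 0; core < kCoresPerSocket; ++core) {
    workers.emplace_back([] { /* run one single-threaded backend worker here */ });
    pin_to_core(workers.back(), core);  // assumes cores 0-15 belong to socket 0
  }
  for (auto& t : workers) t.join();
  return 0;
}
```

A similar effect can be had from outside the process with numactl --cpunodebind=0 --membind=0, which also keeps memory allocations on the local NUMA node.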

3. Eigen

Eigen performs about the same as OpenBLAS, sometimes worse, sometimes better. Note that the benchmarks below are single-threaded.

See: http://apfel.mathematik.uni-ulm.de/~lehn/sghpc/gemm/page14/index.html (quite old benchmark)

http://blog.mir.dlang.io/glas/benchmark/openblas/2016/09/23/glas-gemm-benchmark.html

https://github.com/attractivechaos/matmul

gcp commented 5 years ago

> Chiming in as I was looking into threading strategies for BLAS.

The BLAS shouldn't be threaded at all, as was explicitly pointed out in the post before yours (it's disabled!). The engine should be capable of generating enough threads with independent work.

When I complained about OpenBLAS threading, I was referring to its compile-time limits on the maximum number of threads, which apply even when you're not using the threading at all (MAX_CPUS etc.). Eigen doesn't have this problem. But this has nothing to do with threading the BLAS computation itself, which something like lc0 would want to strongly avoid.
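For anyone reproducing the numbers above, a minimal sketch of keeping OpenBLAS itself single-threaded, assuming the binary links against OpenBLAS (these are standard OpenBLAS controls, not lc0-specific code):

```cpp
// Keep OpenBLAS single-threaded so that all parallelism comes from the
// engine's own search/backend threads. Alternatively, launch the process
// with OPENBLAS_NUM_THREADS=1 set in the environment.
#include <cblas.h>

extern "C" void openblas_set_num_threads(int num_threads);  // OpenBLAS extension

int main() {
  openblas_set_num_threads(1);  // subsequent BLAS calls stay on the calling thread
  // ... cblas_sgemm() etc. issued from the engine's worker threads ...
  return 0;
}
```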

Naphthalin commented 4 years ago

Given the performance on CCC/TCEC CPU hardware, I think this issue can be considered solved.