OpenMathLib / OpenBLAS

OpenBLAS is an optimized BLAS library based on GotoBLAS2 1.13 BSD version.
http://www.openblas.net
BSD 3-Clause "New" or "Revised" License

sgemv() not scaling well on AMD ZEN 2 / EPYC 7742 #2381

Open eastwindow opened 4 years ago

eastwindow commented 4 years ago

Hi,

I benchmarked the sgemv() of OpenBLAS 0.3.8 on a dual AMD EPYC 7742 (64 cores each) with a 4000x4000 matrix and find that performance maxes out at about 48 threads. The peak of the histogram over 1,000,000 repetitions in a for-loop is at around 43 µs. With 64 threads, performance drops to about 48 µs, and to around 70 µs (with a lot of jitter, up to about 160 µs) with 128 threads. In the 48- and 64-thread cases I limit the threads to one CPU using taskset.

If I run 4 separate processes simultaneously with 32 threads each (each process is limited to an exclusive set of 32 cores) doing a 1000x4000 MVM, the timing for each MVM gets down to around 22 µs, which is about what I would expect from the scaling I see up to 48 cores. (If only one 32 thread process is running doing the 1000x4000 MVM, the timing is around 18 µs).
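
For reference, the measurement loop has roughly the following shape (a simplified sketch, not the actual benchmark code: the data initialization and timing here are illustrative, and the thread count is controlled externally via OPENBLAS_NUM_THREADS and taskset):

#include <cblas.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N    4000
#define REPS 1000000

int main(void)
{
    float *A = malloc((size_t)N * N * sizeof(float));  /* 4000x4000 floats, 64 MB */
    float *x = malloc(N * sizeof(float));
    float *y = malloc(N * sizeof(float));
    for (size_t i = 0; i < (size_t)N * N; i++) A[i] = 1.0f;
    for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 0.0f; }

    for (int r = 0; r < REPS; r++) {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        /* y = 1.0 * A * x + 0.0 * y */
        cblas_sgemv(CblasRowMajor, CblasNoTrans, N, N,
                    1.0f, A, N, x, 1, 0.0f, y, 1);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double us = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_nsec - t0.tv_nsec) * 1e-3;
        printf("%.1f\n", us);   /* one per-call time in µs per line, for the histogram */
    }
    free(A); free(x); free(y);
    return 0;
}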

brada4 commented 4 years ago

A 4000x4000 matrix is 128 MB, which is small. A typical NUMA stride, i.e. a chunk of contiguous RAM local to one CPU, is on the order of 32 or 64 MB. OpenBLAS does not have logic to be optimal here; you are doing the right thing by employing a single NUMA node for the task.

eastwindow commented 4 years ago

I'm using 32-bit floats, so the matrix is actually only 64 MB. Each of these AMD EPYC 7742 CPUs comes with 256 MiB of L3 cache, so shouldn't the complete matrix reside in the cache after a few loop iterations, making any RAM non-uniformity irrelevant? The scaling (with one process) already ends at about 3/4 of one CPU (48 of 64 cores), so no RAM NUMA is involved here.

brada4 commented 4 years ago

There is a NUMA node per memory controller inside the chip; run numactl -H to check. NUMA nodes do not manage chunks of data smaller than many megabytes (orders of magnitude bigger than a cache line or a 4k page), and thus data gets pulled between nearby caches over some internal bus.

eastwindow commented 4 years ago

In this CPU, 4 cores share 16 MiB of L3 cache. When you write "gets pulled between close caches", does that mean the L3 caches get evicted and refilled with other parts of the matrix all the time?

I was thinking - naively? - that OpenBLAS, when using 128 threads, would break the 4000x4000 sgemv() up into "4000/128 by 4000" little MVMs, so that each core always deals with the same 32x4000 matrix chunk. That chunk is 500 KiB, which would just about fit into the 512 KiB of L2 cache per core.

brada4 commented 4 years ago

OpenBLAS does not have instrumentation for such shared caches :-( They have been around since Bulldozer and Ivy Bridge. If your task is fixed you can try manipulating the thread count per call, but in the general case using all cores at once will not be optimal.

eastwindow commented 4 years ago

Is the threading in OpenBLAS not cache-oblivious? How does an MVM get split up and distributed across multiple threads and cores?

brada4 commented 4 years ago

I was thinking at one point of implementing something that limits the number of threads so that each cache is theoretically between half and completely full, but got scared by the number of changes in sight (all of the level 2/3 BLAS). See the files in interface/ - there is no way to hint at fewer threads further down the road, just all or one. And big infrastructural changes usually land somewhere between breaking things and revealing other bugs. I will try to mash up something with constants for your specific CPU so you can play around measuring. My paid time is pressing hobbies out, so it may have to wait until next week - no promise, though.

eastwindow commented 4 years ago

Thank you! I'm not quite following, though. I can limit the number of threads by the various means OpenBLAS provides (that's what I did for my benchmarking); what I'm wondering is how OpenBLAS distributes the workload across the threads.

A parallelized version of a naive MVM implementation using OpenMP could be

#pragma omp parallel for schedule( static )
for( int r=0; r<nrows; r++ )
{
  float sum = 0.0f;                       // dot product of row r with the vector
  #pragma omp simd reduction(+:sum)
  for( int c=0; c<ncols; c++ )
    sum += matrix[r][c] * vector[c];
  result[r] = sum;
}

If, for the sake of argument, nrows = 4096 here instead of 4000, and OMP_NUM_THREADS=128, the first thread will do r = [0..31], the second thread r = [32..63], the third r = [64..95], and so on.

Is this not how OpenBLAS parallelizes an MVM? Or does it do something similar to #pragma omp parallel for schedule( static, 1 ), where the first thread does r = [0, 128, 256, ...], the second thread r = [1, 129, 257, ...], the third r = [2, 130, 258, ...], and so on?
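
Written out, that interleaved variant would only change the schedule clause of the loop above:

#pragma omp parallel for schedule( static, 1 )
for( int r=0; r<nrows; r++ )
{
  float sum = 0.0f;
  #pragma omp simd reduction(+:sum)
  for( int c=0; c<ncols; c++ )
    sum += matrix[r][c] * vector[c];
  result[r] = sum;  // with 128 threads: thread 0 gets rows 0,128,256,..., thread 1 gets 1,129,257,...
}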

brada4 commented 4 years ago

The #pragma omp simd inner loop is replaced by assembly code that is about 10x faster.

eastwindow commented 4 years ago

I know. I'm wondering about the outer loop. How does OpenBLAS break up and distribute this?

brada4 commented 4 years ago

The outer loop is not x++, it is x+=8 or so - see the assembly kernel names. The problem is that (multi-layer) NUMA was not a big consideration back in the GotoBLAS days, so in practice one 2/4/8-core cluster resembles an SMP server of that era, and you can have 128/x of those in your CPU, like in virtual machines. What I am about to do is keep small data from spilling between the caches assumed for each core, by reducing the number of threads employed. That will use only 2-3 cores of a 4-core cluster in border cases, but (I hope) be better than today.

eastwindow commented 3 years ago

Hi @brada4, did you have any success with this? Thanks!

brada4 commented 3 years ago

Got other errands in the meantime; the change should go in interface/gemm.c.

martin-frbg commented 3 years ago

#2846 is somewhat related, and I have a tentative PR for it as #3163 - however this (and interface/gemv.c in general) addresses the low end of problem sizes (the switch from single- to multi-threading) regardless of the detailed CPU architecture. A gradual increase of the number of active threads would come at the expense of more conditional coding, with the transition points possibly tailored to individual CPU models (i.e., practically unsustainable).

There may need to be related changes in driver/level2/gemv_thread.c as well

brada4 commented 3 years ago

My idea was to add an additional dynamic constraint so that each thread has at least an L3 cache's worth of data to process, to avoid artificially spilling data between the caches of other cores. Yes, obviously L3 has been shared between clusters of cores since roughly Sandy Bridge times. I know that the reported L3 size is unreliable at best inside some virtualisation VMs etc.

eastwindow commented 3 years ago

In this CPU, the L3 (16 MB) of one core complex doesn't spill over into the L3 of another core complex, but into RAM.

TiborGY commented 3 years ago

A gradual increase of the number of active threads would come at the expense of more conditional coding, with the transition points possibly tailored to individual CPU models (i.e., practically unsustainable).

Hmm, maybe the only practical way to do that would be something like what ATLAS tried to do: create some sort of utility that can search (at least part of) the space of possible tuning-factor combinations at build time. Users looking for maximum performance could run the tuning utility on their own hardware; this would also take into account variability in memory speed (e.g. DDR4-2133 vs. DDR4-3600).

But if they also contribute their tuning results here, it might just be feasible to maintain an acceptably good tuning database for multiarch builds and whatnot.
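
As a very rough illustration of such a tuning step, a utility could simply sweep the thread count for a representative problem size and keep the fastest setting. openblas_set_num_threads() and openblas_get_num_procs() are existing OpenBLAS functions; everything else below is a hypothetical sketch, not part of any current build machinery:

#include <cblas.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* time a single sgemv call in microseconds */
static double time_sgemv(int n, const float *A, const float *x, float *y)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    cblas_sgemv(CblasRowMajor, CblasNoTrans, n, n, 1.0f, A, n, x, 1, 0.0f, y, 1);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_nsec - t0.tv_nsec) * 1e-3;
}

int main(void)
{
    const int n = 4000, reps = 1000;
    float *A = calloc((size_t)n * n, sizeof(float));
    float *x = calloc(n, sizeof(float));
    float *y = calloc(n, sizeof(float));

    int best_threads = 1;
    double best_us = 1e30;
    /* powers of two only; a real tuner would also try e.g. 48 or one CCX worth of cores */
    for (int t = 1; t <= openblas_get_num_procs(); t *= 2) {
        openblas_set_num_threads(t);
        double best = 1e30;
        for (int r = 0; r < reps; r++) {       /* keep the fastest repetition */
            double us = time_sgemv(n, A, x, y);
            if (us < best) best = us;
        }
        printf("%3d threads: %8.1f us\n", t, best);
        if (best < best_us) { best_us = best; best_threads = t; }
    }
    printf("best: %d threads\n", best_threads);
    free(A); free(x); free(y);
    return 0;
}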

brada4 commented 3 years ago

The way it chooses the thread count now is a single size threshold between one core and all cores. I was thinking of avoiding the spilling between caches by adding an extra constraint on top of the present logic.

eastwindow commented 1 year ago

Any updates on this issue?

brada4 commented 1 year ago

Nothing beyond the observation that the threading-threshold heuristic fails on big machines and real NUMA partitions; to largely avoid both, lock your task into a single NUMA node.

eastwindow commented 1 year ago

An L3 NUMA node is only 4 cores on AMD Zen 2. "Lock your task into a single numa node" - I'm not sure I understand what you mean; I assume you're not suggesting limiting the application to 4 cores.

I was able to test with MKL on the dual AMD 7742. While it also breaks down when crossing CPU boundaries, it performs a lot better, and its maximum speed is reached at full utilization (64 cores) of one CPU, whereas OpenBLAS maxes out at 48 cores.

OpenBLAS could benefit a lot on modern many-core CPUs (Zen 4 with 96 cores is coming) if some sort of cache/NUMA awareness could be built in, even if it were only a runtime or compile-time user configuration. Or perhaps OpenBLAS could use libnuma to query the NUMA topology it is running on and apply some heuristics?
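
For illustration, the topology query itself is cheap with libnuma; numa_available(), numa_num_configured_nodes(), numa_num_configured_cpus(), numa_node_of_cpu() and numa_node_size64() used below are real libnuma calls (link with -lnuma), while how OpenBLAS would act on the result is of course the open question:

#include <numa.h>
#include <stdio.h>

int main(void)
{
    if (numa_available() < 0) {
        printf("no NUMA support\n");
        return 1;
    }
    int nodes = numa_num_configured_nodes();
    int cpus  = numa_num_configured_cpus();
    printf("%d NUMA nodes, %d CPUs\n", nodes, cpus);

    /* which node does each CPU belong to? (whether the 4-core CCX granularity
       shows up here depends on BIOS settings such as NPS4 / "L3 as NUMA") */
    for (int cpu = 0; cpu < cpus; cpu++)
        printf("cpu %3d -> node %d\n", cpu, numa_node_of_cpu(cpu));

    /* memory per node, i.e. where a matrix chunk could be placed */
    for (int node = 0; node < nodes; node++) {
        long long freep;
        long long size = numa_node_size64(node, &freep);
        printf("node %d: %lld MB total, %lld MB free\n",
               node, size >> 20, freep >> 20);
    }
    return 0;
}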

brada4 commented 1 year ago

At a small scale it cannot benefit; actually the attempt to use all cores for a small task botches it in the way you see. Basically: the CPU and its caches address memory in 64-byte cache lines; the system manages memory in 4k pages; NUMA assumes 32/64/128 MB is local to a NUMA node. Whatever gets ripped into pieces between multiple CPUs causes wildly inefficient copying and locking.

The fix should be in interface/gemv.c: use fewer than the available CPUs for smaller problems (smaller than which, though ;-) ), e.g. by guaranteeing a page of input and output per CPU core or so. The search space is huge and you would need many CPU generations to test it at length, but you can try heuristics for your CPU based on that.

TiborGY commented 1 year ago

I think the bottom line is that so far nobody has had the combination of time, motivation and skill to improve this. This is often the answer in open-source software: patches are welcome.

brada4 commented 1 year ago

Just trying to encourage somebody to measure in a direction I never had time to try.

eastwindow commented 1 year ago

Basically: the CPU and its caches address memory in 64-byte cache lines; the system manages memory in 4k pages; NUMA assumes 32/64/128 MB is local to a NUMA node. Whatever gets ripped into pieces between multiple CPUs causes wildly inefficient copying and locking.

Right. This is all known in advance (at runtime or compile time) and can be obtained via specific APIs. Could this information be used to partition the problem and the threads? Maybe by introducing a custom allocator that creates an optimized matrix storage (with data duplication if useful), a copy function that fills this storage with user data, and a customized GEMV function that operates on this special matrix format?
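
A rough sketch of that idea is below. Nothing in it is existing OpenBLAS API: numa_matrix_create() and numa_sgemv() are hypothetical names, numa_alloc_onnode() and numa_run_on_node() are libnuma calls, and the per-chunk cblas_sgemv is assumed to run single-threaded (e.g. after openblas_set_num_threads(1)) so that the outer OpenMP loop does the parallelization. The matrix is split row-wise into one chunk per NUMA node at setup time, and each worker only ever touches its local chunk:

#include <cblas.h>
#include <numa.h>      /* link with -lnuma */
#include <stdlib.h>
#include <string.h>

typedef struct {
    int     nodes;   /* number of NUMA nodes used           */
    int     n;       /* matrix is n x n, row-major          */
    int    *rows;    /* rows assigned to each node          */
    float **chunk;   /* chunk[i]: row block of A on node i  */
} numa_matrix;

/* "custom allocator + copy": place one row block of A on each NUMA node */
static numa_matrix numa_matrix_create(const float *A, int n)
{
    numa_matrix M = { numa_num_configured_nodes(), n, NULL, NULL };
    M.rows  = malloc(M.nodes * sizeof(int));
    M.chunk = malloc(M.nodes * sizeof(float *));
    int done = 0;
    for (int i = 0; i < M.nodes; i++) {
        M.rows[i] = (n - done) / (M.nodes - i);   /* near-even row split */
        size_t bytes = (size_t)M.rows[i] * n * sizeof(float);
        M.chunk[i] = numa_alloc_onnode(bytes, i);
        memcpy(M.chunk[i], A + (size_t)done * n, bytes);
        done += M.rows[i];
    }
    return M;
}

/* "custom GEMV": one worker per node, each multiplying only its local rows */
static void numa_sgemv(const numa_matrix *M, const float *x, float *y)
{
    #pragma omp parallel for num_threads(M->nodes) schedule(static, 1)
    for (int i = 0; i < M->nodes; i++) {
        numa_run_on_node(i);          /* keep this worker close to its chunk */
        int first = 0;
        for (int k = 0; k < i; k++) first += M->rows[k];
        /* each sgemv call is assumed single-threaded here */
        cblas_sgemv(CblasRowMajor, CblasNoTrans, M->rows[i], M->n,
                    1.0f, M->chunk[i], M->n, x, 1, 0.0f, y + first, 1);
    }
}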

brada4 commented 1 year ago

It's quite a heap to evaluate on each call; I was thinking of a reduced problem: is it worth just trusting the OS to balance things out if we give at least 2 pages per CPU core, or is it worth doing some alignment check in the work splitter? The first is kind of easy, in interface/gemv.c:

nthreads = num_cpu_avail(2);
// here nthreads = min(nthreads, m*n/pagesize<<?)
if (nthreads == 1) {
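
Spelled out, that additional constraint could look roughly like this. num_cpu_avail() and FLOAT are the existing OpenBLAS internals referenced above; the 2-pages-per-core constant is just the guess under discussion, not a tuned value:

/* sketch: cap the thread count so that every core gets at least
   MIN_PAGES_PER_CORE pages of matrix data to work on */
#define PAGE_BYTES         4096
#define MIN_PAGES_PER_CORE 2

nthreads = num_cpu_avail(2);

size_t work_bytes = (size_t)m * n * sizeof(FLOAT);              /* input matrix */
int    max_useful = (int)(work_bytes / (MIN_PAGES_PER_CORE * PAGE_BYTES));
if (max_useful < 1) max_useful = 1;
if (nthreads > max_useful) nthreads = max_useful;

if (nthreads == 1) {
    /* existing single-threaded path */
}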

eastwindow commented 1 year ago

Not for each multiplication, just once for a given matrix. I think there are many applications like mine that multiply new vectors with the same matrix over and over. NUMA balancing by the OS is often switched off in low-latency, low-jitter applications.

martin-frbg commented 1 year ago

How is OpenBLAS to know that it is "the same matrix"? If your application has that information, maybe calling openblas_set_num_threads() with the precalculated optimum number of threads would help?

eastwindow commented 1 year ago

How is OpenBLAS to know that it is "the same matrix"?

If you only pass a matrix returned by the custom (NUMA-optimized) allocator into the custom (NUMA-optimized) GEMV, isn't that automatic? It only matters that the matrix and the threads are NUMA-optimized in the same way. Alternatively, some bookkeeping mechanism would be possible, too.

If your application has that information, maybe calling openblas_set_num_threads() with the precalculated optimum number of threads would help?

I'm not sure I get the idea. I have played with the number of threads and found that OpenBLAS doesn't scale well and becomes slower at some point. If I knew what OpenBLAS did internally, I could try to create a better-optimized matrix allocation, but it would be much nicer if OpenBLAS could do that itself :)

brada4 commented 1 year ago

low latency

You cannot be good for everyone; your cpuset trick falls under this non-default discipline of computing.

No, it is not about reinventing allocation or rewriting the process scheduler; it is more like optimizing each side of a memcpy() to make it better in the common cases.

eastwindow commented 1 year ago

low latency

You cannot be good for everyone

This issue isn't about low latency but about throughput and scalability. I'm just saying that one should not rely on NUMA balancers, because a lot of people don't use them.

your cpuset trick falls under this non-default discipline of computing.

CPU affinity is standard in HPC and fast NUMA.

No, it is not about reinventing allocation or rewriting the process scheduler; it is more like optimizing each side of a memcpy() to make it better in the common cases.

Are fast MVMs on large SMP computers not a common use-case for OpenBLAS?

brada4 commented 1 year ago

Your single CPU is huge compared to the current heuristics: it has 256 MB of L3 cache while your computation is 128 MB of data in total, so cache thrashing occurs when all threads are activated. If you can come up with a better CPU-usage estimator, you are more than welcome.

eastwindow commented 1 year ago

I'm using 32-bit floats, so the matrix is actually only 64 MB, corresponding to the L3 cache of 4x4 cores (16 MB per complex of 4 cores).

brada4 commented 1 year ago

The CPU cache is write-back, a.k.a. write-behind, so the result matrix counts towards the cache fill (in anomalous cases the vector does too). The question is how much data, "in a typical 2023 case", should be exclusive to a CPU core, initially without digging into cache complexes/uncore/hyperthreads etc. Any improvement in your setup will be duly followed up. I, for example, only have access to 4-12 core CPUs with a shared L3 cache, so I cannot really measure this.

eastwindow commented 1 year ago

"Result matrix"? are you still referring to gemv? I think in 2023+, we need to be able to adapt to different CPUs with different cache and NUMA layouts. One single call into sgemv() is probably not able to provide peak performance in this day and age. I think it would be good to have some "setup" calls where users can partition the tasks, both into data chunks and affinities. Or defining the cache and memory layouts at compile time, perhaps even create binary code for specific matrix sizes in advance, or JIT compile them :)

If you have any test that you would like results from on the AMD Zen 2, I can probably run it for you.

If you could point me to the code where OpenBLAS splits the MVM into threads and decides which thread gets which chunk, I'd be curious to see whether I can understand it, modify it, and perhaps improve locality. I don't really know much about efficient GEMV algorithms or about how OpenBLAS works internally, though...

brada4 commented 1 year ago

The input matrix is not mandated to be square... but that is probably a very small thing to consider in the average case.