OpenMathLib / OpenBLAS

OpenBLAS is an optimized BLAS library based on GotoBLAS2 1.13 BSD version.
http://www.openblas.net
BSD 3-Clause "New" or "Revised" License

performance on AMD Ryzen and Threadripper #1461

tkswe88 opened this issue 6 years ago

tkswe88 commented 6 years ago

This report comes straight from #1425, where the discussion drifted off from thread safety in openblas v. 0.2.20 to performance on AMD Ryzen and Threadripper processors (in this particular case a TR 1950X). It seems worthwhile to discuss this in a separate thread. Up to now we have had the following discussion:

@tkswe88 : After plenty of initial tests with the AMD TR 1950X processor, it seems that openblas (tested 0.2.19, 0.2.20 and the development version on Ubuntu 17.10 with kernel 4.13, gcc and gfortran v. 7.2) operates roughly 20% slower on the TR1950X than on an i7-4770 (4 cores) when using 1, 2 and 4 threads. This is somewhat surprising given that both CPUs use at most AVX2 and thus should be comparable in terms of vectorisation potential. I have already adjusted the OpenMP thread affinity to rule out the (hidden) NUMA architecture of the TR1950X as the cause of its lower performance. Other measures I took were 1) a BIOS upgrade, 2) a Linux kernel upgrade to 4.15, 3) an increase of the DIMM frequency from 2133 to 2666 MHz. Except for the latter, which gave a speedup of roughly 3%, these measures did not have any effect on execution speed. Do you have any idea where the degraded performance on the TR1950X comes from? Is this related to a current lack of optimization in openblas, or do we just have to wait for the next major release of gcc/gfortran to fix the problem? Of course, I would be willing to run tests if that would help in developing openblas.

@brada4 : AMD has slightly slower AVX and AVX2 units per core, by no means slow in general, and it still has a heap of cores to spare. Sometimes optimal AVX2 saturation means dropping the whole CPU package to its base, i.e. non-turbo, frequency.

@martin-frbg: Could also be that getarch is mis-detecting the cache sizes on TR, or that the various hardcoded block sizes from param.h for loop unrolling are "more wrong" on TR than they were on the smaller Ryzen. Overall support for the Ryzen architecture is currently limited to treating it like Haswell, see #1133, #1147. There may be other assembly instructions besides AVX2 that are slower on Ryzen (#1147 mentions movntp*).

@tkswe88: Are there any tests I could do to find out about cache size detection errors or more appropriate settings for loop unrolling?

@brada4: What you asked martin - the copied parameters may need to be doubled or halved, at least here: https://github.com/xianyi/OpenBLAS/pull/1133/files#diff-7a3ef0fabb9c6c40aac5ae459c3565f0

@martin-frbg: You could simply check the autodetected information in config.h against the specification. (As far as I can determine, L3 size seems to be ignored as a parameter). As far as appropriate settings go, the benchmark directory contains a few tests that can be used for timing individual functions. Adjusting the values in param.h (see https://github.com/xianyi/OpenBLAS/pull/1157/files) is a bit of a black art though.

yubeic commented 6 years ago

@tkswe88 Thanks a lot for your patient answer; I was surprised how detailed it was. I have played with the 1950x for a while and did more benchmarking. My current conclusion is that the software is still behind:

  1. The openBLAS release predates the Threadripper release.
  2. Several toolchains, including numpy, pytorch and tensorflow, all have a similar issue when it comes to fast prototyping of our models.
  3. Blis is not there yet either.

I feel that with some more effort maybe Threadripper will work as well as an Intel system, but my time is seriously limited. :p I will definitely try an AMD system once the software is ready. We had a discussion in the lab and all think the openBLAS project is a very important one. We hope more funding will be provided to support this line of research and development.

So I will switch to an i9-7960x setup instead to avoid too much tuning, and all of the servers (specifically for machine learning) in our lab will be built on the Intel platform for the moment.

Also huge thanks to @brada4 @martin-frbg, the comments are very informative.

brada4 commented 6 years ago

@yubeic

  1. You have to compile the github development version. 0.2.20 pre-dates Threadripper, Zen, and Skylake. Setting OPENBLAS_CORETYPE=Haswell may fix up recent pre-packaged binaries.

  2. Mind sharing whether it is OSX or Windows? Are you 100% sure everything else is at fault, and not your own code?

yubeic commented 6 years ago

@brada4 I only did some very basic benchmarks, containing only matrix-vector multiplication, matrix-matrix multiplication, SVD, symmetric matrix eigendecomposition etc. All of our systems run Ubuntu 16.04.2-16.04.4 LTS. In my benchmarks, I tried many different problem sizes with both double and single precision in both numpy and pytorch. I think the code is fair, but I'm almost sure I didn't compile it correctly. I didn't even compile Pytorch with openblas due to time limits, so it runs much slower than openblas numpy on Threadripper. I figured it probably uses mkl natively from the current official channel. I will try my best to benchmark more according to your suggestion. I do have to say that it might be better if AMD could provide some good tutorial on how to achieve competitive scientific-computing performance on these high-end CPUs. But my peers think this is too small a market, which is reasonable. It doesn't look like there is much of an effort on the software side to make Threadripper great for scientific computing. We were seriously thinking about building more AMD-based servers, given the amazing specs. It's too bad that I don't have much more time to investigate this. However, the other theorists don't even bother trying to do something like this .. my peers were all waiting to see my results. :thinking: AMD is at the hardware level, we are at the application level, and the programming-interface gap is daunting. That's why I started my question with whether there is hope of achieving 80% of 7900x performance on linear algebra with mild effort; if so, it would make me more comfortable. I have to say my quick impression is not positive.. I will try to make a benchmark comparison between the i9 7960x and the 1950x later.

Btw, how far off is a Threadripper-optimized openBLAS release? I think a lot of people wonder whether 1950x + openblas is almost equivalent to i9-7960x + mkl when it comes to linear algebra. BLAS really should be Threadripper's territory at first look. The only few benchmarks out there look really bad.

martin-frbg commented 6 years ago

Please be aware that OpenBLAS is a volunteer effort with little and sporadic outside funding. I am still not aware of any "cheat sheet" type document that tells exactly what to change in existing "Intel-biased" assembly to achieve comparable performance on AMD.

brada4 commented 6 years ago

LTS Ubuntu is unlikely to get any OpenBLAS upgrades past what was shipped with the distribution (0.2.18 -> 0.3.0). Enterprise-grade support is not orchestrated from this "upstream source", btw.

Please confirm the badness with at least one of the following (see the sketch after this list):

The Ubuntu pre-packaged 0.2.18 will work (better) on Zen CPUs if you set OPENBLAS_CORETYPE=Haswell

You can use the develop version from github (if no git or svn is at hand, just use the green button to download a .zip): https://github.com/xianyi/OpenBLAS/wiki/faq#debianlts
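For the numpy case, a minimal sketch of the first option (this assumes numpy is linked against a DYNAMIC_ARCH OpenBLAS build and that the variable is set before the library gets loaded; the matrix size is just an example):

import os
import time

# OPENBLAS_CORETYPE must be set before OpenBLAS is loaded, i.e. before the first
# numpy import; it only has an effect on DYNAMIC_ARCH builds such as the distro packages.
os.environ["OPENBLAS_CORETYPE"] = "Haswell"

import numpy as np

a = np.random.rand(4000, 4000)
t0 = time.time()
np.dot(a, a)
print("4Kx4K dgemm: %.3f seconds" % (time.time() - t0))

Run once with and once without the environment variable and compare the timings.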

yubeic commented 6 years ago

@martin-frbg I understand, and that's why I think some corporation should seriously stand behind this project, especially now that tensor computation is increasingly important. The libraries and programming models are too important to ignore, which is true for both Vega and Ryzen.

@brada4 This sounds interesting and I will play with it tomorrow. Thanks guys!

yubeic commented 6 years ago

@brada4 What's the 'best' compilation configuration and environment setting for Threadripper 1950x? I will use the develop branch and compile it again with the optimal configuration. Then I will benchmark more and report back the numbers.

I currently just compile openBLAS from the develop branch and set the environment variables to OPENBLAS_CORETYPE=Haswell and OPENBLAS_NUM_THREADS=16. Please let me know if anything there is wrong, thanks!

Also I might compile numpy with Blis to compare further.

brada4 commented 6 years ago

One was enough. I don't see numeric output from you... Visible slowness would be something like a 10x regression, which I doubt.

yubeic commented 6 years ago

@brada4 I was able to compile numpy (newest stable release) with the python 2.7.12 shipped with Ubuntu 16.04.4. But Anaconda python keeps complaining about some unrecognized setuptools commands. Along the way, I found that compilation is needed for many different packages including CuPy, PyTorch, Torchvision, scipy, and many others ...

Once the bugs are fixed, numeric results will follow. Anyway, here are a few early numbers. Comparison:

Platform 1: i7-6700K 4.2G + MKL, 4 threads, RAM: DDR4-2400 64GB, Dual Channel
Platform 2: 1950x 3.85G + openBLAS, 16 threads, OPENBLAS_CORETYPE=HASWELL (might be worth trying EXCAVATOR too for dgemm), RAM: DDR4-2666 128GB, Quad Channel

Task 1: Large Matrix MM, size 20000 x 20000, double precision: P1: 36.04sec P2: 22.70sec

Task 2: Medium Matrix Eigendecomposition, 4000x4000, double precision: P1: 21.61sec P2: 35.24sec

Overall, the performance on 1950x has improved quite a bit, especially the matrix decomposition. But since openBLAS uses netlib's LAPACK, I'm not sure if the poor factorization performance comes from that side or the binding.

But these tasks are quite memory bound, especially the first one. A more systematic benchmark report will be provided when I get some more time this week, also with Blis 0.95. I'm curious about the results. My i9-7960x system will be ready soon. Once it's up, I will include some benchmarks from it too.
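For reference, the timings above come from plain numpy calls; a minimal sketch of such a benchmark (my own reconstruction, not the exact script used in this thread, and the sizes are just examples) looks like:

import time
import numpy as np

def bench_mm(n):
    # time a square double-precision matrix-matrix multiply (dgemm)
    a = np.random.rand(n, n)
    b = np.random.rand(n, n)
    t0 = time.time()
    np.dot(a, b)
    return time.time() - t0

def bench_eig(n):
    # time a symmetric eigendecomposition (dsyevd via numpy.linalg.eigh)
    a = np.random.rand(n, n)
    a = a + a.T
    t0 = time.time()
    np.linalg.eigh(a)
    return time.time() - t0

for n in (1000, 4000, 8000):
    print("%dx%d MM : %.3f seconds" % (n, n, bench_mm(n)))
    print("%dx%d Eig: %.3f seconds" % (n, n, bench_eig(n)))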

brada4 commented 6 years ago

Neither task is memory bound at large matrix sizes (rule of thumb: roughly 10 flops can be done per memory I/O; they only become memory bound at very small parameter sizes).

_GEMM (Task 1) reads/writes O(n^2) memory while doing O(n^3) floating-point ops. _GE__ (Task 2), among other things, runs O(n^1.x) _GEMMs, thus a bit above O(n^3) in CPU and O(n^2) in memory. There are a dozen eigenvalue algorithms in LAPACK; probably MKL picks a better one than the one called here for this particular input matrix (size). OpenBLAS cannot influence LAPACK and override calls at that high a level.
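To put rough numbers on that rule of thumb, here is a back-of-the-envelope sketch (using the conventional 2*n^3 flop count for GEMM and 8 bytes per double, not figures measured in this thread):

n = 20000
flops = 2.0 * n**3           # multiply-add count for C = A * B in double precision
traffic = 3.0 * n**2 * 8     # touch A, B and C once each, 8 bytes per element
print(flops / traffic)       # roughly 1667 flops per byte of unavoidable memory traffic

So a 20Kx20K GEMM is far into compute-bound territory, which is why it mainly stresses the FMA units rather than DRAM bandwidth.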

When the ZEN branch was created, it was found that the Haswell template performs better out of the box than the Excavator one. Since you built a DYNAMIC_ARCH library, you can try both on the spot.

What we found when examining the numbers is that OpenBLAS performs reasonably well. Anaconda's OpenBLAS (again, not orchestrated from this project) is the "nomkl" package; it needs the coretype set to Haswell.

yubeic commented 6 years ago

@brada4 Thanks for sharing the information. I found the 1950x's performance to be very volatile. Part of the reason might be that the 1950x drops its speed to about 2.1GHz by default. I have to tweak some BIOS feature to get a more stable result.

tkswe88 commented 6 years ago

@yubeic For some reason your latest benchmarks do not seem to show on the github page. Nevertheless, I did some benchmarking in python, following https://stackoverflow.com/questions/29559338/set-max-number-of-threads-at-runtime-on-numpy-openblas to ensure that I used the right openblas library. I tested different numbers of threads and got the following results on my TR1950X (3.6 GHz core speed, 2666 MHz quad-channel DRAM, NUMA setup):

16 threads: 20Kx20K,MM:--- 40.8606879711 seconds --- 10Kx10K,MM:--- 5.248939991 seconds --- 8Kx8K,MM:--- 2.93858385086 seconds --- 4Kx4K, MM: --- 0.43017911911 seconds --- 4Kx4K, Eig:--- 30.7862281799 seconds --- 1Kx1K, MM: --- 0.0127520561218 seconds --- 1Kx1K, Eig:--- 0.770863056183 seconds ---

8 threads: 20Kx20K,MM:--- 79.3266539574 seconds --- 10Kx10K,MM:--- 9.90933585167 seconds --- 8Kx8K,MM:--- 5.13482499123 seconds --- 4Kx4K, MM: --- 0.669865846634 seconds --- 4Kx4K, Eig:--- 28.7336189747 seconds --- 1Kx1K, MM: --- 0.0174260139465 seconds --- 1Kx1K, Eig:--- 0.674449920654 seconds ---

4 threads: 20Kx20K,MM:--- 154.140152931 seconds --- 10Kx10K,MM:--- 19.7171549797 seconds --- 8Kx8K,MM:--- 9.97354102135 seconds --- 4Kx4K, MM: --- 1.27706003189 seconds --- 4Kx4K, Eig:--- 31.2966649532 seconds --- 1Kx1K, MM: --- 0.0288860797882 seconds --- 1Kx1K, Eig:--- 0.646577119827 seconds ---

I am somewhat puzzled that the 20k*20k matrix multiplication on my system is even worse than on yours. Anyhow, it is clear that the lapack/openblas routine for eigen decomposition called from python is not very well parallelised. Here, it is really important to pick the right one (see my previous comment, @martin-frbg and @brada4 may help).

Interestingly, I had to tune some OpenMP parameters (thread affinity) to get these results:

export OMP_DISPLAY_ENV='true'
export OMP_PROC_BIND='close'
export OMP_PLACES='{0,16},{1,17},{2,18},{3,19},{4,20},{5,21},{6,22},{7,23},{8,24},{9,25},{10,26},{11,27},{12,28},{13,29},{14,30},{15,31}'

Without these settings, using 16 threads, the best result I got was: 20Kx20K,MM:--- 49.4122049809 seconds --- 10Kx10K,MM:--- 6.79072284698 seconds --- 8Kx8K,MM:--- 2.77755713463 seconds --- 4Kx4K, MM: --- 0.380570888519 seconds --- 4Kx4K, Eig:--- 34.6536951065 seconds --- 1Kx1K, MM: --- 0.0201449394226 seconds --- 1Kx1K, Eig:--- 0.839740037918 seconds ---

Generally, I would recommend narrowing down the comparison by using openblas on the i7-6700 system, too. Otherwise, you will have difficulty understanding which parts of the runtime differences are related to the architectures (TR1950X vs i7-6700) and which to the libraries (mkl and openblas).

When you suspect thermal throttling to be a problem, it is helpful to monitor core speeds. I use an NH-U14S TR4-SP3 air cooler and can run stably at 3.6 and 3.75 GHz without throttling, it seems.

brada4 commented 6 years ago

If you have heat dissipation problems you cannot expect any performance at all. Fix the thermal setup first. Can you check with numactl -H whether the NUMA table is set up correctly, i.e. two NUMA pseudo-nodes? A failure here is the mobo maker's fault, and costs you in the range of 10-20% RAM bandwidth in the average case, more like 80-90% in the worst case.

yubeic commented 6 years ago

@tkswe88 @brada4 Yesterday the numbers were quite volatile, partially because I didn't turn off the Cool'n'Quiet CPU feature. This feature dials down the clock speed very often. After I turned it off, the performance is now much more stable. Here is the number report again:

Platform 1: i7-6700K 4.2G + MKL, DDR4-2400 64GB Dual Channel
30Kx30K,MM:--- 121.266254902 seconds --- 20Kx20K,MM:--- 36.3553829193 seconds --- 10Kx10K,MM:--- 4.55528593063 seconds --- 8Kx8K,MM:--- 2.26043701172 seconds --- 4Kx4K, MM: --- 0.312715053558 seconds --- 4Kx4K, Eig:--- 21.2994120121 seconds --- 1Kx1K, MM: --- 0.00685691833496 seconds --- 1Kx1K, Eig:--- 0.343578100204 seconds ---

Platform 2: i7-6850K 3.6G + MKL, DDR4-2400 64GB Quad Channel
20Kx20K,MM:--- 27.84059715270996 seconds --- 4Kx4K, MM: --- 0.3708188533782959 seconds --- 4Kx4K, Eig:--- 21.92760944366455 seconds --- 1Kx1K, MM: --- 0.006255626678466797 seconds --- 1Kx1K, Eig:--- 0.40999412536621094 seconds ---

Platform 3: 1950x 3.8G + openBLAS, DDR4-2666 128GB Quad Channel
30Kx30K,MM:--- 69.3856248856 seconds --- 20Kx20K,MM:--- 22.5743000507 seconds --- 10Kx10K,MM:--- 3.56968283653 seconds --- 8Kx8K,MM:--- 2.06472492218 seconds --- 4Kx4K, MM: --- 0.443926095963 seconds --- 4Kx4K, Eig:--- 31.6315670013 seconds --- 1Kx1K, MM: --- 0.00886201858521 seconds --- 1Kx1K, Eig:--- 0.971400976181 seconds ---

I included an even larger matrix, 30Kx30K. It seems the performance of the 1950x on large matrices is almost 2x that of the i7. This worries me a bit, since the speed gain might mainly come from the memory bandwidth. The MM performance on my machine is generally much faster (except for the 4Kx4K one) than @tkswe88's, but my SVD performance is worse. I agree that the scalability of SVD is a big issue. I just followed the general recipe to compile numpy with openBLAS via the site.cfg file. I also included some numbers from an i7-6850K platform since it also has quad-channel memory; it seems it only outperforms the 6700K platform when the matrices get large. Even without overclocking, assuming linear scalability, it should still outperform the 6700K. I'm confused by these results now.

One thing about the eigendecomposition is that although 16 threads are running, the CPU utilization doesn't seem to be very high; the following are the utilization curves of each of the cores during the 4Kx4K eigendecomposition:

[screenshot: per-core CPU utilization during the 4Kx4K eigendecomposition]

Heat might be an issue, but given that I only have a Corsair H100i v2 for this CPU, I can only turn down the OC to make things better. I wasn't able to install the thermal sensor chip driver for this mobo, so currently I cannot measure the internal temperature exactly. The cooler's sensor reads fine, below 58 C.

When it comes to matrix factorization and small matrices, the i7 platform still wins by a non-trivial margin. I printed the NUMA info here:

available: 1 nodes (0)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
node 0 size: 128806 MB
node 0 free: 122155 MB
node distances:
node   0
  0:  10

This doesn't seem right. Do you know how to fix it? @brada4

Next I compiled numpy with BLIS (the 0.95 beta MT version from the AMD website) on the 1950x, and set BLIS_NUM_THREADS=16. The results are .. abysmal .. which makes me believe I must have compiled it in a very dumb way:
30Kx30K,MM:--- 209.907198191 seconds --- 20Kx20K,MM:--- 62.3285088539 seconds --- 10Kx10K,MM:--- 7.86016511917 seconds --- 8Kx8K,MM:--- 4.07193017006 seconds --- 4Kx4K, MM: --- 0.58310008049 seconds --- 4Kx4K, Eig:--- 337.510659933 seconds --- 1Kx1K, MM: --- 0.0233819484711 seconds --- 1Kx1K, Eig:--- 5.63355708122 seconds ---

Numpy BLAS link info:

lapack_info: NOT AVAILABLE
lapack_opt_info: NOT AVAILABLE
openblas_lapack_info: NOT AVAILABLE
atlas_threads_info: NOT AVAILABLE
openblas_clapack_info: NOT AVAILABLE
atlas_3_10_threads_info: NOT AVAILABLE
lapack_src_info: NOT AVAILABLE
blas_mkl_info: NOT AVAILABLE
blas_opt_info:
    language = c
    define_macros = [('HAVE_CBLAS', None)]
    runtime_library_dirs = ['/home/yubeic/SysBoost/blis/lib']
    libraries = ['blis', 'blis']
    library_dirs = ['/home/yubeic/SysBoost/blis/lib']
    include_dirs = ['/home/yubeic/SysBoost/blis/include/blis']
blis_info:
    language = c
    define_macros = [('HAVE_CBLAS', None)]
    runtime_library_dirs = ['/home/yubeic/SysBoost/blis/lib']
    libraries = ['blis', 'blis']
    library_dirs = ['/home/yubeic/SysBoost/blis/lib']
    include_dirs = ['/home/yubeic/SysBoost/blis/include/blis']
atlas_info: NOT AVAILABLE
atlas_3_10_info: NOT AVAILABLE
lapack_mkl_info: NOT AVAILABLE

I think there is no parallel LAPACK linked, so the eigendecomposition numbers should be ignored. I have to say it's not that easy to get these working ..
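A quick way to check what numpy actually linked against is numpy's own build info (standard numpy API, shown as a minimal sketch):

import numpy as np

# Prints the BLAS/LAPACK sections numpy was built with; if all lapack_*_info
# entries say NOT AVAILABLE, eig/svd fall back to the bundled, single-threaded
# lapack_lite, which would explain abysmal eigendecomposition timings.
np.show_config()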

brada4 commented 6 years ago

Choose the best option that works for you. You must contact the BIOS vendor to get NUMA set up properly, or try fake NUMA: http://linux-hacks.blogspot.com.es/2009/07/fake-numa-nodes-in-linux.html (It was seen that Threadripper cores are not numbered in order like Intel's, so it might be tough to get the right config for the +20% perf bonus.) In case you get NUMA right, you may try smaller samples on half the CPU, i.e. one NUMA node.

Result should look something like this (this is a virtual machine with all numbers round and beautiful):

$ numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11
node 0 size: 64511 MB
node 0 free: 41721 MB
node 1 cpus: 12 13 14 15 16 17 18 19 20 21 22 23
node 1 size: 62464 MB
node 1 free: 45089 MB
node distances:
node   0   1
  0:  10  20
  1:  20  10

i.e. it reflects the 2 CPU halves with independent memory channels; your cores will not be in counting order.

tkswe88 commented 6 years ago

@yubeic To switch between NUMA and UMA you can try two options: 1) in the BIOS, if the mobo manufacturer provides this functionality (the ASUS Prime X399-A mobo I use does not); 2) the AMD Ryzen Master Utility, which is only for Windows. I used this option even though I had to beg our sysadmin for a hard disk to install Windows on. To me it does not seem a good idea to fake NUMA nodes as @brada4 suggested. The TR1950X is naturally a NUMA architecture and the UMA functionality provided by AMD is just an emulation. So faking NUMA nodes in a UMA emulation is faking a fake, in a sense.

I am unsure what thread numbering problem @brada4 refers to, but I always get this numbering using numactl -H:

available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 16 17 18 19 20 21 22 23
node 0 size: 32117 MB
node 0 free: 25690 MB
node 1 cpus: 8 9 10 11 12 13 14 15 24 25 26 27 28 29 30 31
node 1 size: 32221 MB
node 1 free: 27757 MB
node distances:
node   0   1
  0:  10  16
  1:  16  10

When limiting jobs to run on 4 threads and taking care that these 4 threads have affinity to just one CCX unit (which has 4 cores), I get somewhat (20% in some cases) improved performance on bandwidth-limited operations. The improvements are consistently lower when using 8 specific cores (distributed over two CCXs, that is) and vanish when I do not care about thread affinity.

In short, the threads are distributed over the CCXs as

CCX0: 0 1 2 3 16 17 18 19
CCX1: 4 5 6 7 20 21 22 23
CCX2: 8 9 10 11 24 25 26 27
CCX3: 12 13 14 15 28 29 30 31

so getting the numbering for thread affinity right is quite simple.
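Assuming an OpenMP build of OpenBLAS (so that the OMP_* variables are honoured), pinning a 4-thread numpy job to CCX0 from Python could look roughly like this sketch:

import os

# These must be set before numpy/OpenBLAS (and thus the OpenMP runtime) is loaded.
os.environ["OMP_NUM_THREADS"] = "4"
os.environ["OMP_PROC_BIND"] = "close"
# CCX0 holds cores 0-3 and their SMT siblings 16-19 (numbering as listed above).
os.environ["OMP_PLACES"] = "{0,16},{1,17},{2,18},{3,19}"

import numpy as np   # OpenBLAS now starts with its threads confined to CCX0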

brada4 commented 6 years ago

@tkswe88: Yup, the same 20% I am talking about. I think even if fake NUMA can be set, @yubeic must still cross-check the performance results, because I have some doubts about that config: if the BIOS sets memory to interleaved mode, he will not get full NUMA, just some partial effect from affining CPU caches closer to what is on the chip. The last crumb of performance should be squeezed from the mobo vendor shipping a proper NUMA config in the BIOS/UEFI.

yubeic commented 6 years ago

@brada4 @tkswe88 I checked the motherboard; the MSI X399 Carbon has an option to set memory interleaving to channel, which should switch from UMA to NUMA. Some discussions say it doesn't work properly; I will try and test.

brada4 commented 6 years ago

You should compare simple (gemm) timings with @tkswe88; a 20% difference means it does not work properly. You need a memory module in each channel; the motherboard guide usually explains the RAM slot population order. 4 channels means 4+ RAM modules....

yubeic commented 6 years ago

@brada4 I'm able to get the NUMA setting to:

available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 16 17 18 19 20 21 22 23
node 0 size: 64307 MB
node 0 free: 51021 MB
node 1 cpus: 8 9 10 11 12 13 14 15 24 25 26 27 28 29 30 31
node 1 size: 64498 MB
node 1 free: 51840 MB
node distances:
node   0   1
  0:  10  16
  1:  16  10

But the performance of the benchmark becomes worse than in UMA mode, which is really confusing. This change somehow also causes other applications to show a periodic large latency. There are other memory interleaving options besides channel.

yubeic commented 6 years ago

I just did another benchmark on i9-7960x + mkl (default version shipped with Anaconda, MKL 2018.0.2). The following is the result:

Platform 4: 7960x 2.8G + MKL 2018.0.2, DDR4-2666 128GB Quad Channel
30Kx30K,MM:--- 34.8145999908 seconds --- 20Kx20K,MM:--- 11.1714441776 seconds --- 10Kx10K,MM:--- 1.65482592583 seconds --- 8Kx8K,MM:--- 0.960442066193 seconds --- 4Kx4K, MM: --- 0.162860155106 seconds --- 4Kx4K, Eig:--- 17.041315794 seconds --- 1Kx1K, MM: --- 0.00704598426819 seconds --- 1Kx1K, Eig:--- 0.410210847855 seconds ---

After increasing the VCPUIN voltage from about 1.8V to 2.0V, the result is roughly:
30Kx30K,MM:--- 26.9505310059 seconds --- 20Kx20K,MM:--- 8.86561799049 seconds --- 10Kx10K,MM:--- 1.42309403419 seconds --- 8Kx8K,MM:--- 0.801309108734 seconds --- 4Kx4K, MM: --- 0.151351213455 seconds --- 4Kx4K, Eig:--- 15.6785330772 seconds --- 1Kx1K, MM: --- 0.00755000114441 seconds --- 1Kx1K, Eig:--- 0.390221118927 seconds ---

LukePogaPersonal commented 6 years ago

So Threadripper is half as fast as a 7960x. Is that about as good as it will ever get?

brada4 commented 6 years ago

All of Zen re-uses the Haswell code, which was measured to be better than Excavator; there is nothing specifically tuned for Zen yet. I don't see anything unfair in double the speed for double the price, for now.

the "periodic large latency" comes from forced process migration between numa nodes. Use one numa node and it will perform better than UMA 8 cores.

tkswe88 commented 6 years ago

Just a few comments:

@yubeic : When using NUMA, you have to make sure that the threads and the associated data (the first-touch principle applies) do not start moving across the NUMA nodes and CCXs, by ensuring thread affinity. When using only four threads and making sure that they stay on the same CCX, the performance should be better than with 4 threads in UMA mode. So this is basically the same advice @brada4 has given.

@yubeic and @LukeSBE : The TR 1950X has two 128-bit AVX FMA units, whereas the i9-7960X has two AVX-512 FMA units. Given the heat throttling that Intel processors apply in AVX-512 mode, one might have expected the i9-7960X to be about a factor of three faster than the TR1950X at the same clock speed. Taking into account the clock speeds of 3.8 GHz (TR) and 2.8 GHz (i9; did the i9 run at 2.8 GHz using AVX-512 on all 16 cores?), the TR 1950X does slightly better than I expected in a comparison that emphasises the vector units. This may (or may not) be related to the larger L2 cache in the TR1950X. In less vectorised workloads the TR 1950X can be as fast as the i9-7960X (or possibly a little faster), depending on the case. So the question is: do you multiply matrices and vectors all day long? If so, and money does not matter, the Intel processors may be the better choice. However, using GPUs might then be even more appropriate.

brada4 commented 6 years ago

@tkswe88 regarding taskset: on modern Linux kernels a rudimentary NUMA node balancer exists, so in principle, after a few warmup rounds, a process sized for one NUMA node (or less) ends up in a good placement. E.g. check the balancer configuration of the running kernel:

grep NUMA_BALANCING /boot/config-`uname -r`

or enable it now and persist it across boots, if the respective sysctl interface is present:

sysctl kernel.numa_balancing=1 >> /etc/sysctl.conf

yubeic commented 6 years ago

All of the following performance numbers should not be treated as overclocked performance. The overclocking claim was misleading; please read the article provided by @brada4 in the later comments. I was trying to overclock the CPU, but it never ran at an overclocked rate during the all-core numerical test.

@tkswe88 I slightly overclocked the 7960x to "3.8GHz" too, to make a comparison. ("3.8GHz" might be false, since once all cores start to run the frequency drops to about 3.0GHz, so this overclocking was not successful.)

30Kx30K,MM:--- 26.9505310059 seconds --- 20Kx20K,MM:--- 8.86561799049 seconds --- 10Kx10K,MM:--- 1.42309403419 seconds --- 8Kx8K,MM:--- 0.801309108734 seconds --- 4Kx4K, MM: --- 0.151351213455 seconds --- 4Kx4K, Eig:--- 15.6785330772 seconds --- 1Kx1K, MM: --- 0.00755000114441 seconds --- 1Kx1K, Eig:--- 0.390221118927 seconds ---

The power draw at this frequency is huge, about 290 W when AVX kicks in. If overclocked to 4.2 GHz, the 30Kx30K MM can get to about 25.6 seconds, which is only a marginal improvement.

@tkswe88 In fact, from the information shown by turbostat, it seems that once AVX-512 is used on all cores, their frequency is throttled to 3.0-3.3 GHz (even when the whole processor is clocked to 3.8 GHz). But given that the power draw is already 290 W, I don't think there is much room for a higher AVX frequency.

I only use this workstation for numerical simulations. Otherwise I would probably just get an 8700K. Most of my processing is sent to the GPUs, but unfortunately some of my models need eigendecomposition. The eigendecomposition performance can easily become a bottleneck if I launch more than 2 processes; at that point, launching more GPU processes will slow down the running jobs. So the major problem with Threadripper was that the SVD and eigendecomposition were taking too long. Otherwise, it would still be usable.

In fact, even with this i9 CPU, SVD performance is still really bad. Here is the ibench test result ("3.8GHz"; in fact, the result is really stock performance ...):

Lu: N = 35000
Lu: elapsed 40.091505 gflops 712.952365
Lu: elapsed 39.271447 gflops 727.840086
Lu: elapsed 38.881067 gflops 735.147863
Lu: gflops 727.840086
Cholesky: N = 40000
Cholesky: elapsed 20.265417 gflops 1052.696484
Cholesky: elapsed 19.970937 gflops 1068.218948
Cholesky: elapsed 19.938883 gflops 1069.936228
Cholesky: gflops 1068.218948
Inv: N = 25000
Inv: elapsed 37.933589 gflops 823.808151
Inv: elapsed 38.142073 gflops 819.305235
Inv: elapsed 37.802702 gflops 826.660487
Inv: gflops 823.808151
Fft: N = 520000
Fft: elapsed 0.383207 gflops 128.831636
Fft: elapsed 0.368043 gflops 134.139768
Fft: elapsed 0.367436 gflops 134.361283
Fft: gflops 134.139768
Det: N = 30000
Det: elapsed 17.214028 gflops 1045.658801
Det: elapsed 16.712926 gflops 1077.010698
Det: elapsed 16.527369 gflops 1089.102565
Det: gflops 1077.010698
Svd: N = 10000
Svd: elapsed 108.278476 gflops 12.313928
Svd: elapsed 108.142802 gflops 12.329377
Svd: elapsed 108.045384 gflops 12.340493
Svd: gflops 12.329377
Dot: N = 10000
Dot: elapsed 1.473358 gflops 1357.443414
Dot: elapsed 1.455766 gflops 1373.847206
Dot: elapsed 1.454956 gflops 1374.611964
Dot: gflops 1373.847206
Qr: N = 10000
Qr: elapsed 5.096478 gflops 261.618580
Qr: elapsed 5.080871 gflops 262.422192
Qr: elapsed 5.087934 gflops 262.057906
Qr: gflops 262.057906

It seems SVD is very hard to parallelize and I probably will avoid using all cores on SVD problems since it's not scalable and it takes a lot of resources.

brada4 commented 6 years ago

You have 2 SVD algorithms in LAPACK: http://www.netlib.org/lapack/lug/node32.html Depending on your typical input, one may be faster than the other. Yes, they call GEMM on small matrices, in order, each depending on the previous GEMM's output, and OpenBLAS uses a single core for small GEMMs. That's not a performance problem, just a normal sign of life of the incremental-generation algorithms used for SVD.
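If scipy is available, both LAPACK SVD drivers can be timed directly on a representative matrix. A minimal sketch using scipy.linalg.svd's lapack_driver switch (which maps to gesdd/gesvd; the 4000x4000 size is just an example):

import time
import numpy as np
from scipy import linalg

a = np.random.rand(4000, 4000)

# 'gesdd' is the divide-and-conquer driver, 'gesvd' the classic QR-iteration one.
for driver in ("gesdd", "gesvd"):
    t0 = time.time()
    linalg.svd(a, lapack_driver=driver)
    print("%s: %.2f seconds" % (driver, time.time() - t0))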

If you overclock an x86_64 core, the AVX part gets clocked down, and if you overclock even more, then the uncore (caches etc.) clocks down. A fair comparison is at factory clocks, please. ARM and AMD have the same sort of self-regulation built in.

yubeic commented 6 years ago

@brada4 I dug into it a little further and found that my overclocking multiplier setting was not effective under Linux at all. Well, I can see through turbostat that the cores are allowed to reach 3.8GHz, but they never actually achieve this frequency under load. The performance improvement came purely from changing VCPUIN from 1.8V to 2.0V. 👎 So the above score is in fact stock turbo-boost performance. However, the performance I'm getting might be even worse than stock turbo-boost performance, since I have never seen the promised 4.4GHz single-core turbo boost at all; the cores are also bounded below 3.7GHz at turbo. I'm not sure why the overclocking under Linux is not successful; it might be a kernel bug (say, in the intel_pstate driver) or a motherboard firmware bug, I guess. Possibly the problem is in the Intel microcode, but I don't want to disable that, since Skylake-X has a few critical bugs, e.g. security issues and a hyper-threading crash.

brada4 commented 6 years ago

This is not an overclocking forum. Your benchmark measures the frequency on a single core with the others idle; that is absolutely misleading information. Spinning all cores, you get the stock base frequency. Example measurements for an older CPU series: https://www.microway.com/knowledge-center-articles/detailed-specifications-of-the-intel-xeon-e5-2600v4-broadwell-ep-processors/

yubeic commented 6 years ago

@brada4 I think you are right; my overclocking wasn't successful. I also did a bunch of benchmarks using phoronix and found that the scores I got are consistently below (usually by 2-5%) the scores others reported. So I'm going to delete the misleading benchmarks and only keep a stable but optimized stock-frequency result. Also thanks for the reference, it's a good read. Once I can solve the issue, I will update the info for a fair comparison. Do you know whether AVX-512 puts a hard cap on the all-core frequency? It seems that no matter what I do, the AVX-512 all-core frequency cannot go beyond 3GHz.

brada4 commented 6 years ago

This is not an overclocking forum....

martin-frbg commented 6 years ago

I guess data from overclocked systems will still be useful as long as they are marked as such. (And unless you run the latest-and-greatest develop branch, OpenBLAS will not make any use of AVX512 yet; even with develop you will only see it used for GEMM. For the AMD Zen architecture, BLIS should have a clear advantage, as nobody here has worked on Zen-specific code yet.)

brada4 commented 6 years ago

At some point of overclocking you start getting impressive timings while losing correctness, due to bit flips inevitably happening in the overclocked components. More voltage is needed to flip a semiconductor's state at higher frequencies, which in turn yields roughly frequency-squared thermal output and power consumption, while you have just one airflow to chill the CPU before it fries... Your mileage may vary, and that last smell from the CPU is not a pleasant or healthy one.

LukePogaPersonal commented 6 years ago

Zen has been around for 16 months, and OpenBLAS has not been updated to support it (I do not regard 50% speed per core vs Intel as supported). Is OpenBLAS being actively developed? There is a 32-core Threadripper coming out soon. It seems a waste of good hardware to have out there if no vector libraries take advantage of it!

martin-frbg commented 6 years ago

OpenBLAS is a volunteer project with few active developers and little if any external funding. Just sitting there shouting "do something for me" is unlikely to bring any useful results.

martin-frbg commented 6 years ago

From the results seen in https://github.com/xianyi/OpenBLAS/commit/6eb4b9ae7c7cc58af00ac21b52fed8810d7e5710 , it would probably make sense to explore increasing the SWITCH_RATIO for ZEN in param.h as well. Anybody with a Ryzen or Threadripper willing to try?

tkswe88 commented 6 years ago

I can run a test tomorrow. Would 4, 8, 16 and 32 be meaningful test values or would you like to see other values tested?

martin-frbg commented 6 years ago

I'd try 32 first, as that seemed to fit Haswell, and then a step to either side?

LukePogaPersonal commented 6 years ago

Apologies, I did not realise the TR FMA throughput was 4 times lower than Skylake-X's. The fact that it gets half the performance shows it is pretty well supported already.

martin-frbg commented 6 years ago

Probably need to try some kind of hybrid kernel between Haswell and Sandybridge, to see if we can avoid the relative AVX2 weakness of Zen. I do not have the hardware to try this myself at the moment however.

jcolafrancesco commented 5 years ago

Here are my results on a Threadripper 1900X. I've tested higher switch ratios as you requested, martin. ~I'm gaining a little by doing so (I'm repeatedly gaining 1 sec on 20Kx20K matrices), but it's not a game changer.~ (The following tests showed that those variations were within my error margin, so we should not conclude anything.)

OpenBLAS commit 71c6deed60c4b, Threadripper 1900X, Asus Prime X399, 4x8GB DDR4 2800MHz. I didn't touch anything related to UMA/NUMA, so I suppose I'm in UMA mode.

8 threads, standard config:
20Kx20K,MM:--- 76.631098 seconds --- 10Kx10K,MM:--- 9.659299 seconds --- 8Kx8K,MM:--- 5.007382 seconds --- 4Kx4K,MM:--- 0.638273 seconds --- 1Kx1K,MM:--- 0.013002 seconds ---

8 threads, SWITCH_RATIO=32:
20Kx20K,MM:--- 75.376612 seconds --- 10Kx10K,MM:--- 9.583702 seconds --- 8Kx8K,MM:--- 4.961513 seconds --- 4Kx4K,MM:--- 0.639170 seconds --- 1Kx1K,MM:--- 0.012943 seconds ---

8 threads with thread affinity (don't know why I've tested that! but no ...):
20Kx20K,MM:--- 76.205891 seconds --- 10Kx10K,MM:--- 10.457863 seconds --- 8Kx8K,MM:--- 4.982340 seconds --- 4Kx4K,MM:--- 0.636596 seconds --- 1Kx1K,MM:--- 0.012721 seconds ---

not inside with_thread(8):
20Kx20K,MM:--- 86.750650 seconds --- 10Kx10K,MM:--- 11.032390 seconds --- 8Kx8K,MM:--- 5.676041 seconds --- 4Kx4K,MM:--- 0.740049 seconds --- 1Kx1K,MM:--- 0.015212 seconds ---

My results and tkswe88's on 8 threads seem very close.

What I don't understand is yubeic's results on the 16-core 1950X:

20Kx20K,MM:--- 22.5743000507 seconds --- 10Kx10K,MM:--- 3.56968283653 seconds --- 8Kx8K,MM:--- 2.06472492218 seconds --- 4Kx4K, MM: --- 0.443926095963 seconds --- 4Kx4K, Eig:--- 31.6315670013 seconds --- 1Kx1K, MM: --- 0.00886201858521 seconds --- 1Kx1K, Eig:--- 0.971400976181 seconds ---

Far better than tkswe88's with the same processor; how can we explain such a disparity?

I'm ready to do some more testing if needed.

brada4 commented 5 years ago

Can you show numactl -H? Memory modules have to be installed on different memory channels for fake NUMA to do anything.

jcolafrancesco commented 5 years ago

Forgot to mention I'm on Linux 4.15.0-36-generic and the results are obtained through numpy.

Here it is.

available: 1 nodes (0)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
node 0 size: 32085 MB
node 0 free: 29216 MB
node distances:
node   0
  0:  10

I've just tested TARGET=SANDYBRIDGE. I thought it would be an interesting test because, similarly to Zen, Sandy Bridge only has 2x128-bit FMA per core (although Zen has 2 more 128-bit FADD units). Results:

20Kx20K,MM:--- 136.080165 seconds --- 10Kx10K,MM:--- 17.193669 seconds --- 8Kx8K,MM:--- 8.815149 seconds --- 4Kx4K,MM:--- 1.117620 seconds --- 1Kx1K,MM:--- 0.022499 seconds ---

This is not good.

Do you have an intuition about why the Haswell kernel handles Threadripper better than the Sandy Bridge one? By isolating this, we could try a hybrid kernel as martin suggested.

martin-frbg commented 5 years ago

Reopening for better visibility, but no immediate intuition. Perhaps compiler flags play a more important role than we assumed so far, default is just -O2 without any -march= setting.

jcolafrancesco commented 5 years ago

With -O3, keeping everything else at default:

20Kx20K,MM:--- 75.296078 seconds --- 10Kx10K,MM:--- 9.579083 seconds --- 8Kx8K,MM:--- 4.960849 seconds --- 4Kx4K,MM:--- 0.632421 seconds --- 1Kx1K,MM:--- 0.013105 seconds ---

With -O3 and SWITCH_RATIO=32:

20Kx20K,MM:--- 75.662260 seconds --- 10Kx10K,MM:--- 9.617999 seconds --- 8Kx8K,MM:--- 4.971955 seconds --- 4Kx4K,MM:--- 0.637424 seconds --- 1Kx1K,MM:--- 0.013046 seconds ---

I have to admit that I'm too lazy to run those experiments multiple times, so it's hard to conclude anything except that the changes are marginal; we are far from something that could explain the numbers obtained by yubeic.

Next experiment: changing memory interleaving to "channel" in the BIOS. I'm now observing:

numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 8 9 10 11
node 0 size: 15991 MB
node 0 free: 14926 MB
node 1 cpus: 4 5 6 7 12 13 14 15
node 1 size: 16093 MB
node 1 free: 15213 MB
node distances:
node   0   1 
  0:  10  16 
  1:  16  10 

which seems right.

But the results are noticeably worse on 20K matrices:

20Kx20K,MM:--- 79.732724 seconds --- 10Kx10K,MM:--- 10.347193 seconds --- 8Kx8K,MM:--- 5.059734 seconds --- 4Kx4K,MM:--- 0.638466 seconds --- 1Kx1K,MM:--- 0.013033 seconds ---

I reran the experiment multiple times and this seems quite stable. yubeic seems to have observed the same thing in his previous experiments:

@brada4 I'm able to get the NUMA setting to:

available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 16 17 18 19 20 21 22 23
node 0 size: 64307 MB
node 0 free: 51021 MB
node 1 cpus: 8 9 10 11 12 13 14 15 24 25 26 27 28 29 30 31
node 1 size: 64498 MB
node 1 free: 51840 MB
node distances:
node   0   1
  0:  10  16
  1:  16  10

But the performance of the benchmark becomes worse than UMA, which is really confusing. And this change somehow also makes other applications have a periodic large latency. There are other options of memory interleaving rather than channel.

brada4 commented 5 years ago

And with NUMA on (aka channel) and 8 threads (OPENBLAS_NUM_THREADS=8 ./testsuite)? Compiler optimisations make little difference, since the biggest part of the work is done by custom assembly code.

martin-frbg commented 5 years ago

Is your compiler getting -march=zen from its default configuration ? (Not that I expect much improvement from it, as the bottleneck is probably somewhere in the hand-coded assembly, but perhaps cpu-specific optimizer settings are more important than going from -O2 to -O3 ?)

jcolafrancesco commented 5 years ago

My compiler did not get -march automatically. I've tried with -march=znver1 (zen was not allowed) : no significant changes.

Where is testsuite located ?

What about SMT (it is activated for now), is there any recommendation? I will test ASAP.

martin-frbg commented 5 years ago

I can only refer to Agner Fog's analysis at https://www.agner.org/optimize/microarchitecture.pdf where (as I understand it) he comes to the conclusion that multithreading is more efficient than it was on earlier AMD and Intel designs (with the caveat that inter-thread communication should be kept within the same 4-cpu die if possible). Unfortunately I cannot identify any obvious bottlenecks in the current (Haswell) kernels even after reading his description of the microarchitectures.

martin-frbg commented 5 years ago

And I suspect brada4 was only using "testsuite" as shorthand for your current numpy code. (There are some test codes in the benchmark directory of the distribution and there is xianyi's BLAS-Tester project that is derived from the ATLAS test codes, but I am not aware of anything actually named testsuite)