OpenMathLib / OpenBLAS

OpenBLAS is an optimized BLAS library based on GotoBLAS2 1.13 BSD version.
http://www.openblas.net
BSD 3-Clause "New" or "Revised" License

performance on AMD Ryzen and Threadripper #1461

Open tkswe88 opened 6 years ago

tkswe88 commented 6 years ago

This report comes right from #1425, where the discussion drifted from thread safety in openblas v. 0.2.20 to performance on AMD Ryzen and Threadripper processors (in this particular case a TR 1950X). It seems worthwhile to discuss this in a separate thread. Until now we have had the following discussion:

@tkswe88 : After plenty of initial tests with the AMD TR 1950X processor, it seems that openblas (tested 0.2.19, 0.2.20 and the development version on Ubuntu 17.10 with kernel 4.13, gcc and gfortran v. 7.2) runs roughly 20% slower on the TR 1950X than on an i7-4770 (4 cores) when using 1, 2 and 4 threads. This is somewhat surprising, given that both CPUs support at most AVX2 and thus should be comparable in terms of vectorisation potential. I have already adjusted the OpenMP thread affinity to rule out the possibility that the (hidden) NUMA architecture of the TR 1950X causes the lower performance. Other measures I took were 1) a BIOS upgrade, 2) a Linux kernel upgrade to 4.15, and 3) an increase of the DIMM frequency from 2133 to 2666 MHz. Except for the latter, which gave a speedup of roughly 3%, these measures did not have any effect on execution speed. Do you have any idea where the degraded performance on the TR 1950X comes from? Is this related to a current lack of optimization in openblas, or do we just have to wait for the next major release of gcc/gfortran to fix the problem? Of course, I would be willing to run tests if this was of help in developing openblas.
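For reference, the thread affinity adjustments mentioned above were done with the standard OpenMP environment variables; the exact values below are only an illustration (and ./my_code is just a placeholder for the test program), not the precise settings used in the tests:

OMP_NUM_THREADS=4 OMP_PLACES=cores OMP_PROC_BIND=close ./my_code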

@brada4 : AMD has slightly slower AVX and AVX2 units per core, by no means slow in general, and it still has a heap of cores to spare. Sometimes optimal AVX2 saturation means the whole CPU package dropping to its base, i.e. non-turbo, frequency.

@martin-frbg: Could also be that getarch is mis-detecting the cache sizes on TR, or the various hardcoded block sizes from param.h for loop unrolling are "more wrong" on TR than they were on the smaller Ryzen. Overall support for the Ryzen architecture is currently limited to treating it like Haswell, see #1133, #1147. There may be other assembly instructions besides AVX2 that are slower on Ryzen (#1147 mentions movntp*).

@tkswe88: Are there any tests I could do to find out about cache size detection errors or more appropriate settings for loop unrolling?

@brada4: Regarding what you asked martin, the copied parameters may need to be doubled or halved, at least the ones here: https://github.com/xianyi/OpenBLAS/pull/1133/files#diff-7a3ef0fabb9c6c40aac5ae459c3565f0

@martin-frbg: You could simply check the autodetected information in config.h against the specification. (As far as I can determine, L3 size seems to be ignored as a parameter). As far as appropriate settings go, the benchmark directory contains a few tests that can be used for timing individual functions. Adjusting the values in param.h (see https://github.com/xianyi/OpenBLAS/pull/1157/files) is a bit of a black art though.
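As a concrete illustration of the benchmark-directory timing runs mentioned above (assuming a plain make in the benchmark directory builds the *.goto timing binaries and that the argument triple is the usual start/stop/step matrix sizes; adjust sizes and thread count to taste):

cd benchmark
make
OPENBLAS_NUM_THREADS=16 ./dgemm.goto 1024 16384 1024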

tkswe88 commented 6 years ago

@jcolafrancesco : Many thanks for the tests!

@martin-frbg and @brada4 : Sorry for not being in contact after promising tests. First, one of my RAM modules was broken. After that, I had too much teaching.

With one reservation, I can confirm @jcolafrancesco 's findings that changing the SWITCH_RATIO parameter has a pretty marginal effect. I tested this about 3 weeks ago with a freshly downloaded openblas version and one from spring. Here, the multi-threaded performance was worse in the recently downloaded version and had basically degraded to the level it was at before @martin-frbg fixed issue #1468 on February 28.

Unless this description raises immediate suspicions about the degraded performance, I can download the latest development version, recheck, and bisect.

Also, AMD published an updated blis/libflame version on their developer webpage a couple of months ago. I have not yet had time to look at how much this has improved Threadripper support.

tkswe88 commented 6 years ago

@jcolafrancesco : Regarding SMT, I have not seen any performance benefits from it on the TR 1950X. Going beyond 16 threads rather gave a slight performance degradation of a few per cent. Generally, I recommend that you test setting thread affinity in both UMA and NUMA mode to see what works best for your specific work tasks. From what I have seen, it is beneficial to set thread affinity and avoid having your data move through the Infinity Fabric between the two main parts of the CPU to different DIMM modules. For my typical workloads, it is fastest to run 2 or 4 simulations in parallel on 8 or 4 cores, respectively, and just do that in NUMA mode. I have never done any tests with SMT switched off, because I assumed that setting the thread affinity would be sufficient to avoid adverse effects.
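For illustration, the "2 or 4 simulations in parallel" setup can be pinned with numactl; the binary name ./simulation and the input names are placeholders, and the node numbers depend on how the NUMA topology is exposed on your system:

numactl --cpunodebind=0 --membind=0 ./simulation case1 &
numactl --cpunodebind=1 --membind=1 ./simulation case2 &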

martin-frbg commented 6 years ago

@tkswe88 memory.c should be basically the same as #1468 from the end of February (where you confirmed that the performance was back to earlier levels) - since then, there has been a very promising rewrite of the code with the intent to use thread-local storage. Unfortunately this new code had unexpected side effects at first, so I decided to make the "old" version the default again after two consecutive releases that were "broken" for some high-profile codes that indirectly rely on OpenBLAS. This makes the current memory.c twice as big as the original, with an ugly #ifdef to switch between the two implementations at compile time.

brada4 commented 6 years ago

20k × 20k × 8 bytes × 3 matrices makes 9.6 GB for double precision... Should get better with working RAM

jcolafrancesco commented 5 years ago

Yubeic also posted on AMD's website : https://community.amd.com/thread/228619

He reports the same results, and someone is comparing them with the theoretical limit he could reach with his Threadripper (canyouseeme, 23 May 2018, 04:35):

2 × 20k³ / 22.4687 s = 712.1 GFLOPS, but 3.85 GHz × 8 flops/cycle × 16 cores = 492.8 GFLOPS ... if it's double precision, your result exceeds the upper limit? I'm wondering if some algorithm like Strassen is used inside.

The same poster obtains 0.36 seconds for dgemm with 4096×4096 matrices, and this is totally on par with tkswe88's results (around 0.3 seconds) and mine.

Do you think that canyouseeme's reasoning is correct? In that case it would cast some doubt on yubeic's results. I'm starting to think that we should just stop taking those results into account.
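For what it's worth, applying the same flop count to the 4096-size timings above gives 2 × 4096³ / 0.36 s ≈ 382 GFLOPS, which stays below the 492.8 GFLOPS peak quoted above, so at least those timings look plausible.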

jcolafrancesco commented 5 years ago

canyouseeme is also reporting (22 May 2018, 03:04):

ps. dgemm() runs faster with OPENBLAS_CORETYPE=EXCAVATOR than with HASWELL on the TR 1950X.

I've made some tests with a new C program that uses the cblas API directly:

Target=EXCAVATOR : 
./test_cblas 1000
took 0.026141 seconds 
./test_cblas 2000
took 0.095920 seconds 
./test_cblas 4000
took 0.625256 seconds 
./test_cblas 8000
took 4.839866 seconds 
./test_cblas 10000
took 9.572862 seconds 
./test_cblas 20000
took 74.236080 seconds 

Target=ZEN : 
./test_cblas 1000
took 0.021527 seconds 
./test_cblas 2000
took 0.090566 seconds 
./test_cblas 4000
took 0.613909 seconds 
./test_cblas 8000
took 5.138182 seconds 
./test_cblas 10000
took 9.232099 seconds 
./test_cblas 20000
took 73.998256 seconds 

I'm not observing any significant change.
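For reference, a minimal sketch of the kind of timing harness used for these numbers (this is not the exact test_cblas source, just an illustration that times a single square dgemm of the size given on the command line and prints the resulting GFLOPS):

/* build e.g.: gcc -O2 -o test_cblas test_cblas.c -lopenblas -lpthread */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <cblas.h>

int main(int argc, char **argv) {
    int n = (argc > 1) ? atoi(argv[1]) : 1000;
    double *A = malloc((size_t)n * n * sizeof(double));
    double *B = malloc((size_t)n * n * sizeof(double));
    double *C = malloc((size_t)n * n * sizeof(double));
    for (size_t i = 0; i < (size_t)n * n; i++) { A[i] = 1.0; B[i] = 2.0; C[i] = 0.0; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    /* C = 1.0 * A * B + 0.0 * C, all matrices n x n, row-major */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, A, n, B, n, 0.0, C, n);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    printf("took %f seconds (%.1f GFLOPS)\n", secs, 2.0 * n * n * n / secs * 1e-9);
    free(A); free(B); free(C);
    return 0;
}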

tkswe88 commented 5 years ago

@martin-frbg What specific preprocessor flag is required during compilation of memory.c to restore the optimal performance?

martin-frbg commented 5 years ago

@tkswe88 there should be no specific flag necessary - the active code in memory.c should be in the same state as on February 28, except that the file is now approximately twice as big, as it also contains a reimplementation that uses thread-local storage (which is available by setting USE_TLS=1).

brada4 commented 5 years ago

USE_TLS=1 ... which was accidentally enabled in Makefile.rule briefly, but long enough to land in a release, and should be commented out if found. Also, for the sake of experimentation, you may restore the older memory.c; the outside interfaces have not changed since.
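For illustration, assuming the usual make-variable mechanism, the experimental TLS allocator is selected at build time like this (and left disabled by simply not setting it, i.e. with any USE_TLS line in Makefile.rule commented out):

make USE_TLS=1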

inoryy commented 5 years ago

What's the status of this ticket, and is there a way to help it move forward?

martin-frbg commented 5 years ago

Needs testers especially for Threadripper, and generally someone who knows how to optimize code for Zen. I just bought a Ryzen 2700 and hope to be able to do at least some quick and simple tests in the next few days. (Top on my list is to see how the recent improvements for Haswell map to Ryzen, e.g. optimal settings of SWITCH_RATIO and GEMM_PREFERED_SIZE in param.h)

inoryy commented 5 years ago

Not much experience optimizing code, but I do have access to a TR 1950X and can regularly run tests on it. Could you describe in detail what kind of tests you would need?

martin-frbg commented 5 years ago

I am not that sure myself, but for a start it would probably help to run the dgemm and sgemm benchmarks to see if increasing the SWITCH_RATIO for ZEN in param.h from the default of 4 has a similar impact on performance as it did for HASWELL where it is now 32 - my very first tests suggest that 16 may be optimal for the 8-core Ryzen, but 32 did not appear to be much worse. Second task would be to see if fenrus75's recent improvements in the Haswell xGEMM_BETA and xGEMM_ONCOPY routines (see entries referencing skylake files in kernel/x86_64/KERNEL.HASWELL) are applicable to the Zen architecture despite its allegedly inferior AVX2 hardware. Going beyond that will probably require some in-depth experience with coding to processor specs, possibly trying things like rearranging or replacing avx2 instructions. Some pointers are in https://www.agner.org/optimize/blog/read.php?i=838 but this is a bit outside my expertise.
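To make the first task concrete, the kind of change involved looks roughly like the following (illustrative only; the exact layout of the Zen section of param.h may differ, and 16 is just one of the candidate values to benchmark):

/* param.h, Zen section (sketch) */
#define SWITCH_RATIO    16   /* was 4; try 16 and 32 and re-run benchmark/dgemm.goto */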

tkswe88 commented 5 years ago

There is a new best practice guide from the PRACE project for AMD EPYC architectures on http://www.prace-ri.eu/IMG/pdf/Best-Practice-Guide-AMD.pdf This might be helpful in optimising code.

brada4 commented 5 years ago

That's more about configuring a node in a compute cluster. AMD has not made 3DNow processors in a long time....

tkswe88 commented 5 years ago

@martin-frbg and @brada4 I see adverse performance effects on the TR 1950X when using the memory.c file as provided in the current development version (0.3.6) and running on more than 8 cores. It does not seem to matter whether I define USE_TLS or not. I have also tried commenting out lines 1045-1047 in Makefile.system. When replacing memory.c with the one from #1468, I get good performance again. Could you give a hint on what compile-time settings are required to have the preprocessor construct, from the current memory.c, a version that is identical to that of #1468? I am at a loss with this.

martin-frbg commented 5 years ago

Can you tell if 0.3.5 worked for you? (Some earlier changes that might have affected locking were inadvertently removed in 0.3.5 and put back in develop.) From your reference to #1468 I assume you are compiling with OpenMP, so gcc -DNO_WARMUP -DUSE_OPENMP -DOS_LINUX -DSMP -DSMP_SERVER -E -I../.. memory.c >memory_processed.c should reflect what the compiler sees - this should be very similar to what was in #1468.

martin-frbg commented 5 years ago

Actually #1468 was among the changes that were accidentally dropped in the creation of the "Frankenstein" version of memory.c that combined both TLS and non-TLS versions - this is supposed to have been rectified by #2004 three weeks ago but there may be an interaction with intervening changes from #1785 & #1814

martin-frbg commented 5 years ago

gcc -I../.. -DOS_LINUX -DUSE_THREAD=1 -DUSE_OPENMP=1 -E -P memory.c will probably give the most concise preprocessing output. From this, the crucial difference appears to be in the locking around the loop over memory[position].used (around line 1070 of the old memory.c, or line 2700 of the current version). #1468 had the LOCK_COMMAND(&alloc_lock) around individual calls inside the do loop and conditional on !defined(USE_OPENMP), while the change from PR #1785 - being based on a version that inadvertently missed the changes from #1468 - not only moved the lock outside the loop in response to the observation of too many lock/unlock system calls, but already lacked the conditionals. I seem to have missed this in #2004.
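To illustrate the difference in lock placement described above, here is a schematic using plain pthreads (this is not the actual memory.c code; the structure, function names and buffer count are invented for the sketch):

#include <stdio.h>
#include <pthread.h>

#define NUM_BUFFERS 64

struct alloc_info { int used; };
struct alloc_info memory[NUM_BUFFERS];
pthread_mutex_t alloc_lock = PTHREAD_MUTEX_INITIALIZER;

/* #1468-style: the lock is taken around each individual access inside the
   loop, and skipped entirely when building with OpenMP. */
int scan_fine_grained(void) {
    for (int position = 0; position < NUM_BUFFERS; position++) {
#if !defined(USE_OPENMP)
        pthread_mutex_lock(&alloc_lock);
#endif
        int used = memory[position].used;
#if !defined(USE_OPENMP)
        pthread_mutex_unlock(&alloc_lock);
#endif
        if (!used) return position;
    }
    return -1;
}

/* #1785-style: one lock held across the whole scan (fewer lock/unlock
   calls), but taken unconditionally, i.e. also in OpenMP builds. */
int scan_coarse_grained(void) {
    int found = -1;
    pthread_mutex_lock(&alloc_lock);
    for (int position = 0; position < NUM_BUFFERS; position++) {
        if (!memory[position].used) { found = position; break; }
    }
    pthread_mutex_unlock(&alloc_lock);
    return found;
}

int main(void) {
    memory[0].used = 1;
    printf("fine-grained: %d, coarse-grained: %d\n",
           scan_fine_grained(), scan_coarse_grained());
    return 0;
}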

tkswe88 commented 5 years ago

@martin-frbg I followed #2040 to https://raw.githubusercontent.com/xianyi/OpenBLAS/af480b02a4a45df377acf9be0d6078609bb345c2/driver/others/memory.c, copied this version of memory.c into a snapshot of the openblas development version from Friday and recompiled. This has restored the performance to where I would like it to be. Many thanks for your help and effort!

About a year ago, I presented a first comparison of openblas and blis/flame performance. Since then, blis and libflame have caught up quite a bit and seem to have become thread-safe. In my standard code, I can use either openblas or blis/flame with OpenMP parallelisation. openblas tends to be around 1-4 % faster on small and moderately sized workloads. However, when pure DGEMM performance is required and the matrices become rather big, openblas seems at the moment to be the clear winner. See below for a comparison with 1 and 16 threads. The results were obtained with openblas development and blis versions downloaded last Friday, compiled with gcc 9.0 on Ubuntu Linux 18.04 (4.15 kernel). [attached figure: dgemm_amd_tr1950x]

martin-frbg commented 5 years ago

Thank you very much for the confirmation, merging that PR now.