OpenMathLib / OpenBLAS

OpenBLAS is an optimized BLAS library based on GotoBLAS2 1.13 BSD version.
http://www.openblas.net
BSD 3-Clause "New" or "Revised" License

performance on AMD Ryzen and Threadripper #1461

Open tkswe88 opened 6 years ago

tkswe88 commented 6 years ago

This report comes right from #1425, where the discussion drifted off from thread safety in openblas v. 0.2.20 to performance on AMD Ryzen and Threadripper processors (in this particular case a TR 1950X). It seems worthwhile to discuss this in a separate thread. So far we have had the following discussion:

@tkswe88 : After plenty of initial tests with the AMD TR 1950X processor, it seems that openblas (tested 0.2.19, 0.2.20 and the development version on Ubuntu 17.10 with kernel 4.13, gcc and gfortran v. 7.2) operates roughly 20% slower on the TR1950X than on an i7-4770 (4 cores) when using 1, 2 and 4 threads. This is somewhat surprising given that both CPUs use at most AVX2 and thus should be comparable in terms of vectorisation potential. I have already adjusted the OpenMP thread affinity to exclude that the (hidden) NUMA architecture of the TR1950X causes its lower performance. Other measures I took were 1) BIOS upgrade, 2) Linux kernel upgrade to 4.15, 3) increased DIMM frequency from 2133 to 2666 MHz. Except for the latter, which gave a speedup of roughly 3%, these measures did not have any effect on execution speed. Do you have any idea where the degraded performance on the TR1950X comes from? Is this related to a current lack of optimization in openblas or do we just have to wait for the next major release of gcc/gfortran to fix the problem? Of course, I would be willing to run tests, if this was of help in developing openblas.
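Since thread affinity comes up here, a quick way to double-check that OpenMP pinning actually behaves as intended on the TR1950X (before suspecting OpenBLAS itself) is a tiny test program like the sketch below; this is not part of the original discussion and assumes gcc with -fopenmp and glibc's sched_getcpu().

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sched.h>
    #include <omp.h>

    int main(void)
    {
        #pragma omp parallel
        {
            /* report the logical CPU each OpenMP thread actually runs on */
            printf("OpenMP thread %d of %d runs on CPU %d\n",
                   omp_get_thread_num(), omp_get_num_threads(), sched_getcpu());
        }
        return 0;
    }

Running it with e.g. OMP_PLACES=cores OMP_PROC_BIND=close shows whether the threads stay on one CCX/die as intended.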

@brada4 : AMD has slightly slower AVX and AVX2 units per CPU; it is by no means slow in general, and it still has a heap of cores to spare. Sometimes saturating the AVX2 units means the whole CPU package drops to its base, i.e. non-turbo, frequency.

@martin-frbg: Could also be that getarch is mis-detecting the cache sizes on TR, or the various hardcoded block sizes from param.h for loop unrolling are "more wrong" on TR than they were on the smaller Ryzen. Overall support for the Ryzen architecture is currently limited to treating it like Haswell, see #1133,1147. There may be other assembly instructions besides AVX2 that are slower on Ryzen (#1147 mentions movntp*).

@tkswe88: Are there any tests I could do to find out about cache size detection errors or more appropriate settings for loop unrolling?

@brada4: Regarding what you asked Martin - the copied parameters may need to be doubled or halved, at least here: https://github.com/xianyi/OpenBLAS/pull/1133/files#diff-7a3ef0fabb9c6c40aac5ae459c3565f0

@martin-frbg: You could simply check the autodetected information in config.h against the specification. (As far as I can determine, L3 size seems to be ignored as a parameter). As far as appropriate settings go, the benchmark directory contains a few tests that can be used for timing individual functions. Adjusting the values in param.h (see https://github.com/xianyi/OpenBLAS/pull/1157/files) is a bit of a black art though.
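For timing individual functions outside of the benchmark directory, a minimal hand-rolled DGEMM timer in C looks roughly like the sketch below (the real benchmark sources are more careful about warm-up and repetitions; the problem size and compile line are assumptions):

    /* Build e.g.: gcc -O2 time_dgemm.c -lopenblas */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    #include <cblas.h>

    int main(void)
    {
        const int n = 2000;                        /* arbitrary square size */
        double *a = malloc(sizeof(double) * n * n);
        double *b = malloc(sizeof(double) * n * n);
        double *c = malloc(sizeof(double) * n * n);
        for (long i = 0; i < (long)n * n; i++) { a[i] = 0.5; b[i] = 0.25; c[i] = 0.0; }

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n, 1.0, a, n, b, n, 0.0, c, n);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
        printf("DGEMM n=%d: %.3f s, %.2f GFLOPS\n",
               n, secs, 2.0 * n * n * (double)n / secs / 1e9);
        free(a); free(b); free(c);
        return 0;
    }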

tkswe88 commented 6 years ago

Here, I report on a first test with modified cache sizes.

The cache sizes reported by lstopo are, per core, L1d: 32 KB, L1i: 64 KB, L2: 512 KB, and for each of the four CCXs (4 cores per CCX), L3: 8 MB (i.e. 32 MB in total).

Hence, I performed a test on the current development version.

Assuming L1 and L2 would have to be reported per core and L3 in total, I modified the relevant lines of getarch.c to

    #if defined (FORCE_ZEN)
    #define FORCE
    #define FORCE_INTEL
    #define ARCHITECTURE    "X86"
    #define SUBARCHITECTURE "ZEN"
    #define ARCHCONFIG   "-DZEN " \
                         "-DL1_CODE_SIZE=65536 -DL1_CODE_LINESIZE=64 -DL1_CODE_ASSOCIATIVE=8 " \
                         "-DL1_DATA_SIZE=32768 -DL1_DATA_LINESIZE=64 -DL2_CODE_ASSOCIATIVE=8 " \
                         "-DL2_SIZE=524288 -DL2_LINESIZE=64 -DL2_ASSOCIATIVE=8 " \
                         "-DL3_SIZE=33554432 -DL3_LINESIZE=64 -DL3_ASSOCIATIVE=8 " \
                         "-DITB_DEFAULT_ENTRIES=64 -DITB_SIZE=4096 " \
                         "-DDTB_DEFAULT_ENTRIES=64 -DDTB_SIZE=4096 " \
                         "-DHAVE_MMX -DHAVE_SSE -DHAVE_SSE2 -DHAVE_SSE3 -DHAVE_SSE4_1 -DHAVE_SSE4_2 " \
                         "-DHAVE_SSE4A -DHAVE_MISALIGNSSE -DHAVE_128BITFPU -DHAVE_FASTMOVU -DHAVE_CFLUSH " \
                         "-DHAVE_AVX -DHAVE_FMA3 -DFMA3"
    #define LIBNAME   "zen"
    #define CORENAME  "ZEN"
    #endif

I hope these settings are correctly translated from lstopo. Next, I ran

    sudo make clean
    sudo make TARGET=ZEN USE_OPENMP=1 BINARY=64 FC=gfortran

After this, config.h reads as

    #define OS_LINUX 1
    #define ARCH_X86_64 1
    #define C_GCC 1
    #define __64BIT__ 1
    #define PTHREAD_CREATE_FUNC pthread_create
    #define BUNDERSCORE _
    #define NEEDBUNDERSCORE 1
    #define ZEN
    #define L1_CODE_SIZE 65536
    #define L1_CODE_LINESIZE 64
    #define L1_CODE_ASSOCIATIVE 8
    #define L1_DATA_SIZE 32768
    #define L1_DATA_LINESIZE 64
    #define L2_CODE_ASSOCIATIVE 8
    #define L2_SIZE 524288
    #define L2_LINESIZE 64
    #define L2_ASSOCIATIVE 8
    #define L3_SIZE 33554432
    #define L3_LINESIZE 64
    #define L3_ASSOCIATIVE 8
    #define ITB_DEFAULT_ENTRIES 64
    #define ITB_SIZE 4096
    #define DTB_DEFAULT_ENTRIES 64
    #define DTB_SIZE 4096
    #define HAVE_MMX
    #define HAVE_SSE
    #define HAVE_SSE2
    #define HAVE_SSE3
    #define HAVE_SSE4_1
    #define HAVE_SSE4_2
    #define HAVE_SSE4A
    #define HAVE_MISALIGNSSE
    #define HAVE_128BITFPU
    #define HAVE_FASTMOVU
    #define HAVE_CFLUSH
    #define HAVE_AVX
    #define HAVE_FMA3
    #define FMA3
    #define CORE_ZEN
    #define CHAR_CORENAME "ZEN"
    #define SLOCAL_BUFFER_SIZE 24576
    #define DLOCAL_BUFFER_SIZE 32768
    #define CLOCAL_BUFFER_SIZE 12288
    #define ZLOCAL_BUFFER_SIZE 8192
    #define GEMM_MULTITHREAD_THRESHOLD 4

Finally, I installed using sudo make PREFIX=/usr/local install

After this, config.h has changed and reads as

    #define OS_LINUX 1
    #define ARCH_X86_64 1
    #define C_GCC 1
    #define __64BIT__ 1
    #define PTHREAD_CREATE_FUNC pthread_create
    #define BUNDERSCORE _
    #define NEEDBUNDERSCORE 1
    #define ZEN
    #define L1_CODE_SIZE 32768
    #define L1_CODE_ASSOCIATIVE 8
    #define L1_CODE_LINESIZE 64
    #define L1_DATA_SIZE 32768
    #define L1_DATA_ASSOCIATIVE 8
    #define L1_DATA_LINESIZE 64
    #define L2_SIZE 524288
    #define L2_ASSOCIATIVE 8
    #define L2_LINESIZE 64
    #define L3_SIZE 33554432
    #define L3_ASSOCIATIVE 10
    #define L3_LINESIZE 64
    #define ITB_SIZE 4096
    #define ITB_ASSOCIATIVE 0
    #define ITB_ENTRIES 64
    #define DTB_SIZE 4096
    #define DTB_ASSOCIATIVE 0
    #define DTB_DEFAULT_ENTRIES 64
    #define HAVE_CMOV
    #define HAVE_MMX
    #define HAVE_SSE
    #define HAVE_SSE2
    #define HAVE_SSE3
    #define HAVE_SSSE3
    #define HAVE_SSE4_1
    #define HAVE_SSE4_2
    #define HAVE_SSE4A
    #define HAVE_AVX
    #define HAVE_FMA3
    #define HAVE_CFLUSH
    #define HAVE_MISALIGNSSE
    #define HAVE_128BITFPU
    #define HAVE_FASTMOVU
    #define NUM_SHAREDCACHE 1
    #define NUM_CORES 1
    #define CORE_ZEN
    #define CHAR_CORENAME "ZEN"
    #define SLOCAL_BUFFER_SIZE 24576
    #define DLOCAL_BUFFER_SIZE 32768
    #define CLOCAL_BUFFER_SIZE 12288
    #define ZLOCAL_BUFFER_SIZE 8192
    #define GEMM_MULTITHREAD_THRESHOLD 4

Note that in particular L1_CODE_SIZE was reset to 32KB. Also some of the L?_ASSOCIATIVE values have changed.

Looking at /usr/local/include/openblas_config.h, which was generated during the installation, I see that it has copied the entries of the config.h generated during compilation (i.e. it has the right L1_CODE_SIZE of 64KB).

I have not observed any performance improvement from modifying the cache sizes. But I wonder whether the changes to config.h (L1_CODE_SIZE) during installation may have adverse effects on performance.

I will do more tests using HASWELL targets instead of ZEN in the development version and try to find out about loop unrolling.

martin-frbg commented 6 years ago

I suspect that, as you ran the make install without repeating the previous TARGET=ZEN argument, it re-ran getarch, and that seems to misdetect at least the L1 code size. As no part of the library got rebuilt, and the correct version of openblas_config.h got installed, this should not cause any problems. Edited to add: there does not appear to be any use of L1_CODE_SIZE in the code anyway; even L1_DATA_LINESIZE appears to be used by a few old targets only. It would seem that the only hardware parameters to get right are DTB_ENTRIES (used in level2 BLAS loop unrolling and POTRF) and L2_DATA_SIZE (used for buffer allocation in driver/others/memory.c). Both seem correct in what you wrote above.

tkswe88 commented 6 years ago

@martin-frbg You guessed correctly. Including the TARGET=ZEN argument in the installation step led to the right entries in config.h after installation, but this did not improve performance.

martin-frbg commented 6 years ago

BTW this was also the case in the original LibGoto - L2 size used only to derive xGEMM_P parameters for Core2, Opteron and earlier, L1 and L3 size apparently unused. Seems most of the hardware parameters are detected and reported "just in case" now, but perhaps cpu development has stabilized in the sense that a given type id will no longer have variants that vary in L1 or L2 properties. (There is one oddity in l2param.h where it uses L1_DATA_LINESIZE to determine an offset, but again your value appears to be correct already.)

brada4 commented 6 years ago

Somewhere deep in Wikipedia it is said that the Zen/EPYC/Threadripper generation changes the cache from write-through to write-back. That may mean that the effective cache size easily halves.

martin-frbg commented 6 years ago

Somehow I doubt that, or at least its relevance for the detected cache sizes. On the other hand, I feel there is a need to find out which of the many parameters in param.h and elsewhere, faithfully copied for new cpu models, are actually used in the current code, and how critical their influence is. Starting from the fragment of the param.h change I linked to above, it looks to me that SNUMOPT is completely unused, and DNUMOPT has a single appearance in a debug statement where it seems to be part of a theoretical efficiency factor for the syrk function.

tkswe88 commented 6 years ago

I have been looking a bit more at the suspected performance shortage on the AMD Threadripper 1950X, its reasons and the consequences of thread-oversubscription on the TR1950X.

  1. Regarding the previously reported 20% lower performance of the TR1950X compared to the i7-4770 using 1, 2 and 4 threads, I need to correct that this was for the total runtime of my code for one particular example. For dposv, dsyrk and zgbtrf using openblas 0.2.19, 0.2.20 and 0.3.30, the i7-4770 needs about 30-35% less time than the TR1950X in my example. I stumbled across a document (seemingly from AMD) at http://32ipi028l5q82yhj72224m8j.wpengine.netdna-cdn.com/wp-content/uploads/2017/03/GDC2017-Optimizing-For-AMD-Ryzen.pdf strongly recommending avoiding software prefetch on Ryzen platforms, because it would prevent loop unrolling. The test example written in C and presented in the pdf reports a 15% speedup from not prefetching. However, the presented example was compiled with Microsoft Visual Studio, and it is for this setup that software prefetching prevents loop unrolling. Do you think this might be a potential problem in openblas when compiling with gcc, or is there hardcoded loop unrolling in openblas which would not be susceptible to this? @brada4: On the TR1950X, I monitored the clock speeds of all active cores in /proc/cpuinfo, and under load they all seem to run at 3.75 GHz with little variation (the base frequency of the TR1950X is 3.4 GHz). Under load, the frequencies of the i7-4770 were 3.5 to 3.9 GHz (according to i7z_GUI). So this does not seem to be a big difference.

  2. Using 8-16 threads on the TR1950X, I observed a version-related speedup of 2-5% in parallelised dsyrk and dposv for each of these numbers of threads when switching from v. 0.2.19 to v. 0.2.20 or the development version. For 1 thread, there was no difference. For 4 threads, the improvement was at best 1%. For ZGBTRF and ZGBTRS, I cannot make an educated statement about possible improvements between the versions because of my self-inflicted thread oversubscription (cf. #1425 and below). However, selecting a recent openblas version has advantages unless one oversubscribes the threads (next point).

  3. Regarding the thread-oversubscription (calling parallelised ZGBTRF and ZGBTRS from within an OpenMP parallelised loop in my code), I ran my code linked to openblas 0.2.19, 0.2.20 and the development version. There is an interesting behaviour of the total run-time of this loop with respect to the number of threads and the openblas version:

no. threads | runtime 0.2.19 [s] | runtime 0.2.20 [s] | runtime devel [s]
----------- | ------------------ | ------------------ | -----------------
1           | 35                 | 32.5               | 32.5
4           | 11.7               | 11.9               | 11.8
8           | 8.3                | 14.5               | 25.3
12          | 6.1                | 20.3               | 32.3
16          | 6.1                | 24.3               | 40.1

These numbers are for compilation of openblas and my code using gcc/gfortran v. 7.2 with optimisation level -O2 (-O3 reduces the run times by about 1 s). So, up to 4 threads the performance is comparable, and the thread-oversubscription does not really seem to play a big role. For a larger number of threads, v. 0.2.19 still sees a decent speedup, but for versions 0.2.20 and devel there is a clear deterioration in performance when opting for a higher number of threads. Note that these examples were run after correcting lapack-netlib/SRC/Makefile of versions 0.2.20 and devel and recompiling (cf. #1425). So, I get correct results in all of these tests.

Since there are general improvements in speed when selecting a recent openblas version, I would want to get over the problem with thread-oversubscription. However, I would need sequential versions of ZGBTRF and ZGBTRS inside the aforementioned loop and parallelised versions of DSYRK, DPOSV, etc in other parts of the code. This seems to require compilation of sequential and parallel versions of the openblas library, linking to these and then somehow picking the right (sequential or parallel) versions of the required BLAS and LAPACK routines. Now, if openblas came as a Fortran module, this would be a no-brainer, because one could just use the "use ..., only :: ... => ..." mechanism of Fortran to restrict import of symbols and to re-name them. I have been searching the net for possible linker-related solutions that provide similar mechanisms, but to no avail. Do you have a document or webpage with a solution to this at hand?

martin-frbg commented 6 years ago

Some thoughts -

  1. prefetch should be addressed automatically by gcc for C and Fortran. The Haswell assembly files have a bunch of prefetcht0 instructions that may need looking at (a small illustration of software prefetching follows after this list).
  2. Need to look again at what these two functions call internally.
  3. There have been both LAPACK version updates and pthread safety fixes between each pair of releases, and develop has new GEMM thread scheduling from #1320. In addition, develop has a less efficient AXPY at the moment due to #1332 (which would be easy to revert). But offhand there is nothing I would immediately blame the observed and rather serious loss of performance on. (This may warrant another git bisect...)
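To make the prefetch question concrete, here is a tiny C illustration (not from the thread) of the kind of manual software prefetching the AMD guide warns about; the prefetch distance is an arbitrary placeholder. In the OpenBLAS kernels themselves the corresponding prefetcht0 instructions live in the assembly files, not in C.

    /* Illustration only: explicit software prefetch via GCC's __builtin_prefetch. */
    #include <stddef.h>

    double ddot_prefetch(const double *x, const double *y, size_t n)
    {
        double s = 0.0;
        for (size_t i = 0; i < n; i++) {
            /* hint: read-only access, high temporal locality; prefetches past
               the end do not fault, but a real kernel would still guard the tail */
            __builtin_prefetch(&x[i + 64], 0, 3);
            __builtin_prefetch(&y[i + 64], 0, 3);
            s += x[i] * y[i];
        }
        return s;
    }
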
tkswe88 commented 6 years ago

Fine, I will run git bisect on Monday.

martin-frbg commented 6 years ago

Thanks. BTW there is a utility function openblas_set_num_threads() but according to #803 not all parts of the code honor it.
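If the oversubscription turns out to be the main problem, one way openblas_set_num_threads() could be used around such a loop is sketched below (in C for brevity; the ZGBTRF/ZGBTRS calls are only indicated schematically, and as noted above not every code path honors the setting):

    /* Sketch: run OpenBLAS single-threaded inside an already parallel loop,
       then restore multithreaded BLAS afterwards. */
    #include <omp.h>

    extern void openblas_set_num_threads(int num_threads);

    void factor_all_systems(int nsystems)
    {
        openblas_set_num_threads(1);            /* sequential BLAS inside the loop */
        #pragma omp parallel for
        for (int i = 0; i < nsystems; i++) {
            /* ... call ZGBTRF / ZGBTRS for system i here ... */
        }
        openblas_set_num_threads(omp_get_max_threads());  /* parallel BLAS again */
    }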

martin-frbg commented 6 years ago

ZGBTRF did not change between LAPACK 3.6.1 and 3.8.0, and is mostly IZAMAX + ZSCAL + ZGERU + ZTRSM + ZGEMM. Nothing I'd immediately identify as having undergone any drastic changes since 0.2.19. (Except the post-0.2.20 GEMM thread scheduling mentioned above).

Perhaps a good starting point would be 9e4b697 - a few months after 0.2.19, and shortly before both a big LAPACK update (with risk of collateral damage) and a set of thread safety fixes. If all is well up to that point, it should still display the good performance of 0.2.19. Next would be something like 99880f7 , with my thread fixes in place and the LAPACK update stabilized, immediately before the first attempt at adding Zen support. (You will need the lapack Makefile fix from here on). Then fa6a920 - a few weeks before 0.2.20, nothing significant should have changed in between, and 00c42dc, well past 0.2.20 and shortly before the rewrite of the GEMM thread scheduler.

tkswe88 commented 6 years ago

I have finished bisecting between 0.2.19 and 0.2.20 to find the cause of the performance degradation reported in point 3 (see above) when ZGBTRF and ZGBTRS are called from within an OpenMP parallelised loop. The resulting output is:

87c7d10b349b5be5ba2936bfedb498fe4f991e25 is the first bad commit
commit 87c7d10b349b5be5ba2936bfedb498fe4f991e25
Author: Martin Kroeker <martin@ruby.chemie.uni-freiburg.de>
Date:   Sun Jan 8 23:33:51 2017 +0100

    Fix thread data races detected by helgrind 3.12

    Ref. #995, may possibly help solve issues seen in 660,883

:040000 040000 9f41e2cd82dc83e84b65d32000d6341cc7e417a8 bcd37e483226009ef28ac179f7268fe419e0b73d M driver

As already reported in point 3 (see above), there was an additional performance degradation between 0.2.20 and the development version. Would you like to have the bisection results on that, too, or shall we see whether fixing the problem between 0.2.19 and 0.2.20 removes the additional degradation?

martin-frbg commented 6 years ago

This is bad, as it means we will probably need someone more experienced with thread programming than me to improve this. :-(
(There must be millions of such people, but seeing that my PR went in unchallenged probably none of them on this project) In the worst case, we may be stuck with a decision between fast or safe code. At least it should be only one of the two files affected by that PR that plays a role here - in an OpenMP build, blas_server_omp.c replaces blas_server.c so we should have only my changes in memory.c to worry about. As they were driven by helgrind reports I still believe they were substantially correct, but perhaps they can simply be made conditional on ifndef OPENMP

Given your find, it is entirely possible that the later degradation is from #1299 where I touched the same files again, nine months older but probably no wiser. (A brute force test could be to drop the 0.2.19 memory.c into current develop and see if this restores the original performance - I think it would still compile despite some intervening changes)
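For reference, the pattern being discussed would look roughly like the sketch below: keep the helgrind-motivated locks for pure pthreads builds and compile them out when OpenMP is used. The names follow memory.c, but the function body is schematic, not the actual code.

    #include <stddef.h>
    #include <pthread.h>

    #ifndef USE_OPENMP
    static pthread_mutex_t alloc_lock = PTHREAD_MUTEX_INITIALIZER;
    #endif

    static void *blas_memory_alloc_sketch(void)
    {
        void *buffer = NULL;
    #ifndef USE_OPENMP
        pthread_mutex_lock(&alloc_lock);      /* protect the shared allocation table */
    #endif
        /* ... scan and update the table of memory allocations ... */
    #ifndef USE_OPENMP
        pthread_mutex_unlock(&alloc_lock);
    #endif
        return buffer;
    }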

tkswe88 commented 6 years ago

I have copied driver/others/memory.c from 0.2.19 to the development version and recompiled successfully. This has restored the good performance observed in 0.2.19.

martin-frbg commented 6 years ago

I have now merged a (supposed) fix that uses all the additional locks only in multithreaded builds that do not employ OpenMP. This should restore pre-0.2.20 OpenMP performance without completely reverting the thread safety fixes, hopefully something can be done about their (probable) performance impact on pure pthreads builds in the future. (Though possibly the impact was worst with OpenMP, if the change was adding another layer of locks on top of what OpenMP already imposed)

tkswe88 commented 6 years ago

The new version does not deliver correct results when compiled with

    make TARGET=ZEN USE_OPENMP=1 BINARY=64 COMMON_OPT='-O2 -march=znver1 -mtune=znver1' FC=gfortran
    make TARGET=ZEN USE_OPENMP=1 BINARY=64 COMMON_OPT='-O2 -march=znver1 -mtune=znver1' FC=gfortran PREFIX=/usr/local install

and running more than 1 thread.

martin-frbg commented 6 years ago

Sorry. Seems I added ifdefs around a few locks that were there unconditionally in 0.2.19. Somehow none of the standard tests was able to flag this on my hardware.

tkswe88 commented 6 years ago

Sorry, I still get wrong results.

martin-frbg commented 6 years ago

Hmm, thanks. I had missed one spot near line 1120 that had a blas_unlock from the earlier version still commented out, hopefully this was causing it. Apart from that, there are only cases where I had to move a lock outside an if() block that uses a thread variable in the conditional - I do not see a functional difference but can duplicate the offending code block if necessary.

tkswe88 commented 6 years ago

I have recompiled, but still get wrong results except for when only 1 thread is used.

Hopefully, the latter message can help to locate the problem.

martin-frbg commented 6 years ago

I have reverted the previous commit for now. While that "bad unallocation" message does originate from memory.c, I do not understand how the revised version of my changes could be causing it.

martin-frbg commented 6 years ago

Unfortunately I cannot reproduce your problem with any of the tests, nor with the software I normally use. BLAS-Tester suggests there is currently a problem with TRMV, but this is unrelated to the version of memory.c in use (and may be fallout from attempts to fix #1332).

tkswe88 commented 6 years ago

Do you think there are any further tests I could do, or would you recommend just copying driver/others/memory.c from 0.2.19 into the development version to get higher performance (despite the thread safety issues in that older version)?

martin-frbg commented 6 years ago

I am about to upload another PR where absolutely all locking-related changes will be encapsulated in #if(n)def USE_OPENMP. If that still does not work for you, some simplified testcase will be needed.

tkswe88 commented 6 years ago

With the updated memory.c, the results are fine, but the run-time performance is as degraded as in v. 0.2.20

martin-frbg commented 6 years ago

Weird. I believe a side-by-side comparison of the old and new memory.c will show that they are now functionally equivalent (with respect to locking) for USE_OPENMP=1. Did you do a full rebuild (starting from make clean) ?

martin-frbg commented 6 years ago

Running cpp -I../.. -DUSE_OPENMP -DSMP on both "old" and "new" memory.c definitely leads to functionally equivalent codes with just a few shuffled lines. The only other major difference between the codes is my addition of cgroup support in get_num_procs (for #1155, see PR #1239 for the actual code change in memory.c), perhaps you could try commenting that one out as well.

tkswe88 commented 6 years ago

Sorry for the late response! Running the suggested cpp command, the only change that I seem to see is that in both cpp-ed versions of memory.c the #include preprocessor directives are replaced by numeric codes after #.

martin-frbg commented 6 years ago

Are you comparing the memory.c of 0.2.19 to the latest from the as-yet unmerged PR? There the cpp-processed files should show no replacements of blas_(un)lock by pthread_mutex_(un)lock and no new defines of pthread_mutex_t (only the definition of the alloc_lock moves up some 200 lines, but all new uses should be filtered out by the "ifndef USE_OPENMP"). The deletion of blas_goto_num and blas_omp_num is from an unrelated recent code cleanup patch. All other lines should be the same, except for the two conditionals in blas_memory_free() swapping places - "position < NUM_BUFFERS" now being checked before position is used as an address into the memory array (from #1179 - unfortunately git blame shows the wrong info here, attributing it to the reversion of an unrelated change).

tkswe88 commented 6 years ago

You were right. It seems I looked at the wrong version of memory.c by just taking the latest development version. Sorry for that! I have now downloaded memory.c from #1468 leading to https://github.com/martin-frbg/OpenBLAS/blob/7646974227a51a6c9adc9511593f5630f8fb59ee/driver/others/memory.c Please confirm that this is the right version to look at. Using this version everything seems fine. I get the right results and the run-times show a slight improvement over those for openblas 0.2.19 and gcc 7.2

Referring to point 3 in the list above, the run-times for the OpenMP parallelised loop calling parallelised ZGBTRF and ZGBTRS in my code are now as follows

no. threads | runtime 0.2.19+gcc7.2 [s] | runtime devel+gcc8.0 [s]
----------- | ------------------------- | ------------------------
1           | 35                        | 31.7
4           | 11.7                      | 11.7
8           | 8.3                       | 8.1
12          | 6.1                       | 6.1
16          | 6.1                       | 5.9

In this specific example, the loop does 33 iterations with equal work load and, hence, no improvement can be expected by going from 12 to 16 threads.

Anyhow, even using gcc 8.0 and the latest development version with the PR version of memory.c does not bring the performance for 1 to 4 threads close to that of an i7-4770 (point 1 in the list above). Do you think there would be any value in trying ATLAS to automatically identify potentially optimal settings for Threadripper processors and then importing these findings into openblas?

martin-frbg commented 6 years ago

I suspect ATLAS will be sufficiently different to prevent direct import of findings, but at least it should provide some target numbers for the actual performance of Threadripper Zen cores. Unfortunately I am not aware of anything remotely like a simple list of "do's and don'ts" for Ryzen vs. Haswell coding. It may make sense to profile your program to see where most of the time is "lost", or run some of the benchmarks for individual functions on both platforms - if it is ZGEMM, varying the ZGEMM_DEFAULT_P/Q and/or ZGEMM_DEFAULT_UNROLL values from param.h may show some effect; if it is ZSCAL or one of the other functions listed above, perhaps comparing microkernels for different cpus can provide a hint. (Not that I am at all experienced in this)

MigMuc commented 6 years ago

Regarding ATLAS, I think it will be cumbersome to get all the settings it searches for during compilation. If I had a Ryzen system right now, I would certainly test the gemm performance of the BLIS library with its already optimized implementation for Ryzen CPUs (https://github.com/amd/blis) and check against the findings obtained from the OpenBLAS benchmarks (https://github.com/xianyi/OpenBLAS/tree/develop/benchmark). The BLIS framework already provides optimized blocking parameters for these kernels (https://github.com/amd/blis/blob/master/config/zen/bli_kernel.h). Then one could start varying these values as suggested by @martin-frbg.

tkswe88 commented 6 years ago

@martin-frbg and @MigMuc: Thanks for the feedback! I tried to compile the last stable version of ATLAS (already a year and a half old) today. Despite using different sets of configuration parameters, I could not make it compile without a bunch of errors already popping up during the configuration phase. So, I decided to stop pursuing this thought. I will have another look at BLIS, though my first impression of BLIS and libflame on AMD Threadripper was by far not as good as that of openblas.

tkswe88 commented 6 years ago

Would it be of value to you to have PR #1468 tested on another NUMA system? I have access to a dual socket system with Xeon E5-2640 v4 CPUs.

martin-frbg commented 6 years ago
  1. You'd probably need to pull from the math-atlas project here on github if you wanted a current ATLAS
  2. I have since merged #1468 as I was confident from my comparison of the preprocessed files that it does what it was supposed to do, and you confirmed that it also actually solved the performance problem. If you have cpu time to spare, a test run on another system would still be great.
tkswe88 commented 6 years ago

I have now downloaded the latest ATLAS from math-atlas on github, but still face problems compiling it on Threadripper. So I have given up on this for the moment.

@martin-frbg: Regarding your comment on the importance of setting DTB_ENTRIES in getarch.c to get optimal results from unrolling, I have not made any changes to those parameters. I have tried to look up entries with the acronym DTB in AMD's Software Optimization Guide (http://support.amd.com/TechDocs/55723_SOG_Fam_17h_Processors_3.00.pdf) but could not find anything. What does DTB stand for?

I have started to modify a couple of files from https://github.com/xianyi/OpenBLAS/tree/develop/benchmark to include BLIS and will try to report on the findings soon.

martin-frbg commented 6 years ago

Data Translation Buffer, for mapping between virtual and physical memory addresses, methinks... but all the parameters detected by getarch.c that are actually used in the code appeared to be correct (my first comment above). That would seem to leave the various xGEMM_DEFAULT parameters from param.h, the use of prefetch (and possibly also "align") instructions in the inline assembly, and possible differences in throughput for certain (AVX2?) instructions for experimentation.

tkswe88 commented 6 years ago

I have done some benchmarking in develop/benchmark, calling dgemm and zgemm using a single thread and linking to openblas and blis. The results shown in the figures below vary a bit between runs up to K=L=M=50. I have averaged over 100 runs up to K=L=M=100, 10 runs up to K=L=M=500 and 1 run up to K=L=M=1000. Up until K=L=M~350 blis is 10 to 20% faster in dgemm. The zgemm performance is rather similar, with blis being up to about 3% faster over some ranges of K=L=M.

[Figures: dgemm_amd_tr1950x and zgemm_amd_tr1950x benchmark plots]

tkswe88 commented 6 years ago

It seems that a large part of the difference in performance compared to Intel's Haswell comes from the actual AVX2 support offered by AMD. The relevant wikipedia page (https://en.wikipedia.org/wiki/Zen_(microarchitecture)) says "Zen supports AVX2 but it requires two clock cycles to complete AVX2 instruction compared to Intel's one". There seem to be different interpretations of what AVX2 is, to say the least. I will follow @brada4's advice and compensate using the high number of cores.

Nevertheless, the above figures suggest that there is margin for improvement in the kernels implemented in openblas for zen. I have been looking a bit at param.h as suggested by @martin-frbg and compared the entries there to those of https://github.com/amd/blis/blob/master/config/haswell/bli_kernel.h as suggested by @MigMuc. To my understanding BLIS is based on GotoBLAS/OpenBLAS, or at least both come from the University of Texas at Austin. So, I wonder whether there is any documentation on how to at least tentatively translate the MC, KC, NC, NR and MR parameters in bli_kernel.h to the P, Q and R parameters in param.h. While the parameters in blis seem to be nothing else but block sizes (judging by the relevant papers), I have not found any documentation on P, Q and R in GotoBLAS/OpenBLAS (interestingly Goto uses MC, KC, NC, NR and MR in his paper as well). Do you have any idea about this?

martin-frbg commented 6 years ago
  1. You could try if using one of the (micro)kernel files for earlier (non-AVX2) Intel hardware gives better performance.
  2. I do not think BLIS is derived from OpenBLAS or GotoBLAS, but both K.Goto and Xianyi were postdocs at Austin (possibly both in the group of Robert van de Geijn who features in the documentation of both packages)
  3. For P,Q,R the best hint so far was given in #1136
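One rough reading of #1136 and the Goto paper suggests the correspondence sketched below; it is only meant as a starting point for experiments, and the numeric values are placeholders, not the actual Zen settings in param.h.

    /* Hypothetical param.h-style fragment illustrating the mapping
       (values are placeholders for experimentation):
         BLIS MC      <->  xGEMM_DEFAULT_P  (blocking of the M dimension)
         BLIS KC      <->  xGEMM_DEFAULT_Q  (blocking of the K dimension)
         BLIS NC      <->  xGEMM_DEFAULT_R  (blocking of the N dimension)
         BLIS MR x NR <->  xGEMM_DEFAULT_UNROLL_M x xGEMM_DEFAULT_UNROLL_N */
    #ifdef ZEN
    #define DGEMM_DEFAULT_P 192    /* placeholder */
    #define DGEMM_DEFAULT_Q 192    /* placeholder */
    #define DGEMM_DEFAULT_R 4096   /* placeholder */
    #endif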
tkswe88 commented 6 years ago

I have performed the following tests based on the comments and suggestions in #1136 and the observation that the L2 cache in a TR1950X is twice that of a Haswell processor:

So, these tests were not very comprehensive. But looking at the comparison to blis, it seems the current settings in the development version are quite good. I will stop here and close the issue. Many thanks again for resolving the issues regarding thread safety and the performance of ZGBTRF and ZGBTRS in an OpenMP parallelised loop. Keep up the excellent work!

martin-frbg commented 6 years ago

Thanks for testing. I will see if I can get hold of a Ryzen system for further experimenting in the near future. (One thing that should be easy to do is replacing the choice of dgemm_kernel_4x8_haswell.S for the DGEMMKERNEL in KERNEL.ZEN with its pre-AVX2 counterpart dgemm_kernel_4x8_sandy.S)

tkswe88 commented 6 years ago

Please find some benchmark comparisons to an i7 4770 at the end of the post and a test with the sandybridge kernels in the next paragraph.

I have done the proposed test and replaced line 45 in KERNEL.ZEN by DGEMMKERNEL = dgemm_kernel_4x8_sandy.S. However, this modification leads to error messages when evaluating the accuracy of the test results during compilation:

OPENBLAS_NUM_THREADS=1 OMP_NUM_THREADS=1 ./dblat3 < ./dblat3.dat
TESTS OF THE DOUBLE PRECISION LEVEL 3 BLAS

THE FOLLOWING PARAMETER VALUES WILL BE USED:
  FOR N      0 1 2 3 7 31
  FOR ALPHA  0.0 1.0 0.7
  FOR BETA   0.0 1.0 1.3

ROUTINES PASS COMPUTATIONAL TESTS IF TEST RATIO IS LESS THAN 16.00

RELATIVE MACHINE PRECISION IS TAKEN TO BE 2.2D-16

DGEMM PASSED THE TESTS OF ERROR-EXITS

FATAL ERROR - COMPUTED RESULT IS LESS THAN HALF ACCURATE
    EXPECTED RESULT   COMPUTED RESULT
 1     0.162772          0.150287
THESE ARE THE RESULTS FOR COLUMN 1
*** DGEMM FAILED ON CALL NUMBER: 5512: DGEMM ('N','N', 1, 31, 2, 1.0, A, 2, B, 3, 0.0, C, 2).

DSYMM PASSED THE TESTS OF ERROR-EXITS

FATAL ERROR - COMPUTED RESULT IS LESS THAN HALF ACCURATE
    EXPECTED RESULT   COMPUTED RESULT
 1     0.710259          0.511848
THESE ARE THE RESULTS FOR COLUMN 1
*** DSYMM FAILED ON CALL NUMBER: 418: DSYMM ('R','U', 1, 31, 1.0, A, 32, B, 2, 0.0, C, 2).

DTRMM PASSED THE TESTS OF ERROR-EXITS

DTRMM PASSED THE COMPUTATIONAL TESTS ( 2592 CALLS)

DTRSM PASSED THE TESTS OF ERROR-EXITS

FATAL ERROR - COMPUTED RESULT IS LESS THAN HALF ACCURATE
    EXPECTED RESULT   COMPUTED RESULT
 1     0.106893          0.471354
THESE ARE THE RESULTS FOR COLUMN 9
*** DTRSM FAILED ON CALL NUMBER: 830: DTRSM ('R','U','N','U', 1, 31, 1.0, A, 32, B, 2).

DSYRK PASSED THE TESTS OF ERROR-EXITS

FATAL ERROR - COMPUTED RESULT IS LESS THAN HALF ACCURATE
    EXPECTED RESULT   COMPUTED RESULT
 1     0.413231          0.209279
THESE ARE THE RESULTS FOR COLUMN 1
*** DSYRK FAILED ON CALL NUMBER: 1732: DSYRK ('U','N', 31, 2, 1.0, A, 32, 0.0, C, 32).

DSYR2K PASSED THE TESTS OF ERROR-EXITS

FATAL ERROR - COMPUTED RESULT IS LESS THAN HALF ACCURATE
    EXPECTED RESULT   COMPUTED RESULT
 1    -0.160742         -0.298766
THESE ARE THE RESULTS FOR COLUMN 1
*** DSYR2K FAILED ON CALL NUMBER: 1732: DSYR2K('U','N', 31, 2, 1.0, A, 32, B, 32, 0.0, C, 32).

END OF TESTS
OPENBLAS_NUM_THREADS=1 OMP_NUM_THREADS=1 ./cblat3 < ./cblat3.dat
OPENBLAS_NUM_THREADS=1 OMP_NUM_THREADS=1 ./zblat2 < ./zblat2.dat
OPENBLAS_NUM_THREADS=1 OMP_NUM_THREADS=1 ./zblat3 < ./zblat3.dat
rm -f ?BLAT3.SUMM
OMP_NUM_THREADS=2 ./sblat3 < ./sblat3.dat
rm -f ?BLAT2.SUMM
OMP_NUM_THREADS=2 ./sblat2 < ./sblat2.dat
OMP_NUM_THREADS=2 ./dblat3 < ./dblat3.dat
TESTS OF THE DOUBLE PRECISION LEVEL 3 BLAS

THE FOLLOWING PARAMETER VALUES WILL BE USED:
  FOR N      0 1 2 3 7 31
  FOR ALPHA  0.0 1.0 0.7
  FOR BETA   0.0 1.0 1.3

ROUTINES PASS COMPUTATIONAL TESTS IF TEST RATIO IS LESS THAN 16.00

RELATIVE MACHINE PRECISION IS TAKEN TO BE 2.2D-16

DGEMM PASSED THE TESTS OF ERROR-EXITS

FATAL ERROR - COMPUTED RESULT IS LESS THAN HALF ACCURATE
    EXPECTED RESULT   COMPUTED RESULT
 1     0.162772          0.150287
THESE ARE THE RESULTS FOR COLUMN 1
*** DGEMM FAILED ON CALL NUMBER: 5512: DGEMM ('N','N', 1, 31, 2, 1.0, A, 2, B, 3, 0.0, C, 2).

DSYMM PASSED THE TESTS OF ERROR-EXITS

FATAL ERROR - COMPUTED RESULT IS LESS THAN HALF ACCURATE
    EXPECTED RESULT   COMPUTED RESULT
 1     0.710259          0.511848
THESE ARE THE RESULTS FOR COLUMN 1
*** DSYMM FAILED ON CALL NUMBER: 418: DSYMM ('R','U', 1, 31, 1.0, A, 32, B, 2, 0.0, C, 2).

DTRMM PASSED THE TESTS OF ERROR-EXITS

DTRMM PASSED THE COMPUTATIONAL TESTS ( 2592 CALLS)

DTRSM PASSED THE TESTS OF ERROR-EXITS

FATAL ERROR - COMPUTED RESULT IS LESS THAN HALF ACCURATE
    EXPECTED RESULT   COMPUTED RESULT
 1     0.106893          0.471354
THESE ARE THE RESULTS FOR COLUMN 9
*** DTRSM FAILED ON CALL NUMBER: 830: DTRSM ('R','U','N','U', 1, 31, 1.0, A, 32, B, 2).

DSYRK PASSED THE TESTS OF ERROR-EXITS

FATAL ERROR - COMPUTED RESULT IS LESS THAN HALF ACCURATE
    EXPECTED RESULT   COMPUTED RESULT
 1     0.413231          0.209279
THESE ARE THE RESULTS FOR COLUMN 1
*** DSYRK FAILED ON CALL NUMBER: 1732: DSYRK ('U','N', 31, 2, 1.0, A, 32, 0.0, C, 32).

DSYR2K PASSED THE TESTS OF ERROR-EXITS

FATAL ERROR - COMPUTED RESULT IS LESS THAN HALF ACCURATE
    EXPECTED RESULT   COMPUTED RESULT
 1    -0.160742         -0.298766
THESE ARE THE RESULTS FOR COLUMN 1
*** DSYR2K FAILED ON CALL NUMBER: 1732: DSYR2K('U','N', 31, 2, 1.0, A, 32, B, 32, 0.0, C, 32).

END OF TESTS
OMP_NUM_THREADS=2 ./cblat3 < ./cblat3.dat

and much later

cblas_dgemm PASSED THE TESTS OF ERROR-EXITS

FATAL ERROR - COMPUTED RESULT IS LESS THAN HALF ACCURATE
    EXPECTED RESULT   COMPUTED RESULT
 1     0.246126          0.206295
THESE ARE THE RESULTS FOR COLUMN 1
cblas_dgemm FAILED ON CALL NUMBER: 2515: cblas_dgemm ( CblasColMajor, CblasNoTrans, CblasNoTrans, 1, 9, 2, 1.0, A, 2, B, 3, 0.0, C, 2).
cblas_dgemm FAILED ON CALL NUMBER: 1: cblas_dgemm ( CblasRowMajor, CblasNoTrans, CblasNoTrans, 1, 1, 1, 0.0, A, 2, B, 2, 0.0, C, 2).

FATAL ERROR - TESTS ABANDONED

Just as a comparison, here are DGEMM and ZGEMM benchmarks for the TR1950X using the Haswell kernels and an i7-4770 (Haswell processor), using 1 thread only (and DGEMM additionally for the TR1950X using 16 threads, to demonstrate how well the TR1950X scales with the number of threads):

[Figures: dgemm_amd_tr1950x and zgemm_amd_tr1950x benchmark plots]

All results were computed with the latest development version of OpenBLAS and a gcc v.8 snapshot from last week.

yubeic commented 6 years ago

@tkswe88 Hi, I recently built an experimental Threadripper 1950X based server with 4 GPUs and 128 GB DDR4@2666. But I found the linear algebra performance really bad with both numpy and PyTorch (MAGMA). Even if I switch from MKL to openblas, the linear algebra performance on the 1950X is still slower than even an i7-6700K... Before we discard the Threadripper from our server, I would like to ask if there is any way now (or within a month or so) to make the Threadripper reach about 80% of the speed of an i9-7900X. My impression is that linear algebra like matrix multiplication and matrix factorization is not what the Threadripper is good at for the moment, and the architecture might be too strange for openblas too, but I would like to hear more from the experts. Thanks for your help in advance! The specs looked so good that we forgot there were also MKL issues. :p Further, we would like to confirm whether EPYC cpus have similar issues.

brada4 commented 6 years ago

Are you saying OpenBLAS on threadripper is slower than MKL on threadripper?

yubeic commented 6 years ago

@brada4 Thanks for the response! Sorry about the confusion, I didn't mean that openblas is slower than MKL on the 1950X. While openblas is better than MKL on the 1950X, my benchmark shows that 1950X+openblas is not as good as i7-6700K+MKL. To be fair, I didn't compile openblas specifically for the 1950X, given the low performance. The purpose of my question is to consult the experts and figure out what I should expect from the 1950X given the current openblas development cycle. If the 1950X can achieve at least 80% of the linear algebra performance of the 7900X, we will keep it; otherwise, we would like to switch to Intel until openblas catches up. At this moment this CPU is a huge bottleneck in my system, even for preprocessing to prepare the data for my GPUs. While in general preprocessing may not necessarily use linear algebra, like svd, we do not want to exclude this as an option. Overall, the Threadripper is a good concept that convinced us to trade slower memory and slower single-thread performance for parallel processing power. We hoped that it would work out, but it seems the software is not there yet. Also, I realize the AVX 256-bit splitting, the missing AVX-512, the cache architecture change and the memory controller scheme may pose further challenges for fast dense linear algebra optimization, given my experience in optimizing BLAS many years ago. So I'd be happy to hear something from you guys. :)

martin-frbg commented 6 years ago

Current OpenBLAS will essentially treat both CPUs as Haswell, it is unclear if performance differences (beyond differences in clock frequency) arise from AVX2 limitations in the Zen architecture or other effects. Depending on which functions you are comparing, and what matrix sizes you are using, it is also very probable that you are seeing fundamental differences in performance between MKL and OpenBLAS that are not specific to the AMD hardware.

tkswe88 commented 6 years ago

@yubeic Your question is far too general to be answered in a satisfactory and fair way. There are many different factors involved, and it is very difficult to guess how they could affect performance on your system. Nevertheless, I will try to answer point by point:

1) numpy and Pytorch (Magma): I code in Fortran2008, so I cannot help you here.

2) mkl and openblas: OpenMP or thread parallelism with mkl routines seems fine, but support of the AVX(2) units in the TR1950X by mkl may be a problem (I only have a version from 2017, so no idea whether this was improved in a 2018 version of mkl). In my opinion openblas is the best tuned linear algebra package for the TR1950X at the moment. For some lapack routines, however, the openblas routines do not seem to be OpenMP parallelised or vectorised, which quite heavily impacts performance (one example is Cholesky factorization using DPPSV; I just replaced this by a conversion to general format using DTPTTR and factorization using DPOSV and everything was good). Note that this is not TR1950X specific, but applies to all CPUs. So looking into this a bit for the LAPACK and BLAS routines you use may have a huge impact. @brada4 and @martin-frbg may want to correct me here.

3) i7 6700k: I can imagine that the single-thread to four-thread performance of this system for highly vectorized loads (worst case DGEMM) is slightly better than that of the i7-4770 that I used for comparison. Nevertheless, if the vector units limit you, you can always just use more threads of the TR1950X to outperform the i7 6700k, or just use your GPUs (a sufficiently good GPU should also outperform an i9 7900X in highly vectorial loads). However, by far not all parts of an application are vectorial. So, you had better look at the total runtime, which also depends on bandwidth. For my main application, I reach the break-even point at about 5 to 5.5 TR1950X cores when aiming for the same total run-time as with four i7-4770 cores. Of course, this also varies a bit with the problem size.

4) UMA vs. NUMA (not reported in this thread previously, I think): This is not one of your points, but it can boost performance by 20 per cent in some cases. The standard setting for the TR1950X is UMA. Unfortunately, this cannot be changed to NUMA in the BIOS of at least ASUS mobos, and one needs a Windows installation to use the AMD tool. Nevertheless, switching to NUMA, ensuring thread affinity with the appropriate OpenMP settings and running on four threads of one CCX gave a performance boost of 20% in a part of my standard computations that is highly bandwidth dependent (i.e. where I need to access many different arrays and structures to assemble and factorise complex valued matrices) and did not change the performance in other parts of the code. At 8 cores on two CCXs, NUMA is still about 10% faster than UMA for this specific part of my code. Anyhow, you would have to test this. It really depends on the code, how well it is parallelised and how rigorously the first-touch principle was or could be followed (a small first-touch sketch follows after this list). One may argue that having a NUMA architecture on a workstation is a drawback, because a bit of care is required in code design. I chose the TR1950X deliberately because of this feature. The bigger machines that I have access to are NUMA, so testing the NUMA performance on my workstation seems to be a good idea to me.

5) i9-7900X: I do not have access to such a system for comparison.
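As a small illustration of the first-touch principle mentioned in point 4 (not part of the original post; assumes gcc with -fopenmp): initialise an array with the same thread layout that will later work on it, so each memory page ends up on the node of the thread that first writes it.

    #include <stdlib.h>

    /* Allocate and initialise in parallel so pages are distributed across
       NUMA nodes according to the threads that will later use them. */
    double *alloc_first_touch(size_t n)
    {
        double *x = malloc(n * sizeof(double));
        #pragma omp parallel for schedule(static)
        for (size_t i = 0; i < n; i++)
            x[i] = 0.0;          /* the first write places the page locally */
        return x;
    }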

Again, your question is too general to answer.

My TR1950X has been running stably and reliably thus far. So, I am quite happy with it. Sure, when it comes to single-threaded AVX2 workloads, it is slower than some Intel offerings, but I guess it is vectorisable work loads that you must have bought those 4 GPUs for, right?

brada4 commented 6 years ago

@tkswe88 you are 99.9% correct

  1. MKL is not tuned for AMD at all... It uses SSE2 at best....
  2. It is a BIOS defect.... Worth asking ASUS; it is not only openblas that suffers. (And, between the lines, the appropriate OMP settings are unlikely to affect OpenBLAS in the proper way)
tkswe88 commented 6 years ago

@brada4 on point 4: The absence of the UMA/NUMA switch in the BIOS seems to be deliberate. Whatever OpenMP settings regarding thread affinity and number of threads I make, OpenBLAS seems to honour them.