OpenMathLib / OpenBLAS

OpenBLAS is an optimized BLAS library based on GotoBLAS2 1.13 BSD version.
http://www.openblas.net
BSD 3-Clause "New" or "Revised" License

Test and tune for Zen 2 #2180

Open TiborGY opened 5 years ago

TiborGY commented 5 years ago

Zen 2 is now released, bringing a number of improvements to the table. Most notably, it now has 256 wide AVX units. This should in theory allow performance parity with Haswell-Coffee Lake CPUs, and initial results suggest this is true (at least for single thread). https://i.imgur.com/sFhxPrW.png

The chips also have double the L3 cache, and a generally reworked cache hierarchy. One thing to note, is that these chips do not have enough TLB cache to cover all of L2 and L3, so hugepages might be a little more important.

I might be able to get my hands on a Zen 2 system in ~1-2 months.

brada4 commented 5 years ago

It is not L3 cache per core or NUMA domain, it is per socket; effectively more like 1-2 MB per core, in place of Haswell's 2.5 MB. The L1 is smaller than Zen 1's and actually matches Haswell's. Probably neither generation has been tuned yet, not even Zen 1; there were just some lengthy discussions about how to work around a BIOS with broken NUMA support.

TiborGY commented 5 years ago

It is not L3 cache per core or NUMA domain, it is per socket, like 1-2MB per core, in place of haswell's 2.5MB

I have no idea what you are talking about. The 3700X has 8 cores and a total of 32 MiB of L3. Internally, each cluster of 4 cores shares its L3, so it's more like 2x16 MiB of L3. That still works out to 4 MiB of L3 per core. No idea where you are getting the 1-2 MiB from.

L3 cache is not shared between the 4-core core complexes (CCXs), not even within the same die.

wjc404 commented 5 years ago

@TiborGY I also found that kernel tuning is required for Zen 2. I tested single-threaded dgemm performance of OpenBLAS (target=Haswell) on a Ryzen 7 3700X at a fixed 3.0 GHz clock and got ~33 GFLOPS, which is far behind the theoretical maximum (48 GFLOPS at 3.0 GHz). By the way, I also tested my own dgemm subroutine and got ~44 GFLOPS.

wjc404 commented 5 years ago

(AIDA64 cache & memory test, R7-3700X.) The L3 in the R7-3700X is fast, but memory latency is still a problem. I think the enlarged L3 allows larger blocks of matrix B to be packed, reducing the bandwidth needed for accessing matrices A and C and mitigating the slow memory access.

wjc404 commented 5 years ago

I read the code of OpenBLAS's Haswell dgemm kernel and found the 2 most common FP arithmetic instructions are vfmadd231pd and (chained) vpermpd. I roughly tested the latency of vfmadd231pd and vpermpd on the i9-9900K and R7-3700X and found that vfmadd231pd has a latency of 5 cycles on both CPUs; for vpermpd, however, the latency on the R7-3700X (6 cycles) is double that on the 9900K (3 cycles). I suspect the performance problem on Zen 2 comes from the vpermpd instructions. test_program.tar.gz

martin-frbg commented 5 years ago

Interesting observation. I now see this doubling of latency for vpermpd mentioned in Agner Fog's https://www.agner.org/optimize/instruction_tables.pdf for Zen, so this apparently still applies to Zen 2 as well (and it is obviously relevant for the old issue #1461).

TiborGY commented 5 years ago

The speed of L3 in r7-3700X is fast, but the memory latency is still a problem.

The reason why your memory latency is sky high is your memory clock. 2133 MHz is a huge performance nerf for Ryzen CPUs, because the internal bus that connects the cores to the memory controller (and to each other) runs at 1/2 the memory clock. (This bus is conceptually similar to Intel's mesh/uncore clock.)

102 ns is crazy high, even for Ryzen. IMO 2400 MHz should be the bare minimum anyone uses, and even that only because ECC UDIMMs are hard to find above that speed. If someone is not using ECC, 2666 or even 3000 MHz is very much recommended. You could easily shave 20 ns off the figure you measured.

wjc404 commented 5 years ago

I removed the vperm instructions in the macros "KERNEL4x12_M1", "KERNEL4x12_M2", "KERNEL4x12_E" and "KERNEL4x12_SUB" of the file "dgemm_kernel_4x8_haswell.S" and recompiled OpenBLAS, and saw a 1/4 speedup in a subsequent dgemm test (of course the results were no longer meaningful), which demonstrates that the performance penalty comes from vpermpd. (screenshot: 2019-07-15 11-37-29)

(test on r7-3700x, 1thread, 3.6GHz)

On the R5-1600 the performance degradation is not significant (OpenBLAS (zen) gave 27 GFLOPS where the theoretical maximum is 29 GFLOPS for 1 thread), probably because the halved throughput of FMA instructions on Zen 1 hides the latency of vpermpd.

wjc404 commented 5 years ago

I also tested the latencies of some other AVX instructions on the R7-3700X, in a way similar to my previous test of vpermpd. The results are as follows:

Instruction   vblendpd   vperm2f128   vshufpd
Latency       1 cycle    3 cycles     1 cycle

The expensive vpermpd can be replaced by a proper combination of the 3 instructions tested above (vblendpd and vshufpd should also be cheap on common Intel CPUs).

brada4 commented 5 years ago

but the memory latency is still a problem

Are you serious? Memory at X GHz serves only that many words per second; there is no shortcut. (Well, there is one, called cache.)

wjc404 commented 5 years ago

I changed 8 vpermpd instructions to vshufpd in the first 4 "KERNEL4x12_*" macros in the file "dgemm_kernel_4x8_haswell.S" and got a 1/4 speedup while maintaining correct results. dgemm_kernel_4x8_haswell.S.txt (screenshot: 2019-07-15 13-37-34)
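For readers following the diff, the substitution works because the vpermpd patterns in the multiply part of this kernel only swap neighbouring doubles within each 128-bit lane, which the cheaper in-lane shuffle can also express. An illustrative sketch in AT&T syntax (the exact immediates and registers in the real kernel may differ):

```asm
# vpermpd $0xb1 turns [d0,d1,d2,d3] into [d1,d0,d3,d2]: a swap of
# neighbours inside each 128-bit lane.
vpermpd $0xb1, %ymm0, %ymm1          # ~6-cycle latency on Zen 2
# The same result via an in-lane shuffle of the register with itself:
vshufpd $0x05, %ymm0, %ymm0, %ymm1   # 1-cycle latency on Zen 2
# Cross-lane patterns such as vpermpd $0x1b (full reverse) cannot be done
# by vshufpd alone; they need e.g. vperm2f128 plus vshufpd.
```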

wjc404 commented 5 years ago

I then modified the macro "SAVE4x12" in a similar way and got a further 0.3% performance improvement. Now the performance is about 9/10 of the theoretical maximum. dgemm_kernel_4x8_haswell.S.txt (screenshot: 2019-07-15 14-36-52)

wjc404 commented 5 years ago

Test of more AVX(2) instructions on doubles on the R7-3700X (1 thread at 3.6 GHz): (screenshot: 2019-07-15 23-52-10) test_of_common_avx2_instructions.zip

Instruction    IPC   Latency
vpermpd        0.8   6 cycles
vblendpd       2.0   1 cycle
vperm2f128     1.0   3 cycles
vshufpd        2.0   1 cycle
vfmadd231pd    2.0   5 cycles
vaddpd         2.0   3 cycles
vmulpd         2.0   3 cycles
vhaddpd        0.5   6-7 cycles

A similar test on the i9-9900K (1 thread, 4.4 GHz). The chained vfmadd231pd test ran endlessly, so it was removed; luckily I had measured it previously with different code. (screenshot: 2019-07-16 15-38-59)

Instruction    IPC   Latency
vpermpd        1.0   3 cycles
vblendpd       3.0   1 cycle
vperm2f128     1.0   3 cycles
vshufpd        1.0   1 cycle
vfmadd231pd    2.0   ~5 cycles (previous test)
vaddpd         2.0   4 cycles
vmulpd         2.0   4 cycles
vhaddpd        0.5   6 cycles

wjc404 commented 5 years ago

I also found that alternating vaddpd and vmulpd in the test code can reach a total IPC of 4 on Zen 2, versus only 2 on the i9-9900K.

wjc404 commented 5 years ago

A simple test of AVX load & store instructions on packed doubles on the R7-3700X (3.6 GHz, 1 thread): (screenshot: 2019-07-16 14-25-08) test_load_store_avx_doubles.zip

Instruction (AT&T)          Max IPC
vmovapd mem,reg             2
vmovupd mem,reg             2
vmaskmovpd mem,reg,reg      2
vbroadcastsd mem,reg        2
vmovapd reg,mem             1
vmovupd reg,mem             1
vmaskmovpd reg,reg,mem      1/6

wjc404 commented 5 years ago

The same load/store test on the i9-9900K (4.4 GHz, 1 thread) showed the same maximum IPCs as the R7-3700X, except for "vmaskmovpd reg,reg,mem" (IPC = 1). (screenshot: 2019-07-16 15-16-34)

wjc404 commented 5 years ago

Unlike vpermpd, vpermilpd shares the same latency and IPC as vshufpd on the R7-3700X, so it can also replace vpermpd in some cases.

wjc404 commented 5 years ago

Data sharing between CCXs is still problematic. Synchronization latencies of shared data between cores, tested on the R7-3700X (3.6 GHz): (screenshot: 2019-07-31 07-05-20)

Here's the code: INTER-CORE LATENCY.zip

The same test on the i9-9900K: (screenshot: 2019-07-31 07-11-53)

wjc404 commented 5 years ago

Synchronization bandwidths of shared data between cores, tested on the R7-3700X (3.6 GHz): (screenshot: 2019-07-31 12-50-34)

The same test on the i9-9900K: (screenshot: 2019-07-31 12-40-03)

Code: INTER-CORE BANDWIDTH.zip

brada4 commented 5 years ago

AMD looks like 4-core clusters? Does that show up anywhere in the NUMA tables?

TiborGY commented 5 years ago

AMD looks like 4-core clusters ? Does it get seen in NUMA tables anywhere?

It is accurately shown by lstopo: the L3 cache is not shared between CCXs. But it shows up as a single NUMA node, since memory access is uniform for all cores, so technically it is not a NUMA setup.

TiborGY commented 5 years ago

@wjc404 What fabric clock (FCLK) are you running? The inter-core bandwidth between the CCXs is probably largely determined by FCLK.

brada4 commented 5 years ago

Well, it is not exposed, but 3x faster... It is quite important that the same data does not get dragged around the outer cache without need. There is no real software exposure; probably the best one can do is hack thread affinity so that all threads stay within the same cluster with its shared L3.

wjc404 commented 5 years ago

Sorry, I don't know where to find the FCLK frequency. It should be the default one for a 3.6 GHz CPU clock.

martin-frbg commented 5 years ago

I believe AMD put in some effort to make the Linux and Windows 10 schedulers aware of the special topology. OpenBLAS itself probably has little chance to create a "useful" default affinity map on its own without knowing the "bigger picture" of what kind of code it was called from and what the overall system utilization is. Perhaps a wiki page collecting links to Ryzen "best practices" whitepapers like https://www.suse.com/documentation/suse-best-practices/singlehtml/optimizing-linux-for-amd-epyc-with-sle-12-sp3/optimizing-linux-for-amd-epyc-with-sle-12-sp3.html#sec.memory_cpu_binding or the PRACE document linked in https://github.com/xianyi/OpenBLAS/issues/1461#issuecomment-455822566 might be useful.

(I think FCLK is proportional to the clock speed of the RAM installed in a particular system, so it could be that the DDR4-2133 memory shown in your AIDA screenshot leads to less than optimal performance of the interconnect.)
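Until the library can do such placement automatically, callers can experiment with explicit binding themselves, in the spirit of the documents linked above. Hypothetical commands (the binary name is a placeholder; core numbering is machine-specific, so check lstopo first):

```shell
# Keep all four OpenBLAS threads inside one CCX so they share one L3.
# On a 3700X, cores 0-3 typically form one CCX and cores 4-7 the other.
OPENBLAS_NUM_THREADS=4 taskset -c 0-3 ./my_blas_app
numactl --hardware   # inspect the topology the scheduler sees
```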

TiborGY commented 5 years ago

The FCLK is the clock for the fabric between the core chiplet(s) and the IO die. (I think it is also the bus responsible for communication between the CCXs.) The FCLK is set by the motherboard firmware; under normal circumstances this should mean exactly 1/2 of the memory clock. So on most motherboards, memory speed will directly alter CCX-to-CCX latency and bandwidth. Memory write bandwidth is also highly dependent on FCLK. Going from 2133 to 3200 should increase the bandwidth between CCXs by about 50%, if the motherboard correctly keeps FCLK in sync with the memory speed.

It is possible to run with a desynchronized FCLK, but it is very undesirable, as it increases memory latency by about 20 ns and generally worsens performance. Motherboards should default to keeping the FCLK in sync with the memory speed from 2133 to 3600 MHz. However, I have heard that some motherboards have had firmware bugs and sometimes desynced the FCLK for no good reason.


wjc404 commented 5 years ago

@TiborGY Thanks for the guidance. The FCLK frequency setting in my BIOS is AUTO. On Windows 10, the Ryzen Master utility shows a fabric clock of 1200 MHz.

martin-frbg commented 5 years ago

So by replacing your memory with DDR4-3600 you could increase FCLK to 1800 MHz, which would make the cross-CCX transfers look less ugly (though at an added cost of something like $150 per 16 GB).

TiborGY commented 5 years ago

Officially, Zen 2 only supports up to 3200 MHz memory. In practice 3600 seems fine; beyond that you start running into issues with the fabric getting unstable, depending on your luck in the silicon lottery. For this reason motherboards seem to default to a desynced FCLK if you apply an XMP profile faster than 3600. On a serious workstation I would probably not risk going beyond 3200. Memory stability is notoriously hard to stress test, and I would guess the same applies to fabric stability. This has a silver lining though: 3200 MHz RAM is not too expensive, unless you want very tight memory timings (CL14-CL15).


brada4 commented 5 years ago

It is HyperTransport (Intel's rough equivalent is QPI). Though I have no idea how the modern one handles clocking, power saving, etc.

TiborGY commented 5 years ago

It is HyperTransport (intels rough equivalent is QPI). Though no idea how modern one does around clocking/powersaving etc....

Not anymore; it was HyperTransport before Zen. The official marketing name for the current fabric is "Infinity Fabric".

brada4 commented 5 years ago

It is not userspace-programmable. If the scheduler knows about the topology, we might be able to just group threads into cluster-sized groups accessing the same pieces of memory, avoiding L3-to-L3 copies.

It claims roughly 40 GB/s; duplex, half each way? Are you at the optimum already?

marxin commented 4 years ago

test_of_common_avx2_instructions.zip

Instruction    IPC   Latency
vpermpd        0.8   6 cycles
vblendpd       2.0   1 cycle
vperm2f128     1.0   3 cycles
vshufpd        2.0   1 cycle
vfmadd231pd    2.0   5 cycles
vaddpd         2.0   3 cycles
vmulpd         2.0   3 cycles
vhaddpd        0.5   6-7 cycles

Hello.

Thank you very much for the measurement script. I modified that a bit and pushed here; https://github.com/marxin/instruction-tester

For the znver1 CPU (model name: AMD Ryzen 7 2700X Eight-Core Processor), I got different numbers:

make test
gcc -march=haswell testinst.S testinst.c -o testinst
./testinst
CPU frequency: 4.30 GHz
GOPs per second for vpermpd indep. instructions: 2.137337e+00, rec. throughput: 2.01
GOPs per second for vpermpd chained instructions: 2.150827e+00, latency: 2.00

GOPs per second for vpermilpd indep. instructions: 4.301699e+00, rec. throughput: 1.00
GOPs per second for vpermilpd chained instructions: 4.296690e+00, latency: 1.00

GOPs per second for vblendpd indep. instructions: 4.298875e+00, rec. throughput: 1.00
GOPs per second for vblendpd chained instructions: 4.301755e+00, latency: 1.00

GOPs per second for vperm2f128 indep. instructions: 1.435560e+00, rec. throughput: 3.00
GOPs per second for vperm2f128 chained instructions: 1.439942e+00, latency: 2.99

GOPs per second for vshufpd indep. instructions: 4.296961e+00, rec. throughput: 1.00
GOPs per second for vshufpd chained instructions: 4.296540e+00, latency: 1.00

GOPs per second for vfmadd231pd indep. instructions: 4.296248e+00, rec. throughput: 1.00
GOPs per second for vfmadd231pd chained instructions: 8.651844e-01, latency: 4.97

GOPs per second for vaddpd indep. instructions: 4.286476e+00, rec. throughput: 1.00
GOPs per second for vaddpd chained instructions: 1.443964e+00, latency: 2.98

GOPs per second for vmulpd indep. instructions: 4.304053e+00, rec. throughput: 1.00
GOPs per second for vmulpd chained instructions: 1.086745e+00, latency: 3.96

GOPs per second for vhaddpd indep. instructions: 1.433505e+00, rec. throughput: 3.00
GOPs per second for vhaddpd chained instructions: 6.227662e-01, latency: 6.90

I verified the numbers against Agner Fog's instruction tables and we got the same numbers. I'm also sending numbers for znver2 (model name: AMD EPYC 7702 64-Core Processor):

$ make test
gcc -march=haswell testinst.S testinst.c -o testinst
./testinst
CPU frequency: 3.35 GHz
GOPs per second for vpermpd indep. instructions: 2.582493e+00, rec. throughput: 1.30
GOPs per second for vpermpd chained instructions: 5.568692e-01, latency: 6.02

GOPs per second for vpermilpd indep. instructions: 6.679462e+00, rec. throughput: 0.50
GOPs per second for vpermilpd chained instructions: 3.340770e+00, latency: 1.00

GOPs per second for vblendpd indep. instructions: 6.682278e+00, rec. throughput: 0.50
GOPs per second for vblendpd chained instructions: 3.338153e+00, latency: 1.00

GOPs per second for vperm2f128 indep. instructions: 3.339144e+00, rec. throughput: 1.00
GOPs per second for vperm2f128 chained instructions: 1.113484e+00, latency: 3.01

GOPs per second for vshufpd indep. instructions: 6.679295e+00, rec. throughput: 0.50
GOPs per second for vshufpd chained instructions: 3.338552e+00, latency: 1.00

GOPs per second for vfmadd231pd indep. instructions: 6.677935e+00, rec. throughput: 0.50
GOPs per second for vfmadd231pd chained instructions: 6.681326e-01, latency: 5.01

GOPs per second for vaddpd indep. instructions: 6.679347e+00, rec. throughput: 0.50
GOPs per second for vaddpd chained instructions: 1.113059e+00, latency: 3.01

GOPs per second for vmulpd indep. instructions: 6.681665e+00, rec. throughput: 0.50
GOPs per second for vmulpd chained instructions: 1.113511e+00, latency: 3.01

GOPs per second for vhaddpd indep. instructions: 1.670478e+00, rec. throughput: 2.01
GOPs per second for vhaddpd chained instructions: 5.135085e-01, latency: 6.52

marxin commented 4 years ago

I then modified the macro "SAVE4x12" in a similar way and got 0.3% performance improvement. Now the performance is about 9/10 of theoretical maximum. dgemm_kernel_4x8_haswell.S.txt

Hey. Can you please share the benchmark so that I can test it on my machines ;) ? Thanks.

marxin commented 4 years ago

Hey.

I've just prepared a comparison on one znver1 and one znver2 machine for all releases from 0.3.3 to 0.3.8. I used the following script: https://github.com/marxin/BLAS-Tester/blob/benchmark-script/test-all.py

which runs BLAS-Tester binaries with the following arguments:

$ ./test-all.py
1/12: taskset 0x1 ./bin/xsl1blastst -R all -N 67108864 67108864 1 -X 5 1 1 1 1 1
2/12: taskset 0x1 ./bin/xdl1blastst -R all -N 67108864 67108864 1 -X 5 1 1 1 1 1
3/12: taskset 0x1 ./bin/xcl1blastst -R all -N 67108864 67108864 1 -X 5 1 1 1 1 1
4/12: taskset 0x1 ./bin/xzl1blastst -R all -N 33554432 33554432 1 -X 5 1 1 1 1 1
5/12: taskset 0x1 ./bin/xsl2blastst -R all -N 8192 8192 1 -X 5 1 1 1 1 1
6/12: taskset 0x1 ./bin/xdl2blastst -R all -N 8192 8192 1 -X 5 1 1 1 1 1
7/12: taskset 0x1 ./bin/xcl2blastst -R all -N 8192 8192 1 -X 5 1 1 1 1 1
8/12: taskset 0x1 ./bin/xzl2blastst -R all -N 4096 4096 1 -X 5 1 1 1 1 1
9/12: taskset 0x1 ./bin/xsl3blastst -R all -N 2048 2048 1 -a 5 1 1 1 1 1
10/12: taskset 0x1 ./bin/xdl3blastst -R all -N 2048 2048 1 -a 5 1 1 1 1 1
11/12: taskset 0x1 ./bin/xcl3blastst -R all -N 1024 1024 1 -a 5 1 1 1 1 1 1 1 1 1 1
12/12: taskset 0x1 ./bin/xzl3blastst -R all -N 1024 1024 1 -a 5 1 1 1 1 1 1 1 1 1 1

All numbers are collected here: https://docs.google.com/spreadsheets/d/1Xb3HWbsEuMeMf1mfRPP1AdnQTYxGU-7Rmm-khzMxz98/edit#gid=228273818 (the spreadsheet contains 3 sheets).

Based on the numbers I was able to identify the following problems:

  1. I found a typo in isamax; the speed will be restored once #2414 is merged
  2. there is a speed drop of ~5% for GEMM, SYMM, SYR2K, SYRK and TRMM after 92b10212de6972c808ebeccfe9fac0a82012e94e (#2361, @wjc404); I also verified that locally on my AMD Ryzen 7 2700X machine
  3. there is a speed drop for both znver1 and znver2 after 28e96458e5a4b2d8039ed16048a07892a7c960bf (#2190, @wjc404); the patch was supposed to speed things up; I can confirm vpermilpd has lower latency (and higher throughput), but for some reason it is slower in the benchmark

I'm going to bisect other performance issues. Feel free to comment on the selected benchmarking workloads.

wjc404 commented 4 years ago

@marxin I did most of the SGEMM and DGEMM benchmarks with the 2 programs "sgemmtest_new" and "dgemmtest_new" in my repository GEMM_AVX2_FMA3. When using them on Zen processors, please set the environment variable MKL_DEBUG_CPU_TYPE to 5. For benchmarking level 3 subroutines, monitoring the CPU frequency is recommended (if not done before), as thermal throttling can affect the results.
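For reference, a hypothetical invocation along those lines (the binary name comes from the GEMM_AVX2_FMA3 repository; MKL_DEBUG_CPU_TYPE=5 steered MKL builds of that era onto their AVX2 code path on AMD CPUs, and the variable was later removed from MKL):

```shell
# Run wjc404's dgemm benchmark with MKL forced onto its AVX2 path on AMD.
MKL_DEBUG_CPU_TYPE=5 ./dgemmtest_new
```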

marxin commented 4 years ago

@marxin I did most of the SGEMM and DGEMM benchmarks with the 2 programs "sgemmtest_new" and "dgemmtest_new" in my repository GEMM_AVX2_FMA3. When using them on Zen processors, please set the environment variable MKL_DEBUG_CPU_TYPE to 5.

Ok, I see the program depends on an MKL header file (and needs to be linked against MKL). Can you please make the code more portable? It would be great to have it as part of this repository or BLAS-Tester.

For benchmarking level3 subroutines, monitoring CPU frequency is recommended (if it is never done before) as thermal throttling can affect results.

Sure. One difference is that you probably use OpenMP with multiple threads, am I right? Can you please re-test the numbers with the corresponding GEMM test in BLAS-Tester?

martin-frbg commented 4 years ago

@marxin couldn't you use the provided binaries from wjc404's repo (which also have MKL statically linked)? And I seem to recall performance figures were obtained for both single and multiple threads.

wjc404 commented 4 years ago

@marxin If you have confirmed a significant performance drop of SGEMM (especially in serial execution with dimensions > 4000) on zen/zen+ chips after PR #2361, then you can try to specify different SGEMM kernels for zen and zen2 (probably by editing "KERNEL.ZEN" & "param.h" and modifying the CPU detection code, to choose "sgemm_kernel_16x4_haswell.S" for zen/zen+ and "sgemm_kernel_8x4_haswell.c" for zen2) and make it a PR. Unfortunately I cannot access Google's website from China to download your results. Currently I don't have a machine with a zen/zen+ CPU to test on. I would be grateful if you could figure out the reason for the SGEMM performance drop (memory-bound or core-bound factors?) so I can modify the new kernel code accordingly to improve its compatibility with older Zen processors.

martin-frbg commented 4 years ago

I believe the speed drops in xDOT after 0.3.6 might be due to #1965, if they are not just an artefact. If I read your table correctly, your figures for DSDOT/SDSDOT are even worse than for ZDOT, and those definitely did not receive any changes except that fix for undeclared clobbers. (Possibly the compiler was able to apply some dangerous optimizations before.)

martin-frbg commented 4 years ago

@wjc404 this is marxin's spreadsheet exported from the google docs site in .xlsx format OpenBLAS - AMD ZEN.xlsx

wjc404 commented 4 years ago

@martin-frbg Thanks. @marxin Maybe the changed settings in "param.h" played a role; I didn't realize this since I have never had the chance to test SGEMM on EPYC CPUs. Could you try with SGEMM_DEFAULT_P = 640 and SGEMM_DEFAULT_Q = 448 (or even larger)? (Modify lines 669, 675 and 678 in param.h and recompile OpenBLAS 0.3.8.)
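The suggested experiment amounts to something like the following in the Zen section of param.h (a sketch against 0.3.8; the cited line numbers and surrounding macros may differ in other versions):

```c
/* param.h, Zen section (sketch): enlarge the SGEMM blocking so the packed
   panel of B better exploits Zen 2's 16 MiB-per-CCX L3. */
#define SGEMM_DEFAULT_P 640
#define SGEMM_DEFAULT_Q 448
```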

marxin commented 4 years ago

@marxin I see it's about parallel performance with >4 threads.

Note that my spreadsheet only contains results for single-threaded runs. I haven't had time to run parallel tests yet; I'm planning to do that.

Most likely the changed settings in "param.h" played a role. I didn't realize this since I have never had chance to test SGEMM on EPYC CPUs. Could you try with SGEMM_DEFAULT_P = 640 and SGEMM_DEFAULT_Q = 448 (or even larger) (modify line 669,675 and 678 in param.h and recompile OpenBLAS 0.3.8)?

Yes, I will test the suggested changes.

marxin commented 4 years ago

@marxin If you have confirmed significant performance drop of SGEMM (especially in serial execution with dimensions > 4000) on zen/zen+ chips after PR #2361 , then you can try to specify different SGEMM kernels for zen and zen2 (probably by editing "KERNEL.ZEN" & "param.h" and modifying CPU detection codes, to choose "sgemm_kernel_16x4_haswell.S" for zen/zen+ and "sgemm_kernel_8x4_haswell.c" for zen2) and make it a PR.

Ok, I've just made a minimal reversion of #2361 which restores speed on znver1 and it also helps on znver2. Let's discuss that in #2430.

marxin commented 4 years ago

I believe the speed drops in xDOT post 0.3.6 might be due to #1965 if they are not just an artefact. If I read your table correctly, your figures for DSDOT/SDSDOT are even worse than for ZDOT, and they definitely did not receive any changes except that fix for undeclared clobbers. (Possibly the compiler was able to apply some dangerous optimizations before).

I've just re-run that locally and I can't reproduce the slower numbers on the current develop branch.

martin-frbg commented 4 years ago

Perhaps with Ryzen vs. EPYC we are introducing some other variable besides znver1/znver2, even when running on a single core? Unfortunately I cannot run benchmarks on my 2700K in the next few days (and I remember it was not easy to force it to run at a fixed core frequency with actually reproducible speeds).

MigMuc commented 4 years ago

I did the benchmark given above with my new Ryzen 7 3700X. I set the CPU frequency to 3.6 GHz (verified with zenmonitor) switching off any Turbo Core boost or Pecision Boost Overdrive settings in the BIOS. I have installed 2x8 GB RAM @ 3200 MHz. The results for the last 3 releases of OpenBLAS are given in the spreadsheet. OpenBLAS-AMD_Ryzen_R7_3700X_3600MHz.xlsx I can confirm that with the releases before v0.3.8 the SGEMM is slightly faster than in the current release.