Question: How to get good AMD CPU results?

Hi,

I REALLY like this benchmark.

So much so that I plan to (most likely) use its results to make roofline plots in an upcoming paper (I will cite it as shown in the README).

However, I am having issues getting proper results on AMD CPUs.

I have seen that AMD dropped all official OpenCL support for their CPUs.

I am able to still run the benchmark if I load the Intel OneAPI environment, but I get funky CPU info and the results do not seem right compared to other similar Intel CPUs.

For example, on an EPYC 7742 dual-socket system, it only detects one of the CPUs and says:

.-----------------------------------------------------------------------------.
|----------------.------------------------------------------------------------|
| Device ID    0 | AMD EPYC 7742 64-Core Processor                            |
|----------------'------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID      | 0                                                          |
| Device Name    | AMD EPYC 7742 64-Core Processor                            |
| Device Vendor  | Intel(R) Corporation                                       |
| Device Driver  | 2024.18.6.0.02_160000 (Linux)                              |
| OpenCL Version | OpenCL C 3.0                                               |
| Compute Units  | 64 at 0 MHz (32 cores, 0.000 TFLOPs/s)                     |
| Memory, Cache  | 127842 MB, 512 KB global / 32 KB local                     |
| Buffer Limits  | 63921 MB global, 128 KB constant                           |

The 0 MHz is concerning.

Then, the results seem quite a bit slower than they should be: FP64 compute 0.022 TFLOPs/s (1/64)

For comparison, on the EPYC 7702P (a slower CPU) with the Ubuntu OpenCL runtime I get: | FP64 compute 1.111 TFLOPs/s (1/64) | but it still reports 0 MHz in the info.

I really like the suggestions for installing the OpenCL runtime that the compilation prints, but on the supercomputer I cannot install those packages to try the open-source OpenCL. Are there pre-built OpenCL runtime binaries that I could point to that work well on AMD CPUs?

Is there a way to fix the CPU identification so that it reports AMD rather than Intel and shows the correct clock frequency in MHz?

Thanks!

Hi @sumseq,

The 0 MHz is purely cosmetic. The Intel CPU Runtime for OpenCL internally uses a lookup table to report CL_DEVICE_MAX_CLOCK_FREQUENCY, and for AMD CPUs there is simply no data in there.
The Intel(R) Corporation string returned by CL_DEVICE_VENDOR is also purely cosmetic.
Both 64-core CPUs should be detected on a dual-socket system and show up as a single OpenCL device with 256 compute units (2 CPUs × 2 threads/core × 64 cores). Check that your Slurm reservation allocates the full node with both CPUs, and check whether SMT is enabled. Don't forget the `--exclusive` flag for the Slurm reservation.
```
srun --nodes=1 --exclusive --time=01:00:00 --pty bash
```
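To check the allocation from inside the job, here is a minimal Python sketch. It assumes Linux with Python 3 on the compute node; the function names are illustrative, and the 2 × 64 × 2 topology is simply the dual EPYC 7742 case discussed above. It compares the logical CPUs visible to the process with the expected compute-unit count:

```python
import os


def expected_compute_units(sockets: int, cores_per_socket: int, threads_per_core: int) -> int:
    """Logical CPUs that should appear as compute units for a full node."""
    return sockets * cores_per_socket * threads_per_core


def visible_logical_cpus() -> int:
    """Logical CPUs this process may run on (honors the CPU affinity mask on Linux)."""
    if hasattr(os, "sched_getaffinity"):
        return len(os.sched_getaffinity(0))
    return os.cpu_count() or 0  # fallback for platforms without sched_getaffinity


if __name__ == "__main__":
    # Assumed topology: 2 sockets, 64 cores each, SMT with 2 threads per core.
    expected = expected_compute_units(sockets=2, cores_per_socket=64, threads_per_core=2)
    visible = visible_logical_cpus()
    print(f"expected {expected} logical CPUs, process can use {visible}")
    if visible < expected:
        print("fewer CPUs than expected: partial allocation or SMT disabled?")
```

If the visible count is half of the expected one, either only one socket was allocated or SMT is off; with SMT disabled on a full node, 128 would be the expected value.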
I can reproduce the poor performance behavior on dual EPYC 7302, 7313, and 7352 systems. The kernels are vectorized to AVX2, which is good. Manually turning off vectorization with `export CL_CONFIG_USE_VECTORIZER=false` reduces performance by ~7.9x, so the vectorization is also working as intended.
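The vectorizer comparison can be scripted. This is a rough sketch only: the helper names and the binary path are hypothetical, and it assumes that leaving `CL_CONFIG_USE_VECTORIZER` unset keeps the runtime's default (vectorizer on), which is what the numbers above suggest.

```python
import os
import subprocess


def vectorizer_env(enabled, base=None):
    """Return a copy of the environment with the Intel vectorizer toggled."""
    env = dict(os.environ if base is None else base)
    if enabled:
        # Assumption: an unset variable means the runtime default (vectorizer on).
        env.pop("CL_CONFIG_USE_VECTORIZER", None)
    else:
        env["CL_CONFIG_USE_VECTORIZER"] = "false"
    return env


def run_benchmark(binary, enabled):
    """Run the benchmark binary (path supplied by the caller) and return its stdout."""
    result = subprocess.run(
        [binary],
        env=vectorizer_env(enabled),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

For example, `run_benchmark("./bin/OpenCL-Benchmark", enabled=False)` would produce the non-vectorized numbers, assuming that is where your build places the binary.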
It's possible that there are special optimizations for AMD's microarchitecture that the Intel Runtime does not fully exploit. An alternative here is to use PoCL. On all of the Intel CPUs I've tested, the Intel Runtime is a lot faster than PoCL, and PoCL itself is transitioning from its in-house threading library to Intel TBB, which the Intel Runtime uses. It's possible that on AMD systems PoCL is faster. But all the modern AMD EPYC systems I have access to at university unfortunately don't have PoCL installed, and I don't have sudo permissions, so I can't test whether PoCL is faster. However, in the coming weeks I'll get access to a dual EPYC 9754 system with sudo permissions to test this. I'll keep you updated.

Kind regards, Moritz

Update:

I have access to the 7742 on another supercomputer that seems to have an OpenCL runtime installed.

The benchmark seems to be using the CUDA x86 OpenCL library:

/nasa/nvidia/hpc_sdk/toss4/Linux_x86_64/24.1/cuda/lib64/libOpenCL.so.1 (0x0000145e3d859000)

However, when I try using the CUDA library on the other supercomputer, the benchmark still says it cannot find a device, so I think PoCL is still needed for device identification?
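One likely explanation, offered as an assumption rather than a diagnosis: a `libOpenCL.so` like the one above is typically only an ICD loader, and the devices come from whichever runtimes are registered through `.icd` files. On Linux the usual registry directory is `/etc/OpenCL/vendors`; some loaders (ocl-icd, for example) also honor an `OCL_ICD_VENDORS` environment variable. A minimal sketch to list what is registered on a node (the function name is illustrative):

```python
from pathlib import Path


def list_icd_entries(vendor_dir="/etc/OpenCL/vendors"):
    """Map each *.icd file name to the runtime library path or name it contains."""
    entries = {}
    directory = Path(vendor_dir)
    if not directory.is_dir():
        return entries  # no registry directory: the loader has nothing to enumerate
    for icd_file in sorted(directory.glob("*.icd")):
        entries[icd_file.name] = icd_file.read_text().strip()
    return entries


if __name__ == "__main__":
    found = list_icd_entries()
    if not found:
        print("no .icd files found in the default directory")
    for name, library in found.items():
        print(f"{name}: {library}")
```

If this prints nothing on the machine where the benchmark finds no device, the loader is present but no CPU runtime is registered there.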

Anyways, I get the following result on the machine that worked:

|----------------.------------------------------------------------------------|
| Device ID    0 | AMD EPYC 7742 64-Core Processor                            |
| Device ID    1 | Intel(R) FPGA Emulation Device                             |
|----------------'------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID      | 0                                                          |
| Device Name    | AMD EPYC 7742 64-Core Processor                            |
| Device Vendor  | Intel(R) Corporation                                       |
| Device Driver  | 2023.16.7.0.21_160000 (Linux)                              |
| OpenCL Version | OpenCL C 3.0                                               |
| Compute Units  | 128 at 0 MHz (64 cores, 0.000 TFLOPs/s)                    |
| Memory, Cache  | 515280 MB, 512 KB global / 32 KB local                     |
| Buffer Limits  | 257640 MB global, 128 KB constant                          |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled.                                  |
| FP64  compute                                         1.370 TFLOPs/s (1/64) |
| FP32  compute                                         1.379 TFLOPs/s (1/64) |
| FP16  compute                                          not supported        |
| INT64 compute                                         0.101  TIOPs/s (1/64) |
| INT32 compute                                         1.541  TIOPs/s (1/64) |
| INT16 compute                                         2.892  TIOPs/s (1/64) |
| INT8  compute                                         2.848  TIOPs/s (1/64) |
| Memory Bandwidth ( coalesced read      )                         14.36 GB/s |
| Memory Bandwidth ( coalesced      write)                         17.94 GB/s |
| Memory Bandwidth (misaligned read      )                         33.05 GB/s |
| Memory Bandwidth (misaligned      write)                         20.66 GB/s |
| PCIe   Bandwidth (send                 )                         16.30 GB/s |
| PCIe   Bandwidth (   receive           )                         17.46 GB/s |
| PCIe   Bandwidth (        bidirectional)            (Gen4 x16)    9.14 GB/s |
|-----------------------------------------------------------------------------|

This is on a dual-socket node with hyper-threading disabled.

The results for the "FPGA" device are identical to those above, which makes me think it is the other CPU socket being misidentified?

The TFLOPs look a lot better, but I was expecting more bandwidth (since the peak is 208 GB/s).
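To see what these numbers would mean for a roofline plot, here is a small sketch using only values from the table above (1.370 TFLOPs/s FP64, 33.05 GB/s best measured bandwidth) and the 208 GB/s peak quoted here. The function names are illustrative, and this is the plain two-parameter roofline model, not anything the benchmark itself computes:

```python
def attainable_gflops(intensity_flop_per_byte, peak_gflops, bandwidth_gb_s):
    """Roofline model: performance is capped by compute or by memory traffic."""
    return min(peak_gflops, bandwidth_gb_s * intensity_flop_per_byte)


def ridge_point(peak_gflops, bandwidth_gb_s):
    """Arithmetic intensity (FLOP/byte) above which a kernel becomes compute-bound."""
    return peak_gflops / bandwidth_gb_s


PEAK_FP64_GFLOPS = 1370.0           # measured FP64 compute from the table above
MEASURED_BANDWIDTH_GB_S = 33.05     # best measured value (misaligned read)
QUOTED_PEAK_BANDWIDTH_GB_S = 208.0  # peak figure quoted in this post

if __name__ == "__main__":
    for bw in (MEASURED_BANDWIDTH_GB_S, QUOTED_PEAK_BANDWIDTH_GB_S):
        print(f"bandwidth {bw:7.2f} GB/s -> ridge at {ridge_point(PEAK_FP64_GFLOPS, bw):.2f} FLOP/byte")
    for intensity in (0.25, 1.0, 4.0, 16.0, 64.0):
        gflops = attainable_gflops(intensity, PEAK_FP64_GFLOPS, MEASURED_BANDWIDTH_GB_S)
        print(f"intensity {intensity:5.2f} FLOP/byte -> {gflops:8.2f} GFLOPs/s")
```

With the measured bandwidth (about 16% of the quoted peak), the ridge point moves far to the right, so a roofline built from these values would classify most kernels as memory-bound.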

I was able to run it with PoCL using a Singularity container. It now detects the CPU correctly, but the results are still not great:

.-----------------------------------------------------------------------------.
|----------------.------------------------------------------------------------|
| Device ID    0 | pthread-AMD EPYC 7742 64-Core Processor                    |
|----------------'------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID      | 0                                                          |
| Device Name    | pthread-AMD EPYC 7742 64-Core Processor                    |
| Device Vendor  | AuthenticAMD                                               |
| Device Driver  | 1.4 (Linux)                                                |
| OpenCL Version | OpenCL C 1.2 pocl                                          |
| Compute Units  | 128 at 2245 MHz (64 cores, 4.598 TFLOPs/s)                 |
| Memory, Cache  | 255437 MB, 16384 KB global / 8192 KB local                 |
| Buffer Limits  | 65536 MB global, 8192 KB constant                          |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled.                                  |
| FP64  compute                                         0.104 TFLOPs/s (1/64) |
| FP32  compute                                         0.105 TFLOPs/s (1/64) |
| FP16  compute                                          not supported        |
| INT64 compute                                         0.199  TIOPs/s (1/24) |
| INT32 compute                                         0.217  TIOPs/s (1/24) |
| INT16 compute                                         0.444  TIOPs/s (1/12) |
| INT8  compute                                         0.741  TIOPs/s (1/8 ) |
| Memory Bandwidth ( coalesced read      )                         16.99 GB/s |
| Memory Bandwidth ( coalesced      write)                         23.69 GB/s |
| Memory Bandwidth (misaligned read      )                         91.84 GB/s |
| Memory Bandwidth (misaligned      write)                         49.60 GB/s |
| PCIe   Bandwidth (send                 )                         16.17 GB/s |
| PCIe   Bandwidth (   receive           )                         13.19 GB/s |
| PCIe   Bandwidth (        bidirectional)            (Gen4 x16)   14.69 GB/s |
|-----------------------------------------------------------------------------|
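For a side-by-side comparison of the two runs, the result rows can be parsed from the pasted output. The sketch below is an assumption-laden helper, not part of the benchmark: the regular expression is fitted to the row layout shown in this thread, and the sample strings are copied from the two tables above.

```python
import re

# Matches result rows such as "| FP64  compute      1.370 TFLOPs/s (1/64) |".
ROW = re.compile(
    r"^\|\s*(?P<label>.+?)\s{2,}(?P<value>\d+(?:\.\d+)?)\s+(?P<unit>TFLOPs/s|TIOPs/s|GB/s)"
)


def parse_results(text):
    """Return {label: (value, unit)} for every result row found in the text."""
    results = {}
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:
            label = " ".join(match.group("label").split())  # collapse padding
            results[label] = (float(match.group("value")), match.group("unit"))
    return results


INTEL_SAMPLE = """
| FP64  compute                                         1.370 TFLOPs/s (1/64) |
| Memory Bandwidth (misaligned read      )                         33.05 GB/s |
"""
POCL_SAMPLE = """
| FP64  compute                                         0.104 TFLOPs/s (1/64) |
| Memory Bandwidth (misaligned read      )                         91.84 GB/s |
"""

if __name__ == "__main__":
    intel = parse_results(INTEL_SAMPLE)
    pocl = parse_results(POCL_SAMPLE)
    for label in intel:
        ratio = intel[label][0] / pocl[label][0]
        print(f"{label}: Intel runtime / PoCL = {ratio:.2f}x")
```

On these two rows the Intel runtime is roughly 13x faster in FP64 compute, while the PoCL run shows the higher misaligned-read bandwidth.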

Hi @sumseq,

I've tested a 2x EPYC 9754 system today. The Intel CPU Runtime for OpenCL is way faster than PoCL on this system too.

Kind regards, Moritz

ProjectPhysX / OpenCL-Benchmark
