ProjectPhysX / OpenCL-Benchmark

A small OpenCL benchmark program to measure peak GPU/CPU performance.

Question: How to get good AMD CPU results? #14

Open sumseq opened 3 months ago

sumseq commented 3 months ago

Hi,

I REALLY like this benchmark.

So much so that I will most likely use its results to make roofline plots in an upcoming paper (I will cite it as shown in the README).

However, I am having issues getting proper results on AMD CPUs.

I have seen that AMD dropped all official OpenCL support for their CPUs.

I am able to still run the benchmark if I load the Intel OneAPI environment, but I get funky CPU info and the results do not seem right compared to other similar Intel CPUs.

For example, on an EPYC 7742 dual-socket system, it only detects one of the CPUs and says:

.-----------------------------------------------------------------------------.
|----------------.------------------------------------------------------------|
| Device ID    0 | AMD EPYC 7742 64-Core Processor                            |
|----------------'------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID      | 0                                                          |
| Device Name    | AMD EPYC 7742 64-Core Processor                            |
| Device Vendor  | Intel(R) Corporation                                       |
| Device Driver  | 2024.18.6.0.02_160000 (Linux)                              |
| OpenCL Version | OpenCL C 3.0                                               |
| Compute Units  | 64 at 0 MHz (32 cores, 0.000 TFLOPs/s)                     |
| Memory, Cache  | 127842 MB, 512 KB global / 32 KB local                     |
| Buffer Limits  | 63921 MB global, 128 KB constant                          |

The 0 MHz is concerning.

The results also seem quite a bit slower than they should be: FP64 compute is 0.022 TFLOPs/s (1/64).

For comparison, on the EPYC 7702P (a slower CPU) with the Ubuntu OpenCL runtime I get FP64 compute of 1.111 TFLOPs/s (1/64), but it still reports 0 MHz in the device info.

I really like the suggestions for installing the OpenCL runtime that the compilation prints, but on the supercomputer I cannot install those packages to try the open-source OpenCL runtime. Are there pre-built OpenCL runtime binaries I could point to that work well on AMD CPUs?

Is there a way to fix the CPU identification so that it knows the vendor is AMD, not Intel, and reports the correct MHz?

Thanks!

ProjectPhysX commented 3 months ago

Hi @sumseq,

Kind regards, Moritz

sumseq commented 3 months ago

Update:

I have access to the 7742 on another supercomputer that seems to have an OpenCL runtime installed.

The benchmark seems to be using the CUDA x86 OpenCL library:

/nasa/nvidia/hpc_sdk/toss4/Linux_x86_64/24.1/cuda/lib64/libOpenCL.so.1 (0x0000145e3d859000)

However, when I try using the CUDA library on the other supercomputer, the benchmark still says it cannot find the device, so I think PoCL is still needed for device identification?

Anyway, I get the following result on the machine that worked:
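
A possible explanation, offered as an assumption rather than something confirmed in this thread: `libOpenCL.so.1` is normally only an ICD loader, and the actual devices come from vendor runtimes registered through `.icd` files, conventionally under `/etc/OpenCL/vendors` on Linux. If that is the case here, the loader shipped with the CUDA SDK would find no CPU device on a machine where no CPU runtime is registered. The Python sketch below lists the registrations a loader would see. The helper name `list_icd_registrations` is made up for illustration, and the `OCL_ICD_VENDORS` override is honored by the ocl-icd loader but not necessarily by other loaders.

```python
import os


def list_icd_registrations(vendors_dir="/etc/OpenCL/vendors"):
    """Map each *.icd file in vendors_dir to the library name or path it contains.

    Hypothetical helper: it only reads the registration files and does not
    load any OpenCL runtime.
    """
    registrations = {}
    if not os.path.isdir(vendors_dir):
        return registrations
    for name in sorted(os.listdir(vendors_dir)):
        if not name.endswith(".icd"):
            continue
        with open(os.path.join(vendors_dir, name)) as f:
            # An .icd file is plain text; its first line names the runtime library.
            lines = f.read().splitlines()
        registrations[name] = lines[0].strip() if lines else ""
    return registrations


if __name__ == "__main__":
    # Assumption: a directory given in OCL_ICD_VENDORS replaces the default location.
    vendors_dir = os.environ.get("OCL_ICD_VENDORS", "/etc/OpenCL/vendors")
    found = list_icd_registrations(vendors_dir)
    if not found:
        print(f"no .icd registrations found in {vendors_dir}")
    for icd_file, library in found.items():
        print(f"{icd_file}: {library}")
```

If the listing is empty on the machine where the benchmark finds no device, a CPU runtime such as PoCL or the Intel CPU Runtime would still have to be made visible to the loader.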

|----------------.------------------------------------------------------------|
| Device ID    0 | AMD EPYC 7742 64-Core Processor                            |
| Device ID    1 | Intel(R) FPGA Emulation Device                             |
|----------------'------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID      | 0                                                          |
| Device Name    | AMD EPYC 7742 64-Core Processor                            |
| Device Vendor  | Intel(R) Corporation                                       |
| Device Driver  | 2023.16.7.0.21_160000 (Linux)                              |
| OpenCL Version | OpenCL C 3.0                                               |
| Compute Units  | 128 at 0 MHz (64 cores, 0.000 TFLOPs/s)                    |
| Memory, Cache  | 515280 MB, 512 KB global / 32 KB local                     |
| Buffer Limits  | 257640 MB global, 128 KB constant                          |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled.                                  |
| FP64  compute                                         1.370 TFLOPs/s (1/64) |
| FP32  compute                                         1.379 TFLOPs/s (1/64) |
| FP16  compute                                          not supported        |
| INT64 compute                                         0.101  TIOPs/s (1/64) |
| INT32 compute                                         1.541  TIOPs/s (1/64) |
| INT16 compute                                         2.892  TIOPs/s (1/64) |
| INT8  compute                                         2.848  TIOPs/s (1/64) |
| Memory Bandwidth ( coalesced read      )                         14.36 GB/s |
| Memory Bandwidth ( coalesced      write)                         17.94 GB/s |
| Memory Bandwidth (misaligned read      )                         33.05 GB/s |
| Memory Bandwidth (misaligned      write)                         20.66 GB/s |
| PCIe   Bandwidth (send                 )                         16.30 GB/s |
| PCIe   Bandwidth (   receive           )                         17.46 GB/s |
| PCIe   Bandwidth (        bidirectional)            (Gen4 x16)    9.14 GB/s |
|-----------------------------------------------------------------------------|

This is on a dual-socket node with hyper-threading disabled.

The results for the "FPGA" device are identical to those above, which makes me think it is the other CPU socket being misidentified?

The TFLOPs look a lot better but I was expecting more bandwidth (since the peak is 208 GB/s).
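
Since the stated goal is a roofline plot, here is a minimal sketch of how the numbers above would feed into one, using the standard roofline relation: attainable performance is the smaller of peak compute and bandwidth times arithmetic intensity. The function names are illustrative. The inputs are the measured 1.370 TFLOPs/s FP64, the best measured bandwidth of 33.05 GB/s, and the 208 GB/s peak quoted above.

```python
def attainable_tflops(intensity_flop_per_byte, peak_tflops, bandwidth_gb_s):
    """Roofline model: min(compute roof, memory roof at the given intensity)."""
    # GB/s * FLOP/byte = GFLOPs/s; divide by 1000 to get TFLOPs/s.
    memory_roof_tflops = bandwidth_gb_s * intensity_flop_per_byte / 1000.0
    return min(peak_tflops, memory_roof_tflops)


def ridge_point(peak_tflops, bandwidth_gb_s):
    """Arithmetic intensity (FLOP/byte) at which the two roofs meet."""
    return peak_tflops * 1000.0 / bandwidth_gb_s


if __name__ == "__main__":
    measured_fp64 = 1.370   # TFLOPs/s, from the benchmark output above
    measured_bw = 33.05     # GB/s, best measured bandwidth above
    quoted_peak_bw = 208.0  # GB/s, the peak figure quoted in this comment

    print(f"ridge point, measured bandwidth:    {ridge_point(measured_fp64, measured_bw):.2f} FLOP/byte")
    print(f"ridge point, quoted peak bandwidth: {ridge_point(measured_fp64, quoted_peak_bw):.2f} FLOP/byte")
    print(f"measured / quoted peak bandwidth:   {measured_bw / quoted_peak_bw:.1%}")
    for intensity in (0.25, 1.0, 10.0, 100.0):
        roof = attainable_tflops(intensity, measured_fp64, measured_bw)
        print(f"intensity {intensity:6.2f} FLOP/byte -> {roof:.4f} TFLOPs/s")
```

With the measured bandwidth the ridge point sits far to the right of where it would be with the quoted peak, so a roofline drawn from these measurements would classify many more kernels as memory-bound.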

sumseq commented 3 months ago

I was able to run it with PoCL using a Singularity container. It now detects the CPU correctly, but the results are still not great:

.-----------------------------------------------------------------------------.
|----------------.------------------------------------------------------------|
| Device ID    0 | pthread-AMD EPYC 7742 64-Core Processor                    |
|----------------'------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID      | 0                                                          |
| Device Name    | pthread-AMD EPYC 7742 64-Core Processor                    |
| Device Vendor  | AuthenticAMD                                               |
| Device Driver  | 1.4 (Linux)                                                |
| OpenCL Version | OpenCL C 1.2 pocl                                          |
| Compute Units  | 128 at 2245 MHz (64 cores, 4.598 TFLOPs/s)                 |
| Memory, Cache  | 255437 MB, 16384 KB global / 8192 KB local                 |
| Buffer Limits  | 65536 MB global, 8192 KB constant                          |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled.                                  |
| FP64  compute                                         0.104 TFLOPs/s (1/64) |
| FP32  compute                                         0.105 TFLOPs/s (1/64) |
| FP16  compute                                          not supported        |
| INT64 compute                                         0.199  TIOPs/s (1/24) |
| INT32 compute                                         0.217  TIOPs/s (1/24) |
| INT16 compute                                         0.444  TIOPs/s (1/12) |
| INT8  compute                                         0.741  TIOPs/s (1/8 ) |
| Memory Bandwidth ( coalesced read      )                         16.99 GB/s |
| Memory Bandwidth ( coalesced      write)                         23.69 GB/s |
| Memory Bandwidth (misaligned read      )                         91.84 GB/s |
| Memory Bandwidth (misaligned      write)                         49.60 GB/s |
| PCIe   Bandwidth (send                 )                         16.17 GB/s |
| PCIe   Bandwidth (   receive           )                         13.19 GB/s |
| PCIe   Bandwidth (        bidirectional)            (Gen4 x16)   14.69 GB/s |
|-----------------------------------------------------------------------------|
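
The "Compute Units" line in the PoCL output reports 64 cores at 2245 MHz and 4.598 TFLOPs/s, which is consistent with cores × 32 × clock. The factor of 32 operations per cycle per core is inferred from these numbers and is an assumption, not something taken from the benchmark's source. Under that assumption, the short Python sketch below reproduces the figure and shows why a reported clock of 0 MHz yields an estimate of 0.000 TFLOPs/s in the earlier outputs.

```python
def estimated_peak_tflops(cores, clock_mhz, flops_per_cycle=32):
    """Theoretical peak estimate: cores * FLOPs per cycle * clock.

    flops_per_cycle=32 is an assumption inferred from the output in this thread.
    """
    # cores * FLOPs/cycle * MHz gives MFLOPs/s; 1e-6 converts to TFLOPs/s.
    return cores * flops_per_cycle * clock_mhz * 1e-6


if __name__ == "__main__":
    pocl_estimate = estimated_peak_tflops(cores=64, clock_mhz=2245)
    zero_clock_estimate = estimated_peak_tflops(cores=64, clock_mhz=0)
    print(f"PoCL device info:      {pocl_estimate:.3f} TFLOPs/s")
    print(f"runtime reports 0 MHz: {zero_clock_estimate:.3f} TFLOPs/s")

    # Measured FP64 results from this thread, as a fraction of the estimate.
    for label, measured in (("PoCL", 0.104), ("Intel runtime", 1.370)):
        print(f"{label}: {measured / pocl_estimate:.1%} of the estimated peak")
```

Read this way, the 0 MHz only zeroes the estimate in the device header; the measured compute and bandwidth rows do not depend on it.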

ProjectPhysX commented 2 months ago

Hi @sumseq,

I've tested a 2x EPYC 9754 system today. The Intel CPU Runtime for OpenCL is way faster than PoCL on this system too.

Kind regards, Moritz