Open sumseq opened 3 months ago
Hi @sumseq,
0 MHz is just cosmetic information. The Intel CPU Runtime for OpenCL internally uses a lookup table to report CL_DEVICE_MAX_CLOCK_FREQUENCY, and for AMD CPUs there is simply no data in it.
Intel(R) Corporation returned by CL_DEVICE_VENDOR is also purely cosmetic.
--exclusive flag for Slurm reservation.
srun --nodes=1 --exclusive --time=01:00:00 --pty bash
export CL_CONFIG_USE_VECTORIZER=false
reduces performance by ~7.9x, so the vectorization is also working as intended.
sudo permissions, so I can't test if PoCL is faster. However, in the coming weeks I'll get access to a dual EPYC 9754 system with sudo permissions to test this. I'll keep you updated.
Kind regards, Moritz
Update:
I have access to the 7742 on another supercomputer that seems to have an OpenCL runtime installed.
The benchmark seems to be using the CUDA x86 OpenCL library:
/nasa/nvidia/hpc_sdk/toss4/Linux_x86_64/24.1/cuda/lib64/libOpenCL.so.1 (0x0000145e3d859000)
However, when I try using the CUDA library on the other supercomputer, the benchmark still says it cannot find the device, so I think PoCL is still needed for device identification?
Anyway, I get the following result on the machine that worked:
|----------------.------------------------------------------------------------|
| Device ID 0 | AMD EPYC 7742 64-Core Processor |
| Device ID 1 | Intel(R) FPGA Emulation Device |
|----------------'------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID | 0 |
| Device Name | AMD EPYC 7742 64-Core Processor |
| Device Vendor | Intel(R) Corporation |
| Device Driver | 2023.16.7.0.21_160000 (Linux) |
| OpenCL Version | OpenCL C 3.0 |
| Compute Units | 128 at 0 MHz (64 cores, 0.000 TFLOPs/s) |
| Memory, Cache | 515280 MB, 512 KB global / 32 KB local |
| Buffer Limits | 257640 MB global, 128 KB constant |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled. |
| FP64 compute 1.370 TFLOPs/s (1/64) |
| FP32 compute 1.379 TFLOPs/s (1/64) |
| FP16 compute not supported |
| INT64 compute 0.101 TIOPs/s (1/64) |
| INT32 compute 1.541 TIOPs/s (1/64) |
| INT16 compute 2.892 TIOPs/s (1/64) |
| INT8 compute 2.848 TIOPs/s (1/64) |
| Memory Bandwidth ( coalesced read ) 14.36 GB/s |
| Memory Bandwidth ( coalesced write) 17.94 GB/s |
| Memory Bandwidth (misaligned read ) 33.05 GB/s |
| Memory Bandwidth (misaligned write) 20.66 GB/s |
| PCIe Bandwidth (send ) 16.30 GB/s |
| PCIe Bandwidth ( receive ) 17.46 GB/s |
| PCIe Bandwidth ( bidirectional) (Gen4 x16) 9.14 GB/s |
|-----------------------------------------------------------------------------|
This is on a dual-socket node with hyper-threading disabled.
The results for the "FPGA" device are identical to those above, so I suspect it is the other CPU socket being misidentified.
The TFLOPs look a lot better but I was expecting more bandwidth (since the peak is 208 GB/s).
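To put the bandwidth numbers in proportion, here is a minimal sketch that compares the measured values from the table above against the 208 GB/s peak quoted in this post. The function name `fraction_of_peak` is made up for illustration, and the peak value is taken from the post as-is, not independently verified.

```python
# Peak memory bandwidth as quoted in the post (GB/s); not independently verified.
PEAK_GBPS = 208.0

# Measured values copied from the Intel CPU Runtime results table above (GB/s).
measured = {
    "coalesced read": 14.36,
    "coalesced write": 17.94,
    "misaligned read": 33.05,
    "misaligned write": 20.66,
}


def fraction_of_peak(gbps, peak=PEAK_GBPS):
    """Return the measured bandwidth as a fraction of the quoted peak."""
    return gbps / peak


for name, gbps in measured.items():
    percent = 100.0 * fraction_of_peak(gbps)
    print(f"{name:>16}: {gbps:6.2f} GB/s = {percent:4.1f}% of peak")
```

Even the best case here (misaligned read) stays well below a fifth of the quoted peak, which is the gap I am asking about.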
I was able to run it with PoCL using a Singularity container. It now detects the CPU correctly, but the results are still not great:
.-----------------------------------------------------------------------------.
|----------------.------------------------------------------------------------|
| Device ID 0 | pthread-AMD EPYC 7742 64-Core Processor |
|----------------'------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID | 0 |
| Device Name | pthread-AMD EPYC 7742 64-Core Processor |
| Device Vendor | AuthenticAMD |
| Device Driver | 1.4 (Linux) |
| OpenCL Version | OpenCL C 1.2 pocl |
| Compute Units | 128 at 2245 MHz (64 cores, 4.598 TFLOPs/s) |
| Memory, Cache | 255437 MB, 16384 KB global / 8192 KB local |
| Buffer Limits | 65536 MB global, 8192 KB constant |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled. |
| FP64 compute 0.104 TFLOPs/s (1/64) |
| FP32 compute 0.105 TFLOPs/s (1/64) |
| FP16 compute not supported |
| INT64 compute 0.199 TIOPs/s (1/24) |
| INT32 compute 0.217 TIOPs/s (1/24) |
| INT16 compute 0.444 TIOPs/s (1/12) |
| INT8 compute 0.741 TIOPs/s (1/8 ) |
| Memory Bandwidth ( coalesced read ) 16.99 GB/s |
| Memory Bandwidth ( coalesced write) 23.69 GB/s |
| Memory Bandwidth (misaligned read ) 91.84 GB/s |
| Memory Bandwidth (misaligned write) 49.60 GB/s |
| PCIe Bandwidth (send ) 16.17 GB/s |
| PCIe Bandwidth ( receive ) 13.19 GB/s |
| PCIe Bandwidth ( bidirectional) (Gen4 x16) 14.69 GB/s |
|-----------------------------------------------------------------------------|
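The theoretical TFLOPs/s in the "Compute Units" line appears to follow directly from the core count and the reported clock, which would explain why a 0 MHz entry yields 0.000 TFLOPs/s. The sketch below is an assumption inferred from the printed numbers, not taken from the benchmark source: the factor of 32 FLOPs per cycle per core is chosen because it reproduces the 4.598 TFLOPs/s shown for 64 cores at 2245 MHz, and `peak_tflops` is a hypothetical helper name.

```python
# Assumed FLOPs per cycle per core; inferred from the PoCL table above,
# not confirmed against the benchmark source code.
ASSUMED_FLOPS_PER_CYCLE = 32


def peak_tflops(cores, clock_mhz, flops_per_cycle=ASSUMED_FLOPS_PER_CYCLE):
    """Theoretical peak in TFLOPs/s from core count and clock in MHz."""
    # cores * MHz * FLOPs/cycle gives MFLOPs/s; divide by 1e6 for TFLOPs/s.
    return cores * clock_mhz * flops_per_cycle / 1e6


# PoCL reports a real clock, so the estimate is non-zero.
print(f"PoCL:  {peak_tflops(64, 2245):.3f} TFLOPs/s")

# The Intel CPU Runtime reports 0 MHz on this CPU, so the estimate collapses to zero.
print(f"Intel: {peak_tflops(64, 0):.3f} TFLOPs/s")
```

Under this assumption the 0.000 TFLOPs/s figure only reflects the missing clock entry and says nothing about the measured compute results further down the table.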
Hi @sumseq,
I've tested a 2x EPYC 9754 system today. The Intel CPU Runtime for OpenCL is way faster than PoCL on this system too.
Kind regards, Moritz
Hi,
I REALLY like this benchmark.
So much so that I plan to (most likely) use its results to make roofline plots in an upcoming paper (I will cite it as shown in README).
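For context, the roofline model bounds attainable performance by the smaller of peak compute and memory bandwidth times arithmetic intensity. The sketch below is a minimal illustration under stated assumptions: the function names are hypothetical, and the inputs (1.370 TFLOPs/s FP64 and 33.05 GB/s) are placeholder values taken from one of the result tables in this thread, not validated peaks.

```python
def attainable_tflops(intensity, peak_tflops, bandwidth_gbps):
    """Roofline bound in TFLOPs/s for a kernel with the given
    arithmetic intensity in FLOP/byte."""
    # GB/s * FLOP/byte = GFLOPs/s; divide by 1000 to get TFLOPs/s.
    memory_bound = bandwidth_gbps * intensity / 1000.0
    return min(peak_tflops, memory_bound)


def ridge_point(peak_tflops, bandwidth_gbps):
    """Arithmetic intensity (FLOP/byte) where the two bounds meet."""
    return peak_tflops * 1000.0 / bandwidth_gbps


# Placeholder inputs from the benchmark output in this thread.
PEAK_FP64_TFLOPS = 1.370
BANDWIDTH_GBPS = 33.05

print(f"ridge point: {ridge_point(PEAK_FP64_TFLOPS, BANDWIDTH_GBPS):.2f} FLOP/byte")
for intensity in (0.25, 1.0, 4.0, 16.0, 64.0):
    bound = attainable_tflops(intensity, PEAK_FP64_TFLOPS, BANDWIDTH_GBPS)
    print(f"intensity {intensity:6.2f} FLOP/byte -> {bound:.4f} TFLOPs/s")
```

Both inputs come straight from the benchmark, which is why getting trustworthy compute and bandwidth numbers on the AMD CPUs matters for the plots.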
However, I am having issues getting proper results on AMD CPUs.
I have seen that AMD dropped all official OpenCL support for their CPUs.
I can still run the benchmark if I load the Intel oneAPI environment, but I get odd CPU info and the results do not seem right compared to similar Intel CPUs.
For example, on an EPYC 7742 dual-socket system, it only detects one of the CPUs and says:
The 0 MHz is concerning.
Then, the results seem quite a bit slower than they should be: FP64 compute 0.022 TFLOPs/s (1/64)
For example, on the EPYC 7702P (a slower CPU) with the Ubuntu OpenCL runtime I get FP64 compute 1.111 TFLOPs/s (1/64), but it still reports 0 MHz in the info.
I really like the suggestions for installing the OpenCL runtime that the compilation prints, but on the supercomputer I cannot install those packages to try the open-source OpenCL. Are there pre-built OpenCL runtime binaries that I could point to that work well on AMD CPUs?
Is there a way to fix the CPU identification so that it reports AMD rather than Intel and shows the correct MHz?
Thanks!