amd / openmm-hip


RDNA3 not being utilised to its full potential #5

Open muziqaz opened 1 year ago

muziqaz commented 1 year ago

Hi, I'm nearly done testing all of my AMD GPUs, comparing them between the OpenCL and HIP environments, and today it was the 7900xtx's turn. Here are the results and a comparison vs the 6900xt:

6900xt MBA (23.04TFLOPS) | OpenCL (ns/day) | HIP (ns/day) | Diff.%
-- | -- | -- | --
gbsa | 967.65 | 1644.17 | 69.91%
rf | 831.189 | 1410.187 | 69.66%
pme | 398.526 | 1046.064 | 162.48%
apoa1rf | 342.568 | 505.241 | 47.49%
apoa1pme | 183.49 | 381.036 | 107.66%
apoa1ljpme | 127.05 | 300.048 | 136.17%
amoebagk | 2.4 | 37.444 | 1460.17%
amoebapme | 12.021 | 16.261 | 35.27%

7900xtx Nitro+ (61+TFLOPS) | OpenCL (ns/day) | HIP (ns/day) | Diff.% | HIP (6900xt/7900xtx)
-- | -- | -- | -- | --
gbsa | 1075.82 | 1812.23 | 68.45% | 10.22%
rf | 912.438 | 1503.63 | 64.79% | 6.63%
pme | 415.988 | 1103.77 | 165.34% | 5.52%
apoa1rf | 437.261 | 645.718 | 47.67% | 27.8%
apoa1pme | 231.098 | 521 | 125.45% | 36.73%
apoa1ljpme | 164.816 | 400.924 | 143.26% | 33.62%
amoebagk | 4.22695 | 42.958 | 916.29% | 14.73%
amoebapme | 17.0797 | 23.0998 | 35.25% | 42.06%
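For reference, the Diff.% columns appear to be plain relative gains in ns/day. A minimal Python sketch of that arithmetic, using a few values copied from the tables (the helper name `percent_gain` is only for illustration and is not part of OpenMM):

```python
def percent_gain(baseline, improved):
    """Relative gain of `improved` over `baseline`, in percent."""
    return (improved / baseline - 1.0) * 100.0


# ns/day values copied from the tables (subset of the benchmarks).
opencl_6900xt = {"gbsa": 967.65, "pme": 398.526, "apoa1pme": 183.49}
hip_6900xt = {"gbsa": 1644.17, "pme": 1046.064, "apoa1pme": 381.036}
hip_7900xtx = {"gbsa": 1812.23, "pme": 1103.77, "apoa1pme": 521.0}

for name in hip_6900xt:
    # HIP vs OpenCL on the same card (the "Diff.%" column).
    backend_gain = percent_gain(opencl_6900xt[name], hip_6900xt[name])
    # 7900xtx vs 6900xt, both on HIP (the last column of the second table).
    card_gain = percent_gain(hip_6900xt[name], hip_7900xtx[name])
    print(f"{name}: HIP over OpenCL {backend_gain:.2f}%, "
          f"7900xtx over 6900xt {card_gain:.2f}%")
```

Laid out this way, the switch from OpenCL to HIP is a much larger gain than the switch from the 6900xt to the 7900xtx.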

Not much of an improvement over the 6900xt. I'll try to get AMD's attention on this. I will post the rest of the GPU test results in the other hip/openmm area, most likely on Monday. The conda env build was standard. I have no knowledge of how to play around with FFT backends, but I think that wouldn't change the outcome too much compared to VkFFT.

ex-rzr commented 1 year ago
  1. Most of these cases are probably too small to utilize it, could you show results for amber20 (amber20-cellulose and amber20-stmv) tests?
  2. There are some recent changes in the repo that are not included in conda package, I wonder how they perform (if you are able to build from sources on your machine).
  3. Are you sure that performance with boost frequency is relevant for comparison? https://www.amd.com/en/products/graphics/amd-radeon-rx-7900xtx says: "Boost Clock Frequency is the maximum frequency achievable on the GPU running a bursty workload. Boost clock achievability, frequency, and sustainability will vary based on several factors, including but not limited to: thermal conditions and variation in applications and workloads. GD-151". Can you check with `watch -n 1 rocm-smi` what the frequency really is for various tests?
ex-rzr commented 1 year ago

I forgot to add:

> Have no knowledge on how to play around with fft backends, but I think that wouldn't change the outcome too much compared to vkfft

VkFFT is the fastest so there is likely no real reason to try other FFT backends, unless you want to (see https://github.com/amd/openmm-hip#fft-backends)

muziqaz commented 1 year ago
  1. The tests would be too small for the 6900xt too, or the Radeon 7 :) OpenCL shows the same behaviour in FAH: I get only slightly better performance than the 6900xt across various workloads, including ones with large atom counts. It seems neither OpenCL nor HIP can utilise the dual-issue "pipe" available in the RDNA3 arch. All the amber attempts failed in all of my tests because a module (scipy) was missing.
  2. Unfortunately, the 7900xtx is back in my Windows system; this was a rare occasion just to complete the tests on all of my GPUs on Linux. The card is a bit of a brick and hard to get in and out of the case. Depending on available free time, I might try compiling a few things, but I'm very rusty in Linux, so reading manuals takes more time than compiling things :D
  3. My card folds at a stable 3 GHz :) Even with clocks at the same level as the 6900xt, RDNA3 should blow it out of the water easily. I bought a Sapphire Nitro+, which is way overbuilt compared to MBA models. It has been suggested that the RDNA3 arch requires specific driver and API level optimisations to expose all the available resources.
ex-rzr commented 1 year ago

> The tests would be too small for 6900xt too, or Radeon 7 :) OpenCL shows same behaviour in FAH

In my experience, only large cases like amber20-cellulose (400k atoms) and amber20-stmv (1M atoms) reflect the relative performance of different GPUs, i.e. performance scales with more compute units, higher frequency, etc. Smaller cases scale worse because the latency of launching kernels, scheduling work groups on the GPU, etc. is sometimes higher than the kernels' actual work.
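A toy model can illustrate why small cases scale worse. The Python sketch below assumes each simulation step costs a fixed overhead (kernel launches, scheduling) plus work proportional to the atom count. All numbers are hypothetical and chosen only to show the shape of the effect; they are not measurements of any GPU.

```python
# Hypothetical parameters, for illustration only.
OVERHEAD_US = 200.0   # fixed per-step cost (launch latency, scheduling), in microseconds
SLOW_TP = 1000.0      # atoms processed per microsecond on the "slower" GPU
FAST_TP = 1500.0      # the "faster" GPU has 1.5x the raw throughput


def step_time_us(atoms, throughput, overhead_us=OVERHEAD_US):
    """Time for one step: fixed overhead plus work proportional to system size."""
    return overhead_us + atoms / throughput


def speedup(atoms, slow_tp=SLOW_TP, fast_tp=FAST_TP, overhead_us=OVERHEAD_US):
    """Observed speedup of the faster GPU over the slower one for a given system size."""
    return step_time_us(atoms, slow_tp, overhead_us) / step_time_us(atoms, fast_tp, overhead_us)


# 20k atoms is a hypothetical small case; 400k and 1M match the sizes quoted
# for amber20-cellulose and amber20-stmv.
for atoms in (20_000, 400_000, 1_000_000):
    print(f"{atoms:>9} atoms: speedup {speedup(atoms):.2f}x (raw throughput ratio is 1.50x)")
```

In this model the observed speedup only approaches the raw throughput ratio as the system grows, because the fixed overhead is the same on both GPUs.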

> All the amber attempts failed on all of my tests due to some modules missing (scipy).

Yeah, this dependency is not installed with openmm automatically because it's used only for these benchmarks. You can install it with `conda install scipy` (or `pip3 install scipy`).

> It seems neither opencl nor HIP can utilise dual issue "pipe" available in RDNA3 arch

I haven't run OpenMM on RDNA3, but I have seen that the HIP compiler generates dual-issue instructions. I just wouldn't expect too much, as not every pair of instructions in every kernel can be encoded that way. If 61+TFLOPS means FMA performance with dual issue, then it is a purely theoretical peak: I doubt that most real OpenMM kernels consist of 100% dual-issue instructions, that's impossible :) (I would not be surprised by <20-30%.) I guess 61+TFLOPS can be achieved for something like matrix-matrix multiplication of really large matrices, because such kernels indeed have a lot of instructions that can benefit from the dual-issue feature.
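To put rough numbers on this, here is a hedged Python sketch of the peak-FLOPS arithmetic. It assumes 64 FP32 lanes per CU and the spec-sheet boost clocks (2.25 GHz for the 6900xt, 2.5 GHz for the 7900xtx); these figures are not stated in this thread, and `dual_issue_fraction` is a hypothetical knob for the share of instructions that actually get paired.

```python
def peak_tflops(compute_units, clock_ghz, dual_issue_fraction=0.0,
                lanes_per_cu=64, flops_per_fma=2):
    """Theoretical FP32 peak in TFLOPS.

    Assumptions (not from the thread): 64 FP32 lanes per CU, and dual issue
    modelled as an issue-rate multiplier of (1 + dual_issue_fraction).
    """
    lanes = compute_units * lanes_per_cu
    issue_rate = 1.0 + dual_issue_fraction
    return lanes * flops_per_fma * issue_rate * clock_ghz / 1000.0


rx6900xt = peak_tflops(80, 2.25)                                    # no dual issue
rx7900xtx_paper = peak_tflops(96, 2.5, dual_issue_fraction=1.0)     # every instruction paired
rx7900xtx_partial = peak_tflops(96, 2.5, dual_issue_fraction=0.25)  # 25% paired

print(f"6900xt peak:                  {rx6900xt:.2f} TFLOPS")
print(f"7900xtx peak, 100% dual issue: {rx7900xtx_paper:.2f} TFLOPS")
print(f"7900xtx peak, 25% dual issue:  {rx7900xtx_partial:.2f} TFLOPS")
```

Under these assumptions the 23.04 and 61+ TFLOPS figures fall out of the same formula, and a kernel that pairs only a quarter of its instructions has a ceiling much closer to the 6900xt than the headline number suggests.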

> Depending on available free time, I might try compiling few things, but I'm very rusty in Linux, so reading manuals takes more time than compiling things

That would be great. OpenMM (and OpenMM-HIP) have quite simple build instructions; I hope they'll work for you without issues.

> Unfortunately, 7900xtx is back in my Windows system

Sad. Anyway, thanks for benchmarking. I hope you'll get a chance to run amber20 tests on this and other GPUs.

> It has been suggested that RDNA3 arch requires specific driver and API level optimisations to expose all the available resources.

I'm not aware of it; do you know any details? For example, dual issue is something the compiler does when generating code. It does generate such code, but I can't say how effectively. Perhaps the suggestion about drivers was for games? Shaders are basically compiled by the driver's compiler, unlike ROCm, where the compiler is a part of the ROCm distribution.

muziqaz commented 1 year ago

I ran through a variety of FAH projects with the 7900xtx, and it is consistently 15-20% faster than the 6900xt. The 6900xt folds at 2.3 GHz or so; the 7900xtx folds at 2.95-3 GHz. The 7900xtx has much higher clocks and also more CUs (96 vs 80), which would be utilised by OpenCL/OpenMM regardless. But then again, the 7900xtx has separate shader clocks (2.2 GHz or something). So the increase we see right now might be due to the CU count going from 80 to 96, which kind of makes sense. But those CUs have more resources in themselves, thus the crazy increase in FLOPS. Even ignoring those FLOPS, the 7900xtx should be much faster than the 6900xt.

I understand that we need large systems for any high-end GPU. NVIDIA has a similar issue, but they worked out their CUDA thingy quite well, and their cards are still crazy fast even with relatively low atom counts. They saw quite a jump in FAH performance going from Turing to Ampere, and then more progress with Ada. Obviously nothing close to what their CEO tells everyone in the slides, but still.

Regarding dual-issue SIMDs, NVIDIA moved to a similar setup with Ampere a few years back. With that arch they have one pipe which does FP32 only, while the other pipe does either FP32 or INT. In FAH, the workload uses both pipes as two FP32 pipes, since FAH doesn't need integer calcs. So I'm thinking it might be similar with RDNA3.
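As a rough sanity check on the numbers in this comment, the Python sketch below compares the observed 15-20% gain with a naive estimate that assumes throughput scales linearly with CU count and clock. That linear assumption is a deliberate simplification, and `naive_speedup` is a hypothetical helper name.

```python
def naive_speedup(cu_old, cu_new, clock_old_ghz, clock_new_ghz):
    """Naive estimate: throughput assumed proportional to CUs x clock.

    Ignores memory bandwidth, launch overhead, and dual issue, so it is
    only a rough reference point, not a prediction.
    """
    return (cu_new / cu_old) * (clock_new_ghz / clock_old_ghz)


# CU counts and folding clocks as quoted in the comment above.
cu_only = naive_speedup(80, 96, 1.0, 1.0)          # CU count alone
cu_and_clock = naive_speedup(80, 96, 2.3, 2.95)    # CU count and clock together

print(f"CU count alone:     {cu_only:.2f}x")
print(f"CU count and clock: {cu_and_clock:.2f}x")
print("Observed in FAH:    1.15x to 1.20x")
```

The observed gain sits at about the CU-only estimate, which is consistent with the guess that the extra CUs account for most of the difference.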

Regardless of that, I saw tremendous perf increase going from opencl to hip. And that is across a lot of AMD GPUs. Hopefully things start moving with HIP Fahcore :)

muziqaz commented 1 year ago

I know that's not 7900xtx, but here is Radeon 7 running amber20:

Radeon 7 | OpenCL | HIP | Diff:%
-- | -- | -- | --
amber20-dhfr | 329.067 | 754.694 | 129.34%
amber20-cellulose | 19.1284 | 55.9463 | 192.48%
amber20-stmv | 5.94076 | 21.1691 | 256.34%

I believe 7900xtx would see similar increase, but it would still be within 20% of 6900xt.

DanielWicz commented 1 year ago

Is this project even alive ?

muziqaz commented 1 year ago

> Is this project even alive ?

As far as I understand, HIP works as a plugin, and those interested can set up openmm/hip environments within conda and build what they want. This is on Linux. On Windows, AMD hasn't released the SDK yet, so nothing can be tested, but hopefully soon. On the Folding@home side, I believe it would be possible to build a fahcore based on HIP OpenMM on Linux, but I think we will hold off until the Windows SDK is out.