Open ProjectPhysX opened 1 year ago
Nvidia RTX 3050 Ti Laptop (M, or Mobile, was used earlier for 3050)
FP16-C: 2253 MLUPs/s
FP16-S: 2341 MLUPs/s
FP32: 1181 MLUPs/s
GTX 1650 Laptop GPU
AMD 5600xt
FP16C:
FP16S:
FP32:
Laptop with AMD 5300U FP32/FP16C is dramatically slow :(
.-----------------------------------------------------------------------------.
| ______________ ______________ |
| \ ________ | | ________ / |
| \ \ | | | | / / |
| \ \ | | | | / / |
| \ \ | | | | / / |
| \ \_.-" | | "-._/ / |
| \ _.-" _ "-._ / |
| \.-" _.-" "-._ "-./ |
| .-" .-"-. "-. |
| \ v" "v / |
| \ \ / / |
| \ \ / / |
| \ \ / / |
| \ ' / |
| \ / |
| \ / FluidX3D Version 2.8 |
| ' Copyright (c) Dr. Moritz Lehmann |
|-----------------------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID 0 | AMD Radeon Graphics (renoir, LLVM 15.0.6, DRM 3.52, 6.3.0-1-amd64) |
| Device ID 1 | gfx90c:xnack- |
|----------------'------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID | 1 |
| Device Name | gfx90c:xnack- |
| Device Vendor | Advanced Micro Devices, Inc. |
| Device Driver | 3581.0 (HSA1.1,LC) |
| OpenCL Version | OpenCL C 2.0 |
| Compute Units | 6 at 1500 MHz (384 cores, 1.152 TFLOPs/s) |
| Memory, Cache | 1024 MB, 16 KB global / 64 KB local |
| Buffer Limits | 870 MB global, 891289 KB constant |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled. |
| Info: Allocating memory. This may take a few seconds. |
|-----------------.-----------------------------------------------------------|
| Grid Resolution | 256 x 256 x 256 = 16777216 |
| Grid Domains | 1 x 1 x 1 = 1 |
| LBM Type | D3Q19 SRT (FP32/FP16C) |
| Memory Usage | CPU 272 MB, GPU 1x 880 MB |
| Max Alloc Size | 608 MB |
| Time Steps | 10 |
| Kin. Viscosity | 1.00000000 |
| Relaxation Time | 3.50000000 |
| Reynolds Number | Re < 148 |
|---------.-------'-----.-----------.-------------------.---------------------|
| MLUPs | Bandwidth | Steps/s | Current Step | Time Remaining |
| 145 | 11 GB/s | 9 | 3206 60% | 0s |
NVIDIA RTX A4000 (Ampere), in short: FP16S: 4945, FP16C: 4664, FP32: 2500 MLUPs/s
FP16S:
| \ / FluidX3D Version 2.9 |
| ' Copyright (c) Dr. Moritz Lehmann |
|-----------------------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID 0 | NVIDIA RTX A4000 |
|----------------'------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID | 0 |
| Device Name | NVIDIA RTX A4000 |
| Device Vendor | NVIDIA Corporation |
| Device Driver | 535.86.10 |
| OpenCL Version | OpenCL C 1.2 |
| Compute Units | 48 at 1560 MHz (6144 cores, 19.169 TFLOPs/s) |
| Memory, Cache | 16106 MB, 1344 KB global / 48 KB local |
| Buffer Limits | 4026 MB global, 64 KB constant |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled. |
| Info: Allocating memory. This may take a few seconds. |
|-----------------.-----------------------------------------------------------|
| Grid Resolution | 256 x 256 x 256 = 16777216 |
| Grid Domains | 1 x 1 x 1 = 1 |
| LBM Type | D3Q19 SRT (FP32/FP16S) |
| Memory Usage | CPU 272 MB, GPU 1x 880 MB |
| Max Alloc Size | 608 MB |
| Time Steps | 10 |
| Kin. Viscosity | 1.00000000 |
| Relaxation Time | 3.50000000 |
| Reynolds Number | Re < 148 |
|---------.-------'-----.-----------.-------------------.---------------------|
| MLUPs | Bandwidth | Steps/s | Current Step | Time Remaining |
| 4920 | 379 GB/s | 293 | 9990 0% | 0s |
|---------'-------------'-----------'-------------------'---------------------|
| Info: Peak MLUPs/s = 4945 |
FP16C:
| \ / FluidX3D Version 2.9 |
| ' Copyright (c) Dr. Moritz Lehmann |
|-----------------------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID 0 | NVIDIA RTX A4000 |
|----------------'------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID | 0 |
| Device Name | NVIDIA RTX A4000 |
| Device Vendor | NVIDIA Corporation |
| Device Driver | 535.86.10 |
| OpenCL Version | OpenCL C 1.2 |
| Compute Units | 48 at 1560 MHz (6144 cores, 19.169 TFLOPs/s) |
| Memory, Cache | 16106 MB, 1344 KB global / 48 KB local |
| Buffer Limits | 4026 MB global, 64 KB constant |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled. |
| Info: Allocating memory. This may take a few seconds. |
|-----------------.-----------------------------------------------------------|
| Grid Resolution | 256 x 256 x 256 = 16777216 |
| Grid Domains | 1 x 1 x 1 = 1 |
| LBM Type | D3Q19 SRT (FP32/FP16C) |
| Memory Usage | CPU 272 MB, GPU 1x 880 MB |
| Max Alloc Size | 608 MB |
| Time Steps | 10 |
| Kin. Viscosity | 1.00000000 |
| Relaxation Time | 3.50000000 |
| Reynolds Number | Re < 148 |
|---------.-------'-----.-----------.-------------------.---------------------|
| MLUPs | Bandwidth | Steps/s | Current Step | Time Remaining |
| 4516 | 348 GB/s | 269 | 9988 80% | 0s |
|---------'-------------'-----------'-------------------'---------------------|
| Info: Peak MLUPs/s = 4664 |
FP32:
| \ / FluidX3D Version 2.9 |
| ' Copyright (c) Dr. Moritz Lehmann |
|-----------------------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID 0 | NVIDIA RTX A4000 |
|----------------'------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID | 0 |
| Device Name | NVIDIA RTX A4000 |
| Device Vendor | NVIDIA Corporation |
| Device Driver | 535.86.10 |
| OpenCL Version | OpenCL C 1.2 |
| Compute Units | 48 at 1560 MHz (6144 cores, 19.169 TFLOPs/s) |
| Memory, Cache | 16106 MB, 1344 KB global / 48 KB local |
| Buffer Limits | 4026 MB global, 64 KB constant |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled. |
| Info: Allocating memory. This may take a few seconds. |
|-----------------.-----------------------------------------------------------|
| Grid Resolution | 256 x 256 x 256 = 16777216 |
| Grid Domains | 1 x 1 x 1 = 1 |
| LBM Type | D3Q19 SRT (FP32/FP32) |
| Memory Usage | CPU 272 MB, GPU 1x 1488 MB |
| Max Alloc Size | 1216 MB |
| Time Steps | 10 |
| Kin. Viscosity | 1.00000000 |
| Relaxation Time | 3.50000000 |
| Reynolds Number | Re < 148 |
|---------.-------'-----.-----------.-------------------.---------------------|
| MLUPs | Bandwidth | Steps/s | Current Step | Time Remaining |
| 2496 | 382 GB/s | 149 | 9999 90% | 0s |
|---------'-------------'-----------'-------------------'---------------------|
| Info: Peak MLUPs/s = 2500 |
NVIDIA T500
FP16S
| \ / FluidX3D Version 2.9 |
| ' Copyright (c) Dr. Moritz Lehmann |
|-----------------------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID 0 | Intel(R) Iris(R) Xe Graphics |
| Device ID 1 | NVIDIA T500 |
|----------------'------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID | 1 |
| Device Name | NVIDIA T500 |
| Device Vendor | NVIDIA Corporation |
| Device Driver | 529.08 |
| OpenCL Version | OpenCL C 1.2 |
| Compute Units | 14 at 1695 MHz (1792 cores, 6.075 TFLOPs/s) |
| Memory, Cache | 4095 MB, 448 KB global / 48 KB local |
| Buffer Limits | 1023 MB global, 64 KB constant |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled. |
| Info: Allocating memory. This may take a few seconds. |
|-----------------.-----------------------------------------------------------|
| Grid Resolution | 256 x 256 x 256 = 16777216 |
| Grid Domains | 1 x 1 x 1 = 1 |
| LBM Type | D3Q19 SRT (FP32/FP16S) |
| Memory Usage | CPU 272 MB, GPU 1x 880 MB |
| Max Alloc Size | 608 MB |
| Time Steps | 10 |
| Kin. Viscosity | 1.00000000 |
| Relaxation Time | 3.50000000 |
| Reynolds Number | Re < 148 |
|---------.-------'-----.-----------.-------------------.---------------------|
| MLUPs | Bandwidth | Steps/s | Current Step | Time Remaining |
| 573 | 44 GB/s | 34 | 9998 80% | 0s |
|---------'-------------'-----------'-------------------'---------------------|
| Info: Peak MLUPs/s = 578
FP16C
| \ / FluidX3D Version 2.9 |
| ' Copyright (c) Dr. Moritz Lehmann |
|-----------------------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID 0 | Intel(R) Iris(R) Xe Graphics |
| Device ID 1 | NVIDIA T500 |
|----------------'------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID | 1 |
| Device Name | NVIDIA T500 |
| Device Vendor | NVIDIA Corporation |
| Device Driver | 529.08 |
| OpenCL Version | OpenCL C 1.2 |
| Compute Units | 14 at 1695 MHz (1792 cores, 6.075 TFLOPs/s) |
| Memory, Cache | 4095 MB, 448 KB global / 48 KB local |
| Buffer Limits | 1023 MB global, 64 KB constant |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled. |
| Info: Allocating memory. This may take a few seconds. |
|-----------------.-----------------------------------------------------------|
| Grid Resolution | 256 x 256 x 256 = 16777216 |
| Grid Domains | 1 x 1 x 1 = 1 |
| LBM Type | D3Q19 SRT (FP32/FP16C) |
| Memory Usage | CPU 272 MB, GPU 1x 880 MB |
| Max Alloc Size | 608 MB |
| Time Steps | 10 |
| Kin. Viscosity | 1.00000000 |
| Relaxation Time | 3.50000000 |
| Reynolds Number | Re < 148 |
|---------.-------'-----.-----------.-------------------.---------------------|
| MLUPs | Bandwidth | Steps/s | Current Step | Time Remaining |
| 653 | 50 GB/s | 39 | 9998 80% | 0s |
|---------'-------------'-----------'-------------------'---------------------|
| Info: Peak MLUPs/s = 665
FP32
| \ / FluidX3D Version 2.9 |
| ' Copyright (c) Dr. Moritz Lehmann |
|-----------------------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID 0 | Intel(R) Iris(R) Xe Graphics |
| Device ID 1 | NVIDIA T500 |
|----------------'------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID | 1 |
| Device Name | NVIDIA T500 |
| Device Vendor | NVIDIA Corporation |
| Device Driver | 529.08 |
| OpenCL Version | OpenCL C 1.2 |
| Compute Units | 14 at 1695 MHz (1792 cores, 6.075 TFLOPs/s) |
| Memory, Cache | 4095 MB, 448 KB global / 48 KB local |
| Buffer Limits | 1023 MB global, 64 KB constant |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled. |
| Info: Allocating memory. This may take a few seconds. |
|-----------------.-----------------------------------------------------------|
| Grid Resolution | 256 x 256 x 256 = 16777216 |
| Grid Domains | 1 x 1 x 1 = 1 |
| LBM Type | D3Q19 SRT (FP32/FP32) |
| Memory Usage | CPU 272 MB, GPU 1x 1488 MB |
| Max Alloc Size | 1216 MB |
| Time Steps | 10 |
| Kin. Viscosity | 1.00000000 |
| Relaxation Time | 3.50000000 |
| Reynolds Number | Re < 148 |
|---------.-------'-----.-----------.-------------------.---------------------|
| MLUPs | Bandwidth | Steps/s | Current Step | Time Remaining |
| 335 | 51 GB/s | 20 | 9999 90% | 0s |
|---------'-------------'-----------'-------------------'---------------------|
| Info: Peak MLUPs/s = 339
Nvidia RTX A2000
FP32
AMD Radeon RX 6700 XT
FP32
FP16S
FP16C
I'm reporting the performance of Nvidia CMP 170HX.
It's worth introducing some background first. This is a peculiar mining-specific GPU based on the Ampere architecture and the Tesla A100 silicon. I purchased it second-hand from a closed mining farm to see if it's any good for running simulations. On one hand, this mining card has the Tesla A100's GA100 silicon and 1500 GB/s of DRAM bandwidth. On the other hand, Nvidia knows the hardware specs are attractive to HPC and AI users beyond mining, so they tried as hard as they could to make this GPU useless for those purposes. They did this by locking down everything besides memory bandwidth and integer operations: memory size, PCIe speed, and FP32/FP64 are all rendered almost useless (no pun intended).
In fact, the locked-down FP32 performance is so slow on this GPU that it turns the Lattice Boltzmann Method from a memory-bound kernel into a compute-bound kernel. Hence, FP32/FP16S and FP32/FP16C are not faster than FP32/FP32... Just unbelievable.
As a baseline check, the GPU really does deliver about 1200 GB/s from its HBM2e memory, but everything else is locked down.
.-----------------------------------------------------------------------------.
|----------------.------------------------------------------------------------|
| Device ID 0 | NVIDIA Graphics Device |
|----------------'------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID | 0 |
| Device Name | NVIDIA Graphics Device |
| Device Vendor | NVIDIA Corporation |
| Device Driver | 535.104.05 |
| OpenCL Version | OpenCL C 1.2 |
| Compute Units | 70 at 1410 MHz (8960 cores, 25.267 TFLOPs/s) |
| Memory, Cache | 7961 MB, 1960 KB global / 48 KB local |
| Buffer Limits | 1990 MB global, 64 KB constant |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled. |
| FP64 compute 0.189 TFLOPs/s (1/64) |
| FP32 compute 0.395 TFLOPs/s (1/64) |
| FP16 compute not supported |
| INT64 compute 2.516 TIOPs/s (1/12) |
| INT32 compute 12.493 TIOPs/s (1/2 ) |
| INT16 compute 10.013 TIOPs/s (1/3 ) |
| INT8 compute 10.219 TIOPs/s (1/3 ) |
| Memory Bandwidth ( coalesced read ) 1198.53 GB/s |
| Memory Bandwidth ( coalesced write) 1336.89 GB/s |
| Memory Bandwidth (misaligned read ) 793.58 GB/s |
| Memory Bandwidth (misaligned write) 139.86 GB/s |
| PCIe Bandwidth (send ) 0.81 GB/s |
| PCIe Bandwidth ( receive ) 0.84 GB/s |
| PCIe Bandwidth ( bidirectional) (Gen1 x16) 0.82 GB/s |
|-----------------------------------------------------------------------------|
|-----------------------------------------------------------------------------|
| Done. Press Enter to exit. |
'-----------------------------------------------------------------------------'
It's not tested by this benchmark, but according to gpu-burn, you get 6235 GFLOPS with Tensor Cores, which surprisingly are not totally locked down.
$ ./gpu_burn -tc 3600
Using compare file: compare.ptx
Burning for 3600 seconds.
GPU 0: NVIDIA Graphics Device (UUID: GPU-b40dd576-2299-87fc-7652-05377f5cf6e7)
Initialized device 0 with 7961 MB of memory (7660 MB available, using 6894 MB of it), using FLOATS, using Tensor Cores
Results are 268435456 bytes each, thus performing 24 iterations
10.1% proc'd: 2064 (6235 Gflop/s) errors: 0 temps: 32 C
Summary at: Tue Oct 24 04:29:03 UTC 2023
Also, it's worth noting that the reported "Gen1 x16" is incorrect. There are two layers of PCIe bandwidth lockdown. The first layer is the hardware or VBIOS lockdown to PCIe Gen 1. The second is the circuit-board-level lockdown from x16 to x4, done by omitting the AC coupling capacitors. The PCB-level lockdown should be trivially bypassable by hardware modification, but the link would still run at Gen1, so the gain would be of limited use.
GPU Link Info
PCIe Generation
Max : 2
Current : 1
Device Current : 1
Device Max : 1
Host Max : 3
Link Width
Max : 16x
Current : 4x
|----------------.------------------------------------------------------------|
| Device ID 0 | NVIDIA Graphics Device |
|----------------'------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID | 0 |
| Device Name | NVIDIA Graphics Device |
| Device Vendor | NVIDIA Corporation |
| Device Driver | 535.104.05 |
| OpenCL Version | OpenCL C 1.2 |
| Compute Units | 70 at 1410 MHz (8960 cores, 25.267 TFLOPs/s) |
| Memory, Cache | 7961 MB, 1960 KB global / 48 KB local |
| Buffer Limits | 1990 MB global, 64 KB constant |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled. |
| Info: Allocating memory. This may take a few seconds. |
|-----------------.-----------------------------------------------------------|
| Grid Resolution | 256 x 256 x 256 = 16777216 |
| Grid Domains | 1 x 1 x 1 = 1 |
| LBM Type | D3Q19 SRT (FP32/FP32) |
| Memory Usage | CPU 272 MB, GPU 1x 1488 MB |
| Max Alloc Size | 1216 MB |
| Time Steps | 10 |
| Kin. Viscosity | 1.00000000 |
| Relaxation Time | 3.50000000 |
| Reynolds Number | Re < 148 |
|---------.-------'-----.-----------.-------------------.---------------------|
| MLUPs | Bandwidth | Steps/s | Current Step | Time Remaining |
| 2276 | 348 GB/s | 136 | 9999 90% | 0s |
|---------'-------------'-----------'-------------------'---------------------|
| Info: Peak MLUPs/s = 2276 |
|----------------.------------------------------------------------------------|
| Device ID 0 | NVIDIA Graphics Device |
|----------------'------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID | 0 |
| Device Name | NVIDIA Graphics Device |
| Device Vendor | NVIDIA Corporation |
| Device Driver | 535.104.05 |
| OpenCL Version | OpenCL C 1.2 |
| Compute Units | 70 at 1410 MHz (8960 cores, 25.267 TFLOPs/s) |
| Memory, Cache | 7961 MB, 1960 KB global / 48 KB local |
| Buffer Limits | 1990 MB global, 64 KB constant |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled. |
| Info: Allocating memory. This may take a few seconds. |
|-----------------.-----------------------------------------------------------|
| Grid Resolution | 256 x 256 x 256 = 16777216 |
| Grid Domains | 1 x 1 x 1 = 1 |
| LBM Type | D3Q19 SRT (FP32/FP16S) |
| Memory Usage | CPU 272 MB, GPU 1x 880 MB |
| Max Alloc Size | 608 MB |
| Time Steps | 10 |
| Kin. Viscosity | 1.00000000 |
| Relaxation Time | 3.50000000 |
| Reynolds Number | Re < 148 |
|---------.-------'-----.-----------.-------------------.---------------------|
| MLUPs | Bandwidth | Steps/s | Current Step | Time Remaining |
| 2250 | 173 GB/s | 134 | 9996 60% | 0s |
|---------'-------------'-----------'-------------------'---------------------|
| Info: Peak MLUPs/s = 2250
|----------------.------------------------------------------------------------|
| Device ID | 0 |
| Device Name | NVIDIA Graphics Device |
| Device Vendor | NVIDIA Corporation |
| Device Driver | 535.104.05 |
| OpenCL Version | OpenCL C 1.2 |
| Compute Units | 70 at 1410 MHz (8960 cores, 25.267 TFLOPs/s) |
| Memory, Cache | 7961 MB, 1960 KB global / 48 KB local |
| Buffer Limits | 1990 MB global, 64 KB constant |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled. |
| Info: Allocating memory. This may take a few seconds. |
|-----------------.-----------------------------------------------------------|
| Grid Resolution | 256 x 256 x 256 = 16777216 |
| Grid Domains | 1 x 1 x 1 = 1 |
| LBM Type | D3Q19 SRT (FP32/FP16C) |
| Memory Usage | CPU 272 MB, GPU 1x 880 MB |
| Max Alloc Size | 608 MB |
| Time Steps | 10 |
| Kin. Viscosity | 1.00000000 |
| Relaxation Time | 3.50000000 |
| Reynolds Number | Re < 148 |
|---------.-------'-----.-----------.-------------------.---------------------|
| MLUPs | Bandwidth | Steps/s | Current Step | Time Remaining |
| 2266 | 174 GB/s | 135 | 9999 90% | 0s |
|---------'-------------'-----------'-------------------'---------------------|
| Info: Peak MLUPs/s = 2266
@biergaizi omg, the 170HX is literally e-waste. Thanks a lot for the benchmarks! Is it possible to remove some of the crippling by modifying the VBIOS?
> omg, the 170HX is literally e-waste.
I purchased this GPU specifically for developing and experimenting with GPU-accelerated FDTD electromagnetic field simulations. A naive implementation of this algorithm has an arithmetic intensity of 0.25 FLOPs/byte, on the far-left side of the roofline curve, so it is limited by memory bandwidth rather than compute. Thus, my test using the FDTD kernel shows the GPU still gives me a huge acceleration even with this level of performance lockdown. This is 1.5x faster than the Radeon VII / Instinct MI50 and on par with the actual A100. This is currently unmatched by everything else (if rumors are to be believed, the RTX 5090 is eventually going to match it at 1.5 TB/s in 2024).
Real-world simulations won't be as fast, VRAM and PCIe are going to be serious bottlenecks... Still, not the worst $500 I've spent... (doesn't change the fact that it's still a case of caveat emptor, do not buy unless you know exactly what you're buying...)
> Is it possible to remove some of the crippling by modifying the VBIOS?
I'm not well-versed in the GPU modding scene, so this is a question I'd also like answered... According to public information, Nvidia has enforced VBIOS digital signature checks in recent years, making VBIOS modification impossible. Last month, a bypass was found for RTX 20-series "Turing" based GPUs, but it's not applicable to Ampere GPUs. I'm afraid that VBIOS modding is not possible.
For the purpose of writing a review article, I'm now trying to plot a roofline model of the CMP 170HX. According to the documentation,
> 363 (FP32/FP32) or 406 (FP32/FP16S) or 1275 (FP32/FP16C) FLOPs per time step (FP32+INT32 operations counted combined)
Thus, 1 MLUPs/s is equivalent to 363 MFLOPs/s, and 2276 MLUPs/s implies a performance of 826.188 GFLOPs/s? But this is higher than the CMP 170HX's FP32 performance of 395 GFLOPs/s as measured by the benchmark. How can FluidX3D run faster than the hardware FP32 limit? The documentation says the count is in fact "FP32+INT32 operations counted combined". Does that mean the quoted figure of 363 does not represent pure FP32 operations, but also includes INT32 operations? If so, it would explain my result, since the CMP 170HX's integer performance is an uncapped 12 TIOPs/s. Do you have a detailed breakdown between FP32 and INT32 operations?
Hi @biergaizi,
in the FP32/FP32 setting, 1 cell update takes 363 arithmetic operations consisting of 261 FP32 flops + 102 INT32 ops. 1 MLUPs/s means 1 million cell updates per second, so 2276 MLUPs/s means 2276E6 LUPs/s * 363 Ops/LUP = 826 GOps/s, consisting of 594 GFlops/s (FP32) and 232 GIops/s (INT32).
The OpenCL-Benchmark measures Flops only with fused-multiply-add (FMA) instructions that can do 2 Flops/cycle; all other floating-point operations do 1 or less Flops/cycle. It's possible that Nvidia's artificial restrictions on ALUs extend differently across different types of Flops, like FMA, addition, multiplication, division, rsqrt, fmin/fmax, trigonometry etc. Not all Flops are created equal. Likely FMA is crippled the most, which would explain the very low benchmark performance. FluidX3D uses not only FMA but some other operations too, which might not be crippled as much.
Thanks for sharing your super interesting findings here and on Mastodon!!
Kind regards, Moritz
> Not all Flops are created equal. Likely FMA is crippled the most, which would explain the very low benchmark performance.
Thanks for the hint. I just tried a self-written benchmark in SYCL with and without FMA enabled (disabled via LLVM's -ffp-contract=off). I found that the CMP 170HX's non-FMA single-precision performance is ~6200 GFLOPS, the same number reported by gpu-burn in mixed-precision mode.
Hence, FMA on this card is basically disabled. Meanwhile, regular FP multiply and add are limited but still functional, with low but acceptable performance. 6200 GFLOPS is around the speed of a Titan X (Maxwell) or an RTX 2060, and a hypothetical RTX 2060 with extremely fast memory is still useful for many HPC simulations. Unfortunately, because nearly all GPU code assumes FMA is functional, there's no way to explicitly avoid its use. Even if it can be manually disabled by modifying the OpenCL runtime, if the original code was written with FMA (single rounding) rather than MAD (two roundings) in mind - which is almost always the case - disabling it would reduce numerical precision and make the results untrustworthy.
In conclusion, it has become clear that Nvidia's tactic to prevent the CMP 170HX from being used for HPC tasks is a simple but effective trick - disabling FMA to break software compatibility.
Now it makes me wonder if the use of single-rounding FMA is mandatory for numerical integrity in FluidX3D.
Hi @biergaizi,
some additional rounding should not affect results too much; the main source of error is discretization and not round-off, although round-off is very much optimized for in the current implementation. You can replace-all fma( with mad( in kernel.cpp and/or remove the -cl-fast-relaxed-math in opencl.hpp.
Alternatively, replace fma( with fake_fma( in kernel.cpp and add a macro "\n #define fake_fma(a,b,c) ((a)*(b)+(c))" here.
Kind regards, Moritz
> You can replace-all fma( with mad( in kernel.cpp and/or remove the -cl-fast-relaxed-math in opencl.hpp.
Both FMA and MAD are restricted on this GPU. Furthermore, my experiment with OpenCL showed that if mad() or fma() has been used explicitly, there's no way to prevent the compiler from generating the corresponding instructions apart from modifying the OpenCL runtime. This is likely true across platforms.
> Alternatively, replace fma( with fake_fma( in kernel.cpp and add a macro "\n #define fake_fma(a,b,c) ((a)*(b)+(c))" here.
After this modification, FP32/FP32 performance improved to 2919 MLUPs/s and finally broke the 1 TFLOPS barrier. Unfortunately, the compiler still performs some FMA/MAD transformations by default and there's no way to prevent the compiler from doing that, again, apart from modifying the OpenCL runtime or the compiler. Since the Nvidia OpenCL compiler is proprietary inside libnvidia-opencl.so.1, it's not possible without extensive reverse engineering.
Thus, I started wondering, "is it possible to replace the Nvidia OpenCL compiler with LLVM/clang?", and suddenly I remembered POCL - a free and portable OpenCL runtime based on LLVM/clang with the ability to target Nvidia PTX. I immediately installed POCL and modified the function pocl_llvm_build_program() in its source code to disable FMA/MAD transformations using the old trick of -ffp-contract=off.
And... Success! A modified FluidX3D with all FMA usage removed and a modified POCL with FMA/MAD disabled allowed FluidX3D to unleash the full power of Nvidia's GA100 silicon, at least in FP32/FP32 mode! :rocket:
|----------------.------------------------------------------------------------|
| Device ID | 2 |
| Device Name | NVIDIA Graphics Device |
| Device Vendor | NVIDIA Corporation |
| Device Driver | 4.0 |
| OpenCL Version | OpenCL C 1.2 PoCL |
| Compute Units | 70 at 1410 MHz (8960 cores, 25.267 TFLOPs/s) |
| Memory, Cache | 7961 MB, 0 KB global / 48 KB local |
| Buffer Limits | 1990 MB global, 64 KB constant |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled. |
| Info: Allocating memory. This may take a few seconds. |
|-----------------.-----------------------------------------------------------|
| Grid Resolution | 256 x 256 x 256 = 16777216 |
| Grid Domains | 1 x 1 x 1 = 1 |
| LBM Type | D3Q19 SRT (FP32/FP32) |
| Memory Usage | CPU 272 MB, GPU 1x 1488 MB |
| Max Alloc Size | 1216 MB |
| Time Steps | 10 |
| Kin. Viscosity | 1.00000000 |
| Relaxation Time | 3.50000000 |
| Reynolds Number | Re < 148 |
|---------.-------'-----.-----------.-------------------.---------------------|
| MLUPs | Bandwidth | Steps/s | Current Step | Time Remaining |
| 7681 | 1175 GB/s | 458 | 9985 50% | 0s |
|---------'-------------'-----------'-------------------'---------------------|
| Info: Peak MLUPs/s = 7684
But FP32/FP16S and FP32/FP16C modes are not faster than FP32/FP32 (although faster than their unmodified versions), possibly because of similar problems in the floating-point operations involved. I wonder if they too can be worked around.
FP32/FP16S:
|-----------------.-----------------------------------------------------------|
| Grid Resolution | 256 x 256 x 256 = 16777216 |
| Grid Domains | 1 x 1 x 1 = 1 |
| LBM Type | D3Q19 SRT (FP32/FP16S) |
| Memory Usage | CPU 272 MB, GPU 1x 880 MB |
| Max Alloc Size | 608 MB |
| Time Steps | 10 |
| Kin. Viscosity | 1.00000000 |
| Relaxation Time | 3.50000000 |
| Reynolds Number | Re < 148 |
|---------.-------'-----.-----------.-------------------.---------------------|
| MLUPs | Bandwidth | Steps/s | Current Step | Time Remaining |
| 7315 | 563 GB/s | 436 | 9983 30% | 0s |
|---------'-------------'-----------'-------------------'---------------------|
| Info: Peak MLUPs/s = 7316 |
FP32/FP16C:
|-----------------.-----------------------------------------------------------|
| Grid Resolution | 256 x 256 x 256 = 16777216 |
| Grid Domains | 1 x 1 x 1 = 1 |
| LBM Type | D3Q19 SRT (FP32/FP16C) |
| Memory Usage | CPU 272 MB, GPU 1x 880 MB |
| Max Alloc Size | 608 MB |
| Time Steps | 10 |
| Kin. Viscosity | 1.00000000 |
| Relaxation Time | 3.50000000 |
| Reynolds Number | Re < 148 |
|---------.-------'-----.-----------.-------------------.---------------------|
| MLUPs | Bandwidth | Steps/s | Current Step | Time Remaining |
| 4882 | 376 GB/s | 291 | 9990 0% | 0s |
|---------'-------------'-----------'-------------------'---------------------|
| Info: Peak MLUPs/s = 4883 |
Hi @biergaizi,
amazing that it worked! Not the first instance where PoCL beat Nvidia's own runtime. 🖖😛
You can try inserting "\n #pragma OPENCL FP_CONTRACT OFF" here and see if this fixes the bad performance on the Nvidia compiler.
Kind regards, Moritz
> You can try inserting "\n #pragma OPENCL FP_CONTRACT OFF" here and see if this fixes the bad performance on the Nvidia compiler.
This worked perfectly. It even fixed the performance problem of FP32/FP16S on the Nvidia compiler (PoCL's low performance there is probably due to a code generation problem). Now the performance in both cases is close to the Nvidia A100! The only exception is FP32/FP16C - the custom floating-point format probably either increased the arithmetic intensity beyond the FP32 non-FMA limit, or hit other restrictions.
The CMP 170HX suddenly has its killer app now.
|----------------.------------------------------------------------------------|
| Device ID | 0 |
| Device Name | NVIDIA Graphics Device |
| Device Vendor | NVIDIA Corporation |
| Device Driver | 535.104.05 |
| OpenCL Version | OpenCL C 1.2 |
| Compute Units | 70 at 1410 MHz (8960 cores, 25.267 TFLOPs/s) |
| Memory, Cache | 7961 MB, 1960 KB global / 48 KB local |
| Buffer Limits | 1990 MB global, 64 KB constant |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled. |
| Info: Allocating memory. This may take a few seconds. |
|-----------------.-----------------------------------------------------------|
| Grid Resolution | 256 x 256 x 256 = 16777216 |
| Grid Domains | 1 x 1 x 1 = 1 |
| LBM Type | D3Q19 SRT (FP32/FP32) |
| Memory Usage | CPU 272 MB, GPU 1x 1488 MB |
| Max Alloc Size | 1216 MB |
| Time Steps | 10 |
| Kin. Viscosity | 1.00000000 |
| Relaxation Time | 3.50000000 |
| Reynolds Number | Re < 148 |
|---------.-------'-----.-----------.-------------------.---------------------|
| MLUPs | Bandwidth | Steps/s | Current Step | Time Remaining |
| 7583 | 1160 GB/s | 452 | 9994 40% | 0s |
|---------'-------------'-----------'-------------------'---------------------|
| Info: Peak MLUPs/s = 7585 |
|---------.-------'-----.-----------.-------------------.---------------------|
| MLUPs | Bandwidth | Steps/s | Current Step | Time Remaining |
| 12386 | 954 GB/s | 738 | 9997 70% | 0s |
|---------'-------------'-----------'-------------------'---------------------|
| Info: Peak MLUPs/s = 12392 |
|---------.-------'-----.-----------.-------------------.---------------------|
| MLUPs | Bandwidth | Steps/s | Current Step | Time Remaining |
| 6853 | 528 GB/s | 408 | 9985 50% | 0s |
|---------'-------------'-----------'-------------------'---------------------|
| Info: Peak MLUPs/s = 6859 |
Even better, the minimal change needed for this workaround is just a two-line patch:
diff --git a/src/lbm.cpp b/src/lbm.cpp
index d99202f..28aeb25 100644
--- a/src/lbm.cpp
+++ b/src/lbm.cpp
@@ -286,6 +286,8 @@ void LBM_Domain::enqueue_unvoxelize_mesh_on_device(const Mesh* mesh, const uchar
}
string LBM_Domain::device_defines() const { return
+ "\n #pragma OPENCL FP_CONTRACT OFF" // prevents implicit FMA optimizations
+ "\n #define fma(a, b, c) ((a) * (b) + (c))" // shadows OpenCL explicit function fma()
"\n #define def_Nx "+to_string(Nx)+"u"
"\n #define def_Ny "+to_string(Ny)+"u"
"\n #define def_Nz "+to_string(Nz)+"u"
|----------------.------------------------------------------------------------|
| Device ID | 0 |
| Device Name | NVIDIA GeForce RTX 4080 Laptop GPU |
| Device Vendor | NVIDIA Corporation |
| Device Driver | 537.42 |
| OpenCL Version | OpenCL C 1.2 |
| Compute Units | 58 at 2280 MHz (7424 cores, 33.853 TFLOPs/s) |
| Memory, Cache | 12281 MB, 1624 KB global / 48 KB local |
| Buffer Limits | 3070 MB global, 64 KB constant |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled. |
| Info: Allocating memory. This may take a few seconds. |
|-----------------.-----------------------------------------------------------|
| Grid Resolution | 256 x 256 x 256 = 16777216 |
| Grid Domains | 1 x 1 x 1 = 1 |
| LBM Type | D3Q19 SRT (FP32/FP32) |
| Memory Usage | CPU 272 MB, GPU 1x 1488 MB |
| Max Alloc Size | 1216 MB |
| Time Steps | 10 |
| Kin. Viscosity | 1.00000000 |
| Relaxation Time | 3.50000000 |
| Reynolds Number | Re < 148 |
|---------.-------'-----.-----------.-------------------.---------------------|
| MLUPs | Bandwidth | Steps/s | Current Step | Time Remaining |
| 2544 | 389 GB/s | 152 | 9992 20% | 0s |
|---------'-------------'-----------'-------------------'---------------------|
| Info: Peak MLUPs/s = 2577 |
| Info: Peak MLUPs/s = 5086 |
| Info: Peak MLUPs/s = 5114 |
Haven't seen results for Nvidia A30.
.-----------------------------------------------------------------------------.
|----------------.------------------------------------------------------------|
| Device ID 0 | NVIDIA A30 |
|----------------'------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID | 0 |
| Device Name | NVIDIA A30 |
| Device Vendor | NVIDIA Corporation |
| Device Driver | 535.129.03 |
| OpenCL Version | OpenCL C 1.2 |
| Compute Units | 56 at 1440 MHz (3584 cores, 10.322 TFLOPs/s) |
| Memory, Cache | 24062 MB, 1568 KB global / 48 KB local |
| Buffer Limits | 6015 MB global, 64 KB constant |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled. |
| FP64 compute 5.053 TFLOPs/s (1/2 ) |
| FP32 compute 10.215 TFLOPs/s ( 1x ) |
| FP16 compute not supported |
| INT64 compute 1.990 TIOPs/s (1/4 ) |
| INT32 compute 10.285 TIOPs/s ( 1x ) |
| INT16 compute 8.158 TIOPs/s (2/3 ) |
| INT8 compute 8.316 TIOPs/s (2/3 ) |
| Memory Bandwidth ( coalesced read ) 806.94 GB/s |
| Memory Bandwidth ( coalesced write) 900.40 GB/s |
| Memory Bandwidth (misaligned read ) 651.78 GB/s |
| Memory Bandwidth (misaligned write) 80.94 GB/s |
| PCIe Bandwidth (send ) 19.16 GB/s |
| PCIe Bandwidth ( receive ) 13.22 GB/s |
| PCIe Bandwidth ( bidirectional) (Gen4 x16) 12.30 GB/s |
|-----------------------------------------------------------------------------|
|-----------------------------------------------------------------------------|
| Done. Press Enter to exit. |
'-----------------------------------------------------------------------------'
.-----------------------------------------------------------------------------.
| ______________ ______________ |
| \ ________ | | ________ / |
| \ \ | | | | / / |
| \ \ | | | | / / |
| \ \ | | | | / / |
| \ \_.-" | | "-._/ / |
| \ _.-" _ "-._ / |
| \.-" _.-" "-._ "-./ |
| .-" .-"-. "-. |
| \ v" "v / |
| \ \ / / |
| \ \ / / |
| \ \ / / |
| \ ' / |
| \ / |
| \ / FluidX3D Version 2.10 |
| ' Copyright (c) Dr. Moritz Lehmann |
|-----------------------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID 0 | NVIDIA A30 |
|----------------'------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID | 0 |
| Device Name | NVIDIA A30 |
| Device Vendor | NVIDIA Corporation |
| Device Driver | 535.129.03 |
| OpenCL Version | OpenCL C 1.2 |
| Compute Units | 56 at 1440 MHz (3584 cores, 10.322 TFLOPs/s) |
| Memory, Cache | 24062 MB, 1568 KB global / 48 KB local |
| Buffer Limits | 6015 MB global, 64 KB constant |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled. |
| Info: Allocating memory. This may take a few seconds. |
|-----------------.-----------------------------------------------------------|
| Grid Resolution | 256 x 256 x 256 = 16777216 |
| Grid Domains | 1 x 1 x 1 = 1 |
| LBM Type | D3Q19 SRT (FP32/FP16C) |
| Memory Usage | CPU 272 MB, GPU 1x 880 MB |
| Max Alloc Size | 608 MB |
| Time Steps | 10 |
| Kin. Viscosity | 1.00000000 |
| Relaxation Time | 3.50000000 |
| Reynolds Number | Re < 148 |
|---------.-------'-----.-----------.-------------------.---------------------|
| MLUPs | Bandwidth | Steps/s | Current Step | Time Remaining |
| 5712 | 440 GB/s | 340 | 9994 40% | 0s |
|---------'-------------'-----------'-------------------'---------------------|
| Info: Peak MLUPs/s = 5726 |
|----------------.------------------------------------------------------------|
| Device ID 0 | NVIDIA A30 |
|----------------'------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID | 0 |
| Device Name | NVIDIA A30 |
| Device Vendor | NVIDIA Corporation |
| Device Driver | 535.129.03 |
| OpenCL Version | OpenCL C 1.2 |
| Compute Units | 56 at 1440 MHz (3584 cores, 10.322 TFLOPs/s) |
| Memory, Cache | 24062 MB, 1568 KB global / 48 KB local |
| Buffer Limits | 6015 MB global, 64 KB constant |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled. |
| Info: Allocating memory. This may take a few seconds. |
|-----------------.-----------------------------------------------------------|
| Grid Resolution | 256 x 256 x 256 = 16777216 |
| Grid Domains | 1 x 1 x 1 = 1 |
| LBM Type | D3Q19 SRT (FP32/FP16S) |
| Memory Usage | CPU 272 MB, GPU 1x 880 MB |
| Max Alloc Size | 608 MB |
| Time Steps | 10 |
| Kin. Viscosity | 1.00000000 |
| Relaxation Time | 3.50000000 |
| Reynolds Number | Re < 148 |
|---------.-------'-----.-----------.-------------------.---------------------|
| MLUPs | Bandwidth | Steps/s | Current Step | Time Remaining |
| 9718 | 748 GB/s | 579 | 9993 30% | 0s |
|---------'-------------'-----------'-------------------'---------------------|
| Info: Peak MLUPs/s = 9721 |
|----------------.------------------------------------------------------------|
| Device ID 0 | NVIDIA A30 |
|----------------'------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID | 0 |
| Device Name | NVIDIA A30 |
| Device Vendor | NVIDIA Corporation |
| Device Driver | 535.129.03 |
| OpenCL Version | OpenCL C 1.2 |
| Compute Units | 56 at 1440 MHz (3584 cores, 10.322 TFLOPs/s) |
| Memory, Cache | 24062 MB, 1568 KB global / 48 KB local |
| Buffer Limits | 6015 MB global, 64 KB constant |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled. |
| Info: Allocating memory. This may take a few seconds. |
|-----------------.-----------------------------------------------------------|
| Grid Resolution | 256 x 256 x 256 = 16777216 |
| Grid Domains | 1 x 1 x 1 = 1 |
| LBM Type | D3Q19 SRT (FP32/FP32) |
| Memory Usage | CPU 272 MB, GPU 1x 1488 MB |
| Max Alloc Size | 1216 MB |
| Time Steps | 10 |
| Kin. Viscosity | 1.00000000 |
| Relaxation Time | 3.50000000 |
| Reynolds Number | Re < 148 |
|---------.-------'-----.-----------.-------------------.---------------------|
| MLUPs | Bandwidth | Steps/s | Current Step | Time Remaining |
| 5002 | 765 GB/s | 298 | 9997 70% | 0s |
|---------'-------------'-----------'-------------------'---------------------|
| Info: Peak MLUPs/s = 5004 |
Apple M1 Ultra 128G FP32
.-----------------------------------------------------------------------------.
| ______________ ______________ |
| \ ________ | | ________ / |
| \ \ | | | | / / |
| \ \ | | | | / / |
| \ \ | | | | / / |
| \ \_.-" | | "-._/ / |
| \ _.-" _ "-._ / |
| \.-" _.-" "-._ "-./ |
| .-" .-"-. "-. |
| \ v" "v / |
| \ \ / / |
| \ \ / / |
| \ \ / / |
| \ ' / |
| \ / |
| \ / FluidX3D Version 2.10 |
| ' Copyright (c) Dr. Moritz Lehmann |
|-----------------------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID 0 | Apple M1 Ultra |
|----------------'------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID | 0 |
| Device Name | Apple M1 Ultra |
| Device Vendor | Apple |
| Device Driver | 1.2 1.0 |
| OpenCL Version | OpenCL C 1.2 |
| Compute Units | 64 at 1000 MHz (8192 cores, 16.384 TFLOPs/s) |
| Memory, Cache | 98304 MB, 0 KB global / 32 KB local |
| Buffer Limits | 18432 MB global, 1048576 KB constant |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled. |
| Info: Allocating memory. This may take a few seconds. |
|-----------------.-----------------------------------------------------------|
| Grid Resolution | 256 x 256 x 256 = 16777216 |
| Grid Domains | 1 x 1 x 1 = 1 |
| LBM Type | D3Q19 SRT (FP32/FP32) |
| Memory Usage | CPU 272 MB, GPU 1x 1488 MB |
| Max Alloc Size | 1216 MB |
| Time Steps | 10 |
| Kin. Viscosity | 1.00000000 |
| Relaxation Time | 3.50000000 |
| Reynolds Number | Re < 148 |
|---------.-------'-----.-----------.-------------------.---------------------|
| MLUPs | Bandwidth | Steps/s | Current Step | Time Remaining |
| 4448 | 681 GB/s | 265 | 9987 70% | 0s |
|---------'-------------'-----------'-------------------'---------------------|
| Info: Peak MLUPs/s = 4519 |
FP16S
|----------------.------------------------------------------------------------|
| Device ID 0 | Apple M1 Ultra |
|----------------'------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID | 0 |
| Device Name | Apple M1 Ultra |
| Device Vendor | Apple |
| Device Driver | 1.2 1.0 |
| OpenCL Version | OpenCL C 1.2 |
| Compute Units | 64 at 1000 MHz (8192 cores, 16.384 TFLOPs/s) |
| Memory, Cache | 98304 MB, 0 KB global / 32 KB local |
| Buffer Limits | 18432 MB global, 1048576 KB constant |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled. |
| Info: Allocating memory. This may take a few seconds. |
|-----------------.-----------------------------------------------------------|
| Grid Resolution | 256 x 256 x 256 = 16777216 |
| Grid Domains | 1 x 1 x 1 = 1 |
| LBM Type | D3Q19 SRT (FP32/FP16S) |
| Memory Usage | CPU 272 MB, GPU 1x 880 MB |
| Max Alloc Size | 608 MB |
| Time Steps | 10 |
| Kin. Viscosity | 1.00000000 |
| Relaxation Time | 3.50000000 |
| Reynolds Number | Re < 148 |
|---------.-------'-----.-----------.-------------------.---------------------|
| MLUPs | Bandwidth | Steps/s | Current Step | Time Remaining |
| 8286 | 638 GB/s | 494 | 9995 50% | 0s |
|---------'-------------'-----------'-------------------'---------------------|
| Info: Peak MLUPs/s = 8418 |
FP16C
|----------------.------------------------------------------------------------|
| Device ID 0 | Apple M1 Ultra |
|----------------'------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID | 0 |
| Device Name | Apple M1 Ultra |
| Device Vendor | Apple |
| Device Driver | 1.2 1.0 |
| OpenCL Version | OpenCL C 1.2 |
| Compute Units | 64 at 1000 MHz (8192 cores, 16.384 TFLOPs/s) |
| Memory, Cache | 98304 MB, 0 KB global / 32 KB local |
| Buffer Limits | 18432 MB global, 1048576 KB constant |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled. |
| Info: Allocating memory. This may take a few seconds. |
|-----------------.-----------------------------------------------------------|
| Grid Resolution | 256 x 256 x 256 = 16777216 |
| Grid Domains | 1 x 1 x 1 = 1 |
| LBM Type | D3Q19 SRT (FP32/FP16C) |
| Memory Usage | CPU 272 MB, GPU 1x 880 MB |
| Max Alloc Size | 608 MB |
| Time Steps | 10 |
| Kin. Viscosity | 1.00000000 |
| Relaxation Time | 3.50000000 |
| Reynolds Number | Re < 148 |
|---------.-------'-----.-----------.-------------------.---------------------|
| MLUPs | Bandwidth | Steps/s | Current Step | Time Remaining |
| 6794 | 523 GB/s | 405 | 9979 90% | 0s |
|---------'-------------'-----------'-------------------'---------------------|
| Info: Peak MLUPs/s = 6915 |
How is everyone doing the benchmarks for multi-GPU configurations? I'm playing around with MI25s and not seeing anywhere near the performance the specs would suggest. I'm wondering if I have a hardware bottleneck or if I missed something in the setup.
@marcc1229 use the "2/4/8 GPUs" lines in the benchmark setup, and for memory use a value close to the VRAM capacity of one GPU, like 15800u. For fine-tuning you can also set the resolution directly, for example const uint3 lbm_N = uint3(464u);
The multi-GPU communication has some performance overhead, which shrinks relative to domain compute time the larger the resolution is. The highest possible resolution is the best performing and also the most interesting case for multi-GPU, as at lower resolution a single GPU would be sufficient. But performance at similarly large resolutions should not be too different.
For the single-GPU benchmark the resolution should not matter at all as long as it's sufficiently large for full hardware saturation.
However, the older GCN/Vega GPUs can have vastly different performance for slightly different grid resolutions / workgroup counts. This is the cursed memory bandwidth anomaly, a problem of the hardware architecture. Try some different large resolutions.
A potential bottleneck could be PCIe communication. If you have a server where each GPU is connected by PCIe 3.0 x16 or x8, this should not be an issue. But cheap crypto mining hardware with USB 3 / PCIe 3.0 x1 connections, for example, is problematic.
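To get a feel for which resolution goes with which memory value, here is a rough sketch. It assumes the per-cell memory footprint implied by the printouts in this thread (1488 MB for 256³ cells in FP32/FP32, 880 MB in FP32/FP16S/C). The helper name and constants are hypothetical and not part of FluidX3D, and it ignores anything else living in VRAM.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Bytes of GPU memory per lattice cell, derived from the benchmark printouts
// in this thread (256^3 = 16777216 cells): 1488 MB -> 93 B/cell for FP32/FP32,
// 880 MB -> 55 B/cell for FP32/FP16S and FP32/FP16C. 1 MB = 2^20 bytes here.
constexpr std::uint64_t kBytesPerCellFP32 = 93u;
constexpr std::uint64_t kBytesPerCellFP16 = 55u;

// Largest cubic grid edge N such that N^3 cells fit into memory_mb megabytes.
// Hypothetical helper for illustration only.
inline std::uint32_t max_cubic_resolution(std::uint64_t memory_mb, std::uint64_t bytes_per_cell) {
    const std::uint64_t cells = memory_mb * 1048576ull / bytes_per_cell;
    std::uint64_t n = static_cast<std::uint64_t>(std::cbrt(static_cast<double>(cells)));
    while ((n + 1u) * (n + 1u) * (n + 1u) <= cells) ++n; // correct cbrt rounding down
    while (n > 0u && n * n * n > cells) --n;             // correct cbrt rounding up
    return static_cast<std::uint32_t>(n);
}
```

With these numbers, max_cubic_resolution(15800, kBytesPerCellFP16) comes out at 670 for a single domain. The benchmark setup does its own sizing, so treat this as a rough upper bound rather than the value FluidX3D will pick.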
This is what I'm getting with two MI25s flashed with the WX9100 BIOS, running at PCIe 3.0 x16. I couldn't let them run all the way through because I don't have proper cooling set up yet. I just wanted to test these before committing to buying more and designing a proper cooling setup. I'm a mechanic by trade and I'm trying to use this to help design an aero/cooling setup for a long-running car project, so my apologies if I end up asking incredibly stupid questions. I'm learning as I go.
The small but apparently decently mighty original M1 (2020 MBP).
.-----------------------------------------------------------------------------.
| ______________ ______________ |
| \ ________ | | ________ / |
| \ \ | | | | / / |
| \ \ | | | | / / |
| \ \ | | | | / / |
| \ \_.-" | | "-._/ / |
| \ _.-" _ "-._ / |
| \.-" _.-" "-._ "-./ |
| .-" .-"-. "-. |
| \ v" "v / |
| \ \ / / |
| \ \ / / |
| \ \ / / |
| \ ' / |
| \ / |
| \ / FluidX3D Version 2.11 |
| ' Copyright (c) Dr. Moritz Lehmann |
|-----------------------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID 0 | Apple M1 |
|----------------'------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID | 0 |
| Device Name | Apple M1 |
| Device Vendor | Apple |
| Device Driver | 1.2 1.0 (macOS) |
| OpenCL Version | OpenCL C 1.2 |
| Compute Units | 8 at 1000 MHz (1024 cores, 2.048 TFLOPs/s) |
| Memory, Cache | 10922 MB, 0 KB global / 32 KB local |
| Buffer Limits | 2048 MB global, 1048576 KB constant |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled. |
| Info: Allocating memory. This may take a few seconds. |
|-----------------.-----------------------------------------------------------|
| Grid Resolution | 256 x 256 x 256 = 16777216 |
| Grid Domains | 1 x 1 x 1 = 1 |
| LBM Type | D3Q19 SRT (FP32/FP32) |
| Memory Usage | CPU 272 MB, GPU 1x 1488 MB |
| Max Alloc Size | 1216 MB |
| Time Steps | 10 |
| Kin. Viscosity | 1.00000000 |
| Relaxation Time | 3.50000000 |
| Reynolds Number | Re < 148 |
|---------.-------'-----.-----------.-------------------.---------------------|
| MLUPs | Bandwidth | Steps/s | Current Step | Time Remaining |
| 377 | 58 GB/s | 22 | 9998 80% | 0s |
|---------'-------------'-----------'-------------------'---------------------|
| Info: Peak MLUPs/s = 384 |
|----------------.------------------------------------------------------------|
| Device ID 0 | Apple M1 |
|----------------'------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID | 0 |
| Device Name | Apple M1 |
| Device Vendor | Apple |
| Device Driver | 1.2 1.0 (macOS) |
| OpenCL Version | OpenCL C 1.2 |
| Compute Units | 8 at 1000 MHz (1024 cores, 2.048 TFLOPs/s) |
| Memory, Cache | 10922 MB, 0 KB global / 32 KB local |
| Buffer Limits | 2048 MB global, 1048576 KB constant |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled. |
| Info: Allocating memory. This may take a few seconds. |
|-----------------.-----------------------------------------------------------|
| Grid Resolution | 256 x 256 x 256 = 16777216 |
| Grid Domains | 1 x 1 x 1 = 1 |
| LBM Type | D3Q19 SRT (FP32/FP16S) |
| Memory Usage | CPU 272 MB, GPU 1x 880 MB |
| Max Alloc Size | 608 MB |
| Time Steps | 10 |
| Kin. Viscosity | 1.00000000 |
| Relaxation Time | 3.50000000 |
| Reynolds Number | Re < 148 |
|---------.-------'-----.-----------.-------------------.---------------------|
| MLUPs | Bandwidth | Steps/s | Current Step | Time Remaining |
| 752 | 58 GB/s | 45 | 9999 90% | 0s |
|---------'-------------'-----------'-------------------'---------------------|
| Info: Peak MLUPs/s = 758 |
|----------------.------------------------------------------------------------|
| Device ID 0 | Apple M1 |
|----------------'------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID | 0 |
| Device Name | Apple M1 |
| Device Vendor | Apple |
| Device Driver | 1.2 1.0 (macOS) |
| OpenCL Version | OpenCL C 1.2 |
| Compute Units | 8 at 1000 MHz (1024 cores, 2.048 TFLOPs/s) |
| Memory, Cache | 10922 MB, 0 KB global / 32 KB local |
| Buffer Limits | 2048 MB global, 1048576 KB constant |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled. |
| Info: Allocating memory. This may take a few seconds. |
|-----------------.-----------------------------------------------------------|
| Grid Resolution | 256 x 256 x 256 = 16777216 |
| Grid Domains | 1 x 1 x 1 = 1 |
| LBM Type | D3Q19 SRT (FP32/FP16C) |
| Memory Usage | CPU 272 MB, GPU 1x 880 MB |
| Max Alloc Size | 608 MB |
| Time Steps | 10 |
| Kin. Viscosity | 1.00000000 |
| Relaxation Time | 3.50000000 |
| Reynolds Number | Re < 148 |
|---------.-------'-----.-----------.-------------------.---------------------|
| MLUPs | Bandwidth | Steps/s | Current Step | Time Remaining |
| 755 | 58 GB/s | 45 | 9998 80% | 0s |
|---------'-------------'-----------'-------------------'---------------------|
| Info: Peak MLUPs/s = 759 |
Windows 11 NVIDIA GeForce MX450 MLUPs/s 185
Howdy! Benchmark Results below for the new Nvidia L40S being tested in the Marigold Systems Lab, requested from the /r/Nvidia Subreddit.
FP32-16C
FP32-16S
FP32-FP32
Nvidia L40S, Dell PowerEdge R760, Ubuntu Server 22.04.3 LTS, Nvidia 535.129 driver
@dboswell-marigoldsystems thank you!!
RTX3050
Wait. Is this right for 3090 FP32/FP16S? I got over 658k MLUPs/s just by changing uint memory to 24000u. Also, for 2 3090s I got 167k MLUPs/s
Is it required to keep the memory size at 1488u? The 1488u result looks normal to me compared to those on the benchmark sheet.
Also, here are the results using FP32/FP32 on 1488u memory
@lslowmotion for single-GPU, performance is mostly independent of grid size / memory occupation, so use the default 256³ / 1488u MB here.
For multi-GPU benchmarking, a larger grid size is a bit faster, because domain communication time becomes smaller relative to domain compute time. Since the OS itself needs a few hundred MB of VRAM, at 24000 MB the memory allocation will fail (without an error message, unfortunately), the kernels don't actually execute, and you get unphysically high scores. Use a bit less than the max VRAM capacity, like 23500u. Thanks!
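A quick way to spot such unphysical scores is to convert the MLUPs/s figure back into the memory bandwidth it would require. The sketch below assumes the bytes-per-update values implied by the MLUPs and Bandwidth columns printed in this thread. The function names are hypothetical, and the peak bandwidth is whatever spec-sheet value the caller supplies.

```cpp
#include <cassert>

// Bytes moved per cell update, inferred from the MLUPs and Bandwidth columns
// of the printouts in this thread: FP32/FP32 -> 153 B, FP32/FP16S/C -> 77 B
// (e.g. 7583 MLUPs/s * 153 B = 1160 GB/s, 12386 MLUPs/s * 77 B = 954 GB/s).
constexpr double kBytesPerUpdateFP32 = 153.0;
constexpr double kBytesPerUpdateFP16 = 77.0;

// Memory bandwidth in GB/s that a given MLUPs/s score would require.
inline double implied_bandwidth_gbs(double mlups, double bytes_per_update) {
    return mlups * bytes_per_update / 1000.0; // MLUPs/s * B = MB/s, then -> GB/s
}

// A score is only believable if the implied bandwidth does not exceed what
// the hardware can deliver. peak_gbs is a caller-supplied spec-sheet value.
inline bool is_plausible_score(double mlups, double bytes_per_update, double peak_gbs) {
    return implied_bandwidth_gbs(mlups, bytes_per_update) <= peak_gbs;
}
```

For example, is_plausible_score(658000.0, kBytesPerUpdateFP16, 936.0) is false: 658k MLUPs/s would need roughly 50 TB/s. The 936 GB/s here is the RTX 3090 spec-sheet bandwidth as I recall it, so double-check it for your card.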
@ProjectPhysX yea, with 23000u it now looks more in line with how it should be. Thanks.
Also to complete the ones above, here are single and dual 3090s in FP32/FP16C to add to the benchmark table. Hope these help!
Hi @lslowmotion,
today I realized that with an optimization in update v2.11, I accidentally stepped on a bug in Nvidia's OpenCL driver, which caused failure of memory allocation for larger simulations, including your benchmark runs at larger resolution. This is now fixed in the master branch! Large resolutions up to 2x ~23000 MB are now working again also with the FP16 types. Apologies for the trouble!
Kind regards, Moritz
These are mi25's flashed with wx9100 bios mounted directly to the board.
Results for the ASUS 4070 Ti Super 16GB card, not overclocked.
RTX 3080 12GB edition - FP16S
RTX 3080 12GB edition - FP16C
RTX 3080 12GB edition - FP32
Here's a multi GPU (technically) result for a Tesla K80 (2 core) GPU. There's a single core K80 (12GB) result in the benchmarks, but now that we have multi GPU functionality here's the 2 core K80 (24GB) result!
| \ / FluidX3D Version 2.14 |
| ' Copyright (c) Dr. Moritz Lehmann |
|-----------------------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID 0 | NVIDIA GeForce GTX 1060 6GB |
|----------------'------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID | 0 |
| Device Name | NVIDIA GeForce GTX 1060 6GB |
| Device Vendor | NVIDIA Corporation |
| Device Driver | 535.161.07 (Linux) |
| OpenCL Version | OpenCL C 1.2 |
| Compute Units | 10 at 1784 MHz (1280 cores, 4.567 TFLOPs/s) |
| Memory, Cache | 6064 MB, 480 KB global / 48 KB local |
| Buffer Limits | 1516 MB global, 64 KB constant |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled. |
| Info: Allocating memory. This may take a few seconds. |
|-----------------.-----------------------------------------------------------|
| Grid Resolution | 256 x 256 x 256 = 16777216 |
| Grid Domains | 1 x 1 x 1 = 1 |
| LBM Type | D3Q19 SRT (FP32/FP32) |
| Memory Usage | CPU 272 MB, GPU 1x 1488 MB |
| Max Alloc Size | 1216 MB |
| Time Steps | 10 |
| Kin. Viscosity | 1.00000000 |
| Relaxation Time | 3.50000000 |
| Reynolds Number | Re < 148 |
|---------.-------'-----.-----------.-------------------.---------------------|
| MLUPs | Bandwidth | Steps/s | Current Step | Time Remaining |
| 995 | 152 GB/s | 59 | 9997 70% | 0s |
|---------'-------------'-----------'-------------------'---------------------|
| Info: Peak MLUPs/s = 997 |
|----------------.------------------------------------------------------------|
| Device ID 0 | NVIDIA GeForce GTX 1060 6GB |
|----------------'------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID | 0 |
| Device Name | NVIDIA GeForce GTX 1060 6GB |
| Device Vendor | NVIDIA Corporation |
| Device Driver | 535.161.07 (Linux) |
| OpenCL Version | OpenCL C 1.2 |
| Compute Units | 10 at 1784 MHz (1280 cores, 4.567 TFLOPs/s) |
| Memory, Cache | 6064 MB, 480 KB global / 48 KB local |
| Buffer Limits | 1516 MB global, 64 KB constant |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled. |
| Info: Allocating memory. This may take a few seconds. |
|-----------------.-----------------------------------------------------------|
| Grid Resolution | 256 x 256 x 256 = 16777216 |
| Grid Domains | 1 x 1 x 1 = 1 |
| LBM Type | D3Q19 SRT (FP32/FP16S) |
| Memory Usage | CPU 272 MB, GPU 1x 880 MB |
| Max Alloc Size | 608 MB |
| Time Steps | 10 |
| Kin. Viscosity | 1.00000000 |
| Relaxation Time | 3.50000000 |
| Reynolds Number | Re < 148 |
|---------.-------'-----.-----------.-------------------.---------------------|
| MLUPs | Bandwidth | Steps/s | Current Step | Time Remaining |
| 1924 | 148 GB/s | 115 | 9994 40% | 0s |
|---------'-------------'-----------'-------------------'---------------------|
| Info: Peak MLUPs/s = 1925 |
|----------------.------------------------------------------------------------|
| Device ID 0 | NVIDIA GeForce GTX 1060 6GB |
|----------------'------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID | 0 |
| Device Name | NVIDIA GeForce GTX 1060 6GB |
| Device Vendor | NVIDIA Corporation |
| Device Driver | 535.161.07 (Linux) |
| OpenCL Version | OpenCL C 1.2 |
| Compute Units | 10 at 1784 MHz (1280 cores, 4.567 TFLOPs/s) |
| Memory, Cache | 6064 MB, 480 KB global / 48 KB local |
| Buffer Limits | 1516 MB global, 64 KB constant |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled. |
| Info: Allocating memory. This may take a few seconds. |
|-----------------.-----------------------------------------------------------|
| Grid Resolution | 256 x 256 x 256 = 16777216 |
| Grid Domains | 1 x 1 x 1 = 1 |
| LBM Type | D3Q19 SRT (FP32/FP16C) |
| Memory Usage | CPU 272 MB, GPU 1x 880 MB |
| Max Alloc Size | 608 MB |
| Time Steps | 10 |
| Kin. Viscosity | 1.00000000 |
| Relaxation Time | 3.50000000 |
| Reynolds Number | Re < 148 |
|---------.-------'-----.-----------.-------------------.---------------------|
| MLUPs | Bandwidth | Steps/s | Current Step | Time Remaining |
| 1772 | 136 GB/s | 106 | 9994 40% | 0s |
|---------'-------------'-----------'-------------------'---------------------|
| Info: Peak MLUPs/s = 1785 |
Results on AMD Radeon RX590 8GB (Running on Clover-Mesa OpenCL 1.2)
FP32
FP16C
FP16S
Results on AMD Radeon RX590 8GB (Running on Rusticl-Mesa OpenCL 1.2)
FP32
FP16C
FP16S
So if you want to use an open-source OpenCL implementation (Clover or Rusticl), use Clover until Rusticl becomes better.
Clover by default is OpenCL 1.1 conformant, but you can export:
to use OpenCL 1.2
RoG Strix Laptop:
| \ / FluidX3D Version 2.16 |
| ' Copyright (c) Dr. Moritz Lehmann |
|-----------------------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID 0 | NVIDIA GeForce RTX 3080 Ti Laptop GPU |
| Device ID 1 | Intel(R) UHD Graphics 770 |
|----------------'------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID | 0 |
| Device Name | NVIDIA GeForce RTX 3080 Ti Laptop GPU |
| Device Vendor | NVIDIA Corporation |
| Device Driver | 516.40 (Windows) |
| OpenCL Version | OpenCL C 1.2 |
| Compute Units | 58 at 1590 MHz (7424 cores, 23.608 TFLOPs/s) |
| Memory, Cache | 16383 MB, 1624 KB global / 48 KB local |
| Buffer Limits | 4095 MB global, 64 KB constant |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled. |
| Info: Allocating memory. This may take a few seconds. |
|-----------------.-----------------------------------------------------------|
| Grid Resolution | 256 x 256 x 256 = 16777216 |
| Grid Domains | 1 x 1 x 1 = 1 |
| LBM Type | D3Q19 SRT (FP32/FP32) |
| Memory Usage | CPU 272 MB, GPU 1x 1488 MB |
| Max Alloc Size | 1216 MB |
| Time Steps | 10 |
| Kin. Viscosity | 1.00000000 |
| Relaxation Time | 3.50000000 |
| Reynolds Number | Re < 148 |
|---------.-------'-----.-----------.-------------------.---------------------|
| MLUPs | Bandwidth | Steps/s | Current Step | Time Remaining |
| 2972 | 455 GB/s | 177 | 9992 20% | 0s |
|---------'-------------'-----------'-------------------'---------------------|
| Info: Peak MLUPs/s = 2985 |
Interesting how my laptop 3080 Ti beats the other laptops' RTX 4080!
Hi @gitcnd, thanks a lot! Can you please add the FP16S and FP16C benchmarks too? Almost all RTX 40 series GPUs have severely reduced memory bus width and memory bandwidth as compared to their RTX 30 predecessors, making them slower in compute applications.
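To make the bandwidth point concrete: the Bandwidth column in the tables above is consistent with MLUPs/s multiplied by the bytes moved per cell update. Here is a minimal sketch, assuming D3Q19 moves 153 bytes per cell in FP32 and 77 bytes in FP16S/FP16C (19 densities read and written, plus one flag byte). The function names are illustrative only and not part of FluidX3D.

```python
# Assumed bytes transferred per cell per time step for D3Q19:
# 19 densities read + 19 densities written + 1 flag byte.
BYTES_PER_CELL = {
    "FP32": 19 * 4 * 2 + 1,   # 153
    "FP16S": 19 * 2 * 2 + 1,  # 77
    "FP16C": 19 * 2 * 2 + 1,  # 77
}


def bandwidth_gb_per_s(mlups, precision):
    """Effective memory bandwidth implied by a MLUPs/s figure."""
    return mlups * 1e6 * BYTES_PER_CELL[precision] / 1e9


def steps_per_s(mlups, cells=256 ** 3):
    """Time steps per second for a grid with the given number of cells."""
    return mlups * 1e6 / cells


# FP32 run of the RTX 3080 Ti Laptop GPU reported above: 2972 MLUPs/s
print(round(bandwidth_gb_per_s(2972, "FP32")))  # 455, matches the table
print(round(steps_per_s(2972)))                 # 177, matches the table
```

Since the benchmark is limited by memory bandwidth, a GPU with a narrower memory bus scores lower here regardless of its TFLOPs/s rating.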
Sorry about that - here they are:
| \ / FluidX3D Version 2.16 |
| ' Copyright (c) Dr. Moritz Lehmann |
|-----------------------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID 0 | NVIDIA GeForce RTX 3080 Ti Laptop GPU |
| Device ID 1 | Intel(R) UHD Graphics 770 |
|----------------'------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID | 0 |
| Device Name | NVIDIA GeForce RTX 3080 Ti Laptop GPU |
| Device Vendor | NVIDIA Corporation |
| Device Driver | 516.40 (Windows) |
| OpenCL Version | OpenCL C 1.2 |
| Compute Units | 58 at 1590 MHz (7424 cores, 23.608 TFLOPs/s) |
| Memory, Cache | 16383 MB, 1624 KB global / 48 KB local |
| Buffer Limits | 4095 MB global, 64 KB constant |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled. |
| Info: Allocating memory. This may take a few seconds. |
|-----------------.-----------------------------------------------------------|
| Grid Resolution | 256 x 256 x 256 = 16777216 |
| Grid Domains | 1 x 1 x 1 = 1 |
| LBM Type | D3Q19 SRT (FP32/FP16S) |
| Memory Usage | CPU 272 MB, GPU 1x 880 MB |
| Max Alloc Size | 608 MB |
| Time Steps | 10 |
| Kin. Viscosity | 1.00000000 |
| Relaxation Time | 3.50000000 |
| Reynolds Number | Re < 148 |
|---------.-------'-----.-----------.-------------------.---------------------|
| MLUPs | Bandwidth | Steps/s | Current Step | Time Remaining |
| 5832 | 449 GB/s | 348 | 9993 30% | 0s |
|---------'-------------'-----------'-------------------'---------------------|
| Info: Peak MLUPs/s = 5908 |
| \ / FluidX3D Version 2.16 |
| ' Copyright (c) Dr. Moritz Lehmann |
|-----------------------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID 0 | NVIDIA GeForce RTX 3080 Ti Laptop GPU |
| Device ID 1 | Intel(R) UHD Graphics 770 |
|----------------'------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID | 0 |
| Device Name | NVIDIA GeForce RTX 3080 Ti Laptop GPU |
| Device Vendor | NVIDIA Corporation |
| Device Driver | 516.40 (Windows) |
| OpenCL Version | OpenCL C 1.2 |
| Compute Units | 58 at 1590 MHz (7424 cores, 23.608 TFLOPs/s) |
| Memory, Cache | 16383 MB, 1624 KB global / 48 KB local |
| Buffer Limits | 4095 MB global, 64 KB constant |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled. |
| Info: Allocating memory. This may take a few seconds. |
|-----------------.-----------------------------------------------------------|
| Grid Resolution | 256 x 256 x 256 = 16777216 |
| Grid Domains | 1 x 1 x 1 = 1 |
| LBM Type | D3Q19 SRT (FP32/FP16C) |
| Memory Usage | CPU 272 MB, GPU 1x 880 MB |
| Max Alloc Size | 608 MB |
| Time Steps | 10 |
| Kin. Viscosity | 1.00000000 |
| Relaxation Time | 3.50000000 |
| Reynolds Number | Re < 148 |
|---------.-------'-----.-----------.-------------------.---------------------|
| MLUPs | Bandwidth | Steps/s | Current Step | Time Remaining |
| 5759 | 443 GB/s | 343 | 9983 30% | 0s |
|---------'-------------'-----------'-------------------'---------------------|
| Info: Peak MLUPs/s = 5780 |
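The Memory Usage and Max Alloc Size rows of these runs can be cross-checked against the grid size. This is a rough sketch, assuming each cell stores 19 densities (4 bytes each in FP32, 2 bytes each in FP16S/FP16C) plus 17 bytes for density, velocity and a flag, and that the largest single buffer is the one holding the densities. The layout is an assumption that happens to reproduce the printed numbers. The helper names are made up for illustration.

```python
CELLS = 256 ** 3  # grid resolution used in the benchmark
MIB = 2 ** 20     # the "MB" in the output appears to be 2^20 bytes


def gpu_memory_mb(cells, density_bytes):
    # Assumed layout: 19 densities + rho (4 bytes) + velocity (3 x 4 bytes) + flag (1 byte)
    bytes_per_cell = 19 * density_bytes + 4 + 3 * 4 + 1
    return cells * bytes_per_cell // MIB


def max_alloc_mb(cells, density_bytes):
    # Assumed: the largest single allocation is the density buffer
    return cells * 19 * density_bytes // MIB


for name, size in (("FP32", 4), ("FP16S/FP16C", 2)):
    print(name, gpu_memory_mb(CELLS, size), "MB total,",
          max_alloc_mb(CELLS, size), "MB largest buffer")
# FP32 1488 MB total, 1216 MB largest buffer
# FP16S/FP16C 880 MB total, 608 MB largest buffer
```

This is why the FP16 modes fit the same grid into roughly 60% of the GPU memory that FP32 needs.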
And just for giggles... (the slowest benchmark here so far :-)
| \ / FluidX3D Version 2.16 |
| ' Copyright (c) Dr. Moritz Lehmann |
|-----------------------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID 0 | NVIDIA GeForce RTX 3080 Ti Laptop GPU |
| Device ID 1 | Intel(R) UHD Graphics 770 |
|----------------'------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID | 1 |
| Device Name | Intel(R) UHD Graphics 770 |
| Device Vendor | Intel(R) Corporation |
| Device Driver | 31.0.101.3962 (Windows) |
| OpenCL Version | OpenCL C 1.2 |
| Compute Units | 32 at 1550 MHz (256 cores, 0.794 TFLOPs/s) |
| Memory, Cache | 12955 MB, 1920 KB global / 64 KB local |
| Buffer Limits | 4095 MB global, 4194296 KB constant |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled. |
| Info: Allocating memory. This may take a few seconds. |
|-----------------.-----------------------------------------------------------|
| Grid Resolution | 256 x 256 x 256 = 16777216 |
| Grid Domains | 1 x 1 x 1 = 1 |
| LBM Type | D3Q19 SRT (FP32/FP16C) |
| Memory Usage | CPU 272 MB, GPU 1x 880 MB |
| Max Alloc Size | 608 MB |
| Time Steps | 10 |
| Kin. Viscosity | 1.00000000 |
| Relaxation Time | 3.50000000 |
| Reynolds Number | Re < 148 |
|---------.-------'-----.-----------.-------------------.---------------------|
| MLUPs | Bandwidth | Steps/s | Current Step | Time Remaining |
| 243 | 19 GB/s | 14 | 9999 90% | 0s |
|---------'-------------'-----------'-------------------'---------------------|
| Info: Peak MLUPs/s = 246 |
C:\Users\cnd\Downloads\FluidX3D>bin\FluidX3D.exe -h
Lattice Boltzmann CFD software by Dr. Moritz Lehmann
Usage:
bin\FluidX3D.exe [OPTION...]
-h, --help Print help
-x arg X proportion factor (default: 1.0)
-y arg Y proportion factor (default: 1.0)
-z arg Z proportion factor (default: 1.0)
-r, --resolution arg Resolution (default: 4096)
--re arg Reynolds number (default: 100000.0)
-u arg Velocity (default: 0.1)
-t, --time arg Time (default: 10000)
--scale arg Scale (default: 0.9)
-f, --file arg Filename (default: input.stl)
-a, --aoa arg Angle of attack (default: -5.0)
--camx arg Camera X (default: 19.0)
--camy arg Camera Y (default: 19.1)
--camz arg Camera Z (default: 19.2)
--camzoom arg Camera Zoom (default: 1.0)
--camrx arg Camera Rotation X (default: 33.0)
--camry arg Camera Rotation Y (default: 42.0)
--camfov arg Camera Field of View (default: 68.0)
-s, --secs arg Seconds (default: 10.0)
-w, --window Enable window instead of fullscreen mode
--wait Wait for keypress befor ending
--pause Do not auto-start the simulation
-d, --display arg Display (default: 0,1)
@gitcnd Are both DIMM slots on the laptop populated for the Intel iGPU benchmark? If not, the results would be even slower... :smile:
Yes - everything is populated and replaced for max performance (including special low-latency RAM: I replaced the originals).
This was the fastest laptop in the world when I finished upgrading it :-)
RTX6000ADA // Without-ECC
and Threadripper pro 7995WX// Not-OC
Ryzen 7 7800X3D, FP16S
Ryzen 7 7800X3D, FP16C
Ryzen 7 7800X3D, FP32
Hi @roktmanskip, thanks a lot! That's the AMD Radeon Graphics iGPU. What memory speed are you running, and is it 2x 8GB dual channel?
Can you please test the CPU itself as well? I'm curious how it performs. For this, install the Intel CPU Runtime for OpenCL, then start the executables from within CMD with the device ID. Type
cmd
in the address bar and hit Enter, then run:
FluidX3D-Benchmark-FP32-FP32-Windows.exe 2
FluidX3D-Benchmark-FP32-FP16S-Windows.exe 2
FluidX3D-Benchmark-FP32-FP16C-Windows.exe 2
(You might need a different device index than 2, depending on the order in which your CPU is listed.)
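If typing the three commands by hand gets tedious, they could be scripted. This is a small sketch, assuming Python is installed and the script sits in the same folder as the benchmark executables. The helper names are hypothetical, and the device index must match whatever the device list at the top of the output shows for your CPU.

```python
import subprocess

# Executable names as given in the thread
EXECUTABLES = [
    "FluidX3D-Benchmark-FP32-FP32-Windows.exe",
    "FluidX3D-Benchmark-FP32-FP16S-Windows.exe",
    "FluidX3D-Benchmark-FP32-FP16C-Windows.exe",
]


def benchmark_commands(device_index):
    """Build one command line per precision mode for the given device index."""
    return [[exe, str(device_index)] for exe in EXECUTABLES]


def run_benchmarks(device_index):
    """Run the three benchmarks one after another (not called automatically)."""
    for cmd in benchmark_commands(device_index):
        subprocess.run(cmd, check=True)


# Only print the commands here; call run_benchmarks(2) to actually start them.
for cmd in benchmark_commands(2):
    print(" ".join(cmd))
```

Running them one at a time keeps the runs from competing for memory bandwidth, which would distort the MLUPs/s numbers.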
Thanks!
2x 16 GB, 6400 MHz
You are welcome to report your benchmark results for the FP32/FP16S/FP16C accuracy levels here. Numbers for AMD GPUs of the GCN/RDNA/RDNA2 architectures are especially desired. Thank you!