ProjectPhysX / FluidX3D

The fastest and most memory efficient lattice Boltzmann CFD software, running on all GPUs via OpenCL. Free for non-commercial use.
https://youtube.com/@ProjectPhysX

Report your benchmark results here! #8

Open ProjectPhysX opened 1 year ago

ProjectPhysX commented 1 year ago

You are welcome to report your benchmark results for the FP32/FP16S/FP16C accuracy levels here. Numbers for AMD GPUs with GCN/RDNA/RDNA2 architectures are especially welcome. Thank you!

makisukurisu commented 1 year ago

Nvidia RTX 3050 Ti Laptop (M, or Mobile, was used earlier for 3050)

FP16-C: 2253 MLUPs/s

FP16-S: 2341 MLUPs/s

FP32: 1181 MLUPs/s

Kingfire4545 commented 1 year ago

GTX 1650 Laptop GPU

[Screenshots: FP32/FP16C, FP32/FP16S, FP32/FP32]

Dango3 commented 1 year ago

AMD RX 5600 XT

[Screenshots: FP16C, FP16S, FP32]

einhander commented 1 year ago

Laptop with AMD 5300U: FP32/FP16C is dramatically slow :(

.-----------------------------------------------------------------------------.
|                       ______________   ______________                       |
|                       \   ________  | |  ________   /                       |
|                        \  \       | | | |       /  /                        |
|                         \  \      | | | |      /  /                         |
|                          \  \     | | | |     /  /                          |
|                           \  \_.-"  | |  "-._/  /                           |
|                            \    _.-" _ "-._    /                            |
|                             \.-" _.-" "-._ "-./                             |
|                               .-"  .-"-.  "-.                               |
|                               \  v"     "v  /                               |
|                                \  \     /  /                                |
|                                 \  \   /  /                                 |
|                                  \  \ /  /                                  |
|                                   \  '  /                                   |
|                                    \   /                                    |
|                                     \ /                FluidX3D Version 2.8 |
|                                      '     Copyright (c) Dr. Moritz Lehmann |
|-----------------------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID    0 | AMD Radeon Graphics (renoir, LLVM 15.0.6, DRM 3.52, 6.3.0-1-amd64) |
| Device ID    1 | gfx90c:xnack-                                              |
|----------------'------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID      | 1                                                          |
| Device Name    | gfx90c:xnack-                                              |
| Device Vendor  | Advanced Micro Devices, Inc.                               |
| Device Driver  | 3581.0 (HSA1.1,LC)                                         |
| OpenCL Version | OpenCL C 2.0                                               |
| Compute Units  | 6 at 1500 MHz (384 cores, 1.152 TFLOPs/s)                  |
| Memory, Cache  | 1024 MB, 16 KB global / 64 KB local                        |
| Buffer Limits  | 870 MB global, 891289 KB constant                          |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled.                                  |
| Info: Allocating memory. This may take a few seconds.                       |
|-----------------.-----------------------------------------------------------|
| Grid Resolution |                                256 x 256 x 256 = 16777216 |
| Grid Domains    |                                             1 x 1 x 1 = 1 |
| LBM Type        |                                    D3Q19 SRT (FP32/FP16C) |
| Memory Usage    |                                 CPU 272 MB, GPU 1x 880 MB |
| Max Alloc Size  |                                                    608 MB |
| Time Steps      |                                                        10 |
| Kin. Viscosity  |                                                1.00000000 |
| Relaxation Time |                                                3.50000000 |
| Reynolds Number |                                                  Re < 148 |
|---------.-------'-----.-----------.-------------------.---------------------|
| MLUPs   | Bandwidth   | Steps/s   | Current Step      | Time Remaining      |
|     145 |     11 GB/s |         9 |         3206  60% |                  0s |
ayaromenok commented 1 year ago

NVIDIA RTX A4000 (Ampere). In short, peak MLUPs/s: FP16S: 4945, FP16C: 4664, FP32: 2500

FP16S:

|                                     \ /                FluidX3D Version 2.9 |
|                                      '     Copyright (c) Dr. Moritz Lehmann |
|-----------------------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID    0 | NVIDIA RTX A4000                                           |
|----------------'------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID      | 0                                                          |
| Device Name    | NVIDIA RTX A4000                                           |
| Device Vendor  | NVIDIA Corporation                                         |
| Device Driver  | 535.86.10                                                  |
| OpenCL Version | OpenCL C 1.2                                               |
| Compute Units  | 48 at 1560 MHz (6144 cores, 19.169 TFLOPs/s)               |
| Memory, Cache  | 16106 MB, 1344 KB global / 48 KB local                     |
| Buffer Limits  | 4026 MB global, 64 KB constant                             |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled.                                  |
| Info: Allocating memory. This may take a few seconds.                       |
|-----------------.-----------------------------------------------------------|
| Grid Resolution |                                256 x 256 x 256 = 16777216 |
| Grid Domains    |                                             1 x 1 x 1 = 1 |
| LBM Type        |                                    D3Q19 SRT (FP32/FP16S) |
| Memory Usage    |                                 CPU 272 MB, GPU 1x 880 MB |
| Max Alloc Size  |                                                    608 MB |
| Time Steps      |                                                        10 |
| Kin. Viscosity  |                                                1.00000000 |
| Relaxation Time |                                                3.50000000 |
| Reynolds Number |                                                  Re < 148 |
|---------.-------'-----.-----------.-------------------.---------------------|
| MLUPs   | Bandwidth   | Steps/s   | Current Step      | Time Remaining      |
|    4920 |    379 GB/s |       293 |         9990   0% |                  0s |
|---------'-------------'-----------'-------------------'---------------------|
| Info: Peak MLUPs/s = 4945                                                   |

FP16C:

|                                     \ /                FluidX3D Version 2.9 |
|                                      '     Copyright (c) Dr. Moritz Lehmann |
|-----------------------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID    0 | NVIDIA RTX A4000                                           |
|----------------'------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID      | 0                                                          |
| Device Name    | NVIDIA RTX A4000                                           |
| Device Vendor  | NVIDIA Corporation                                         |
| Device Driver  | 535.86.10                                                  |
| OpenCL Version | OpenCL C 1.2                                               |
| Compute Units  | 48 at 1560 MHz (6144 cores, 19.169 TFLOPs/s)               |
| Memory, Cache  | 16106 MB, 1344 KB global / 48 KB local                     |
| Buffer Limits  | 4026 MB global, 64 KB constant                             |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled.                                  |
| Info: Allocating memory. This may take a few seconds.                       |
|-----------------.-----------------------------------------------------------|
| Grid Resolution |                                256 x 256 x 256 = 16777216 |
| Grid Domains    |                                             1 x 1 x 1 = 1 |
| LBM Type        |                                    D3Q19 SRT (FP32/FP16C) |
| Memory Usage    |                                 CPU 272 MB, GPU 1x 880 MB |
| Max Alloc Size  |                                                    608 MB |
| Time Steps      |                                                        10 |
| Kin. Viscosity  |                                                1.00000000 |
| Relaxation Time |                                                3.50000000 |
| Reynolds Number |                                                  Re < 148 |
|---------.-------'-----.-----------.-------------------.---------------------|
| MLUPs   | Bandwidth   | Steps/s   | Current Step      | Time Remaining      |
|    4516 |    348 GB/s |       269 |         9988  80% |                  0s |
|---------'-------------'-----------'-------------------'---------------------|
| Info: Peak MLUPs/s = 4664                                                   |

FP32:

|                                     \ /                FluidX3D Version 2.9 |
|                                      '     Copyright (c) Dr. Moritz Lehmann |
|-----------------------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID    0 | NVIDIA RTX A4000                                           |
|----------------'------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID      | 0                                                          |
| Device Name    | NVIDIA RTX A4000                                           |
| Device Vendor  | NVIDIA Corporation                                         |
| Device Driver  | 535.86.10                                                  |
| OpenCL Version | OpenCL C 1.2                                               |
| Compute Units  | 48 at 1560 MHz (6144 cores, 19.169 TFLOPs/s)               |
| Memory, Cache  | 16106 MB, 1344 KB global / 48 KB local                     |
| Buffer Limits  | 4026 MB global, 64 KB constant                             |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled.                                  |
| Info: Allocating memory. This may take a few seconds.                       |
|-----------------.-----------------------------------------------------------|
| Grid Resolution |                                256 x 256 x 256 = 16777216 |
| Grid Domains    |                                             1 x 1 x 1 = 1 |
| LBM Type        |                                     D3Q19 SRT (FP32/FP32) |
| Memory Usage    |                                CPU 272 MB, GPU 1x 1488 MB |
| Max Alloc Size  |                                                   1216 MB |
| Time Steps      |                                                        10 |
| Kin. Viscosity  |                                                1.00000000 |
| Relaxation Time |                                                3.50000000 |
| Reynolds Number |                                                  Re < 148 |
|---------.-------'-----.-----------.-------------------.---------------------|
| MLUPs   | Bandwidth   | Steps/s   | Current Step      | Time Remaining      |
|    2496 |    382 GB/s |       149 |         9999  90% |                  0s |
|---------'-------------'-----------'-------------------'---------------------|
| Info: Peak MLUPs/s = 2500                                                   |
charismatest commented 1 year ago

NVIDIA T500

FP16S

|                                     \ /                FluidX3D Version 2.9 |
|                                      '     Copyright (c) Dr. Moritz Lehmann |
|-----------------------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID    0 | Intel(R) Iris(R) Xe Graphics                               |
| Device ID    1 | NVIDIA T500                                                |
|----------------'------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID      | 1                                                          |
| Device Name    | NVIDIA T500                                                |
| Device Vendor  | NVIDIA Corporation                                         |
| Device Driver  | 529.08                                                     |
| OpenCL Version | OpenCL C 1.2                                               |
| Compute Units  | 14 at 1695 MHz (1792 cores, 6.075 TFLOPs/s)                |
| Memory, Cache  | 4095 MB, 448 KB global / 48 KB local                       |
| Buffer Limits  | 1023 MB global, 64 KB constant                             |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled.                                  |
| Info: Allocating memory. This may take a few seconds.                       |
|-----------------.-----------------------------------------------------------|
| Grid Resolution |                                256 x 256 x 256 = 16777216 |
| Grid Domains    |                                             1 x 1 x 1 = 1 |
| LBM Type        |                                    D3Q19 SRT (FP32/FP16S) |
| Memory Usage    |                                 CPU 272 MB, GPU 1x 880 MB |
| Max Alloc Size  |                                                    608 MB |
| Time Steps      |                                                        10 |
| Kin. Viscosity  |                                                1.00000000 |
| Relaxation Time |                                                3.50000000 |
| Reynolds Number |                                                  Re < 148 |
|---------.-------'-----.-----------.-------------------.---------------------|
| MLUPs   | Bandwidth   | Steps/s   | Current Step      | Time Remaining      |
|     573 |     44 GB/s |        34 |         9998  80% |                  0s |
|---------'-------------'-----------'-------------------'---------------------|
| Info: Peak MLUPs/s = 578

FP16C

|                                     \ /                FluidX3D Version 2.9 |
|                                      '     Copyright (c) Dr. Moritz Lehmann |
|-----------------------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID    0 | Intel(R) Iris(R) Xe Graphics                               |
| Device ID    1 | NVIDIA T500                                                |
|----------------'------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID      | 1                                                          |
| Device Name    | NVIDIA T500                                                |
| Device Vendor  | NVIDIA Corporation                                         |
| Device Driver  | 529.08                                                     |
| OpenCL Version | OpenCL C 1.2                                               |
| Compute Units  | 14 at 1695 MHz (1792 cores, 6.075 TFLOPs/s)                |
| Memory, Cache  | 4095 MB, 448 KB global / 48 KB local                       |
| Buffer Limits  | 1023 MB global, 64 KB constant                             |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled.                                  |
| Info: Allocating memory. This may take a few seconds.                       |
|-----------------.-----------------------------------------------------------|
| Grid Resolution |                                256 x 256 x 256 = 16777216 |
| Grid Domains    |                                             1 x 1 x 1 = 1 |
| LBM Type        |                                    D3Q19 SRT (FP32/FP16C) |
| Memory Usage    |                                 CPU 272 MB, GPU 1x 880 MB |
| Max Alloc Size  |                                                    608 MB |
| Time Steps      |                                                        10 |
| Kin. Viscosity  |                                                1.00000000 |
| Relaxation Time |                                                3.50000000 |
| Reynolds Number |                                                  Re < 148 |
|---------.-------'-----.-----------.-------------------.---------------------|
| MLUPs   | Bandwidth   | Steps/s   | Current Step      | Time Remaining      |
|     653 |     50 GB/s |        39 |         9998  80% |                  0s |
|---------'-------------'-----------'-------------------'---------------------|
| Info: Peak MLUPs/s = 665

FP32

|                                     \ /                FluidX3D Version 2.9 |
|                                      '     Copyright (c) Dr. Moritz Lehmann |
|-----------------------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID    0 | Intel(R) Iris(R) Xe Graphics                               |
| Device ID    1 | NVIDIA T500                                                |
|----------------'------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID      | 1                                                          |
| Device Name    | NVIDIA T500                                                |
| Device Vendor  | NVIDIA Corporation                                         |
| Device Driver  | 529.08                                                     |
| OpenCL Version | OpenCL C 1.2                                               |
| Compute Units  | 14 at 1695 MHz (1792 cores, 6.075 TFLOPs/s)                |
| Memory, Cache  | 4095 MB, 448 KB global / 48 KB local                       |
| Buffer Limits  | 1023 MB global, 64 KB constant                             |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled.                                  |
| Info: Allocating memory. This may take a few seconds.                       |
|-----------------.-----------------------------------------------------------|
| Grid Resolution |                                256 x 256 x 256 = 16777216 |
| Grid Domains    |                                             1 x 1 x 1 = 1 |
| LBM Type        |                                     D3Q19 SRT (FP32/FP32) |
| Memory Usage    |                                CPU 272 MB, GPU 1x 1488 MB |
| Max Alloc Size  |                                                   1216 MB |
| Time Steps      |                                                        10 |
| Kin. Viscosity  |                                                1.00000000 |
| Relaxation Time |                                                3.50000000 |
| Reynolds Number |                                                  Re < 148 |
|---------.-------'-----.-----------.-------------------.---------------------|
| MLUPs   | Bandwidth   | Steps/s   | Current Step      | Time Remaining      |
|     335 |     51 GB/s |        20 |         9999  90% |                  0s |
|---------'-------------'-----------'-------------------'---------------------|
| Info: Peak MLUPs/s = 339
umbertones commented 1 year ago

Nvidia RTX A2000

[Screenshot: FP32]

nickrooney commented 1 year ago

AMD Radeon RX 6700 XT

[Screenshots: FP32, FP16S, FP16C]

biergaizi commented 10 months ago

I'm reporting the performance of Nvidia CMP 170HX.

It's worth introducing some background first. This is a peculiar mining-only GPU based on the Ampere architecture and the Tesla A100 silicon. I purchased it second-hand from a closed mining farm to see if it's any good for running simulations. On one hand, this mining card has the Tesla A100's GA100 silicon and 1500 GB/s of DRAM bandwidth. On the other hand, Nvidia knows the hardware specs are attractive to HPC and AI users beyond mining, so they tried as best they could to make this GPU useless for those purposes. They did this by locking down everything besides memory bandwidth and integer operations: memory size, PCIe speed, and FP32/FP64 throughput are all rendered almost useless (no pun intended).

In fact, the locked-down FP32 performance on this GPU is so slow that it turns the lattice Boltzmann method from a memory-bound kernel into a compute-bound one. Hence, FP32/FP16S and FP32/FP16C are not faster than FP32/FP32. Just unbelievable.
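As a rough sanity check of the compute-bound claim, the sketch below converts the peak MLUPs/s figures from this post into effective memory traffic and compares them against the measured coalesced read bandwidth. The 153 and 77 bytes per cell update are assumptions: they are the values implied by the MLUPs and Bandwidth columns of the benchmark tables in this thread (for example, 2496 MLUPs/s next to 382 GB/s for FP32/FP32). The function name is made up for illustration.

```python
# Bytes moved per lattice cell update for D3Q19.
# Assumed values, inferred from the MLUPs and Bandwidth columns in this thread.
BYTES_PER_CELL = {"FP32/FP32": 153, "FP32/FP16S": 77, "FP32/FP16C": 77}


def effective_bandwidth_gbs(mlups, precision):
    """Effective memory traffic in GB/s for a given MLUPs/s rate."""
    return mlups * 1e6 * BYTES_PER_CELL[precision] / 1e9


measured_read_gbs = 1198.53  # coalesced read bandwidth reported for the CMP 170HX
cmp_170hx_peak = {"FP32/FP32": 2276, "FP32/FP16S": 2250, "FP32/FP16C": 2266}

utilization = {}
for precision, mlups in cmp_170hx_peak.items():
    bw = effective_bandwidth_gbs(mlups, precision)
    utilization[precision] = bw / measured_read_gbs
    print(f"{precision}: {bw:.0f} GB/s, {utilization[precision]:.0%} of measured bandwidth")
```

Under these assumptions, FP32/FP32 uses roughly 29% of the measured bandwidth and the FP16 modes about 14 to 15%, which is consistent with the kernel being limited by arithmetic rather than by memory.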

Memory & FLOPS

For a baseline check, the GPU really does deliver about 1200 GB/s of coalesced read bandwidth from its HBM2e, but everything else is locked down.

.-----------------------------------------------------------------------------.
|----------------.------------------------------------------------------------|
| Device ID    0 | NVIDIA Graphics Device                                     |
|----------------'------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID      | 0                                                          |
| Device Name    | NVIDIA Graphics Device                                     |
| Device Vendor  | NVIDIA Corporation                                         |
| Device Driver  | 535.104.05                                                 |
| OpenCL Version | OpenCL C 1.2                                               |
| Compute Units  | 70 at 1410 MHz (8960 cores, 25.267 TFLOPs/s)               |
| Memory, Cache  | 7961 MB, 1960 KB global / 48 KB local                      |
| Buffer Limits  | 1990 MB global, 64 KB constant                             |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled.                                  |
| FP64  compute                                         0.189 TFLOPs/s (1/64) |
| FP32  compute                                         0.395 TFLOPs/s (1/64) |
| FP16  compute                                          not supported        |
| INT64 compute                                         2.516  TIOPs/s (1/12) |
| INT32 compute                                        12.493  TIOPs/s (1/2 ) |
| INT16 compute                                        10.013  TIOPs/s (1/3 ) |
| INT8  compute                                        10.219  TIOPs/s (1/3 ) |
| Memory Bandwidth ( coalesced read      )                       1198.53 GB/s |
| Memory Bandwidth ( coalesced      write)                       1336.89 GB/s |
| Memory Bandwidth (misaligned read      )                        793.58 GB/s |
| Memory Bandwidth (misaligned      write)                        139.86 GB/s |
| PCIe   Bandwidth (send                 )                          0.81 GB/s |
| PCIe   Bandwidth (   receive           )                          0.84 GB/s |
| PCIe   Bandwidth (        bidirectional)            (Gen1 x16)    0.82 GB/s |
|-----------------------------------------------------------------------------|
|-----------------------------------------------------------------------------|
| Done. Press Enter to exit.                                                  |
'-----------------------------------------------------------------------------'

It's not tested by this benchmark, but according to gpu-burn, you get 6235 Gflop/s with Tensor Cores, which surprisingly are not totally locked down.

$ ./gpu_burn -tc 3600                                                             
Using compare file: compare.ptx
Burning for 3600 seconds.
GPU 0: NVIDIA Graphics Device (UUID: GPU-b40dd576-2299-87fc-7652-05377f5cf6e7)
Initialized device 0 with 7961 MB of memory (7660 MB available, using 6894 MB of it), using FLOATS, using Tensor Cores
Results are 268435456 bytes each, thus performing 24 iterations
10.1%  proc'd: 2064 (6235 Gflop/s)   errors: 0   temps: 32 C 
        Summary at:   Tue Oct 24 04:29:03 UTC 2023

Also, it's worth noting that the reported "Gen1 x16" is incorrect. There are two layers of PCIe bandwidth lockdown. The first is a hardware or VBIOS lockdown to PCIe Gen 1; the second is a circuit-board-level lockdown from x16 to x4, done by omitting the AC coupling capacitors. The PCB-level lockdown should be trivially bypassable by hardware modification, but the link would still run at Gen 1, so that would be of limited use.

        GPU Link Info
            PCIe Generation
                Max                       : 2
                Current                   : 1
                Device Current            : 1
                Device Max                : 1
                Host Max                  : 3
            Link Width
                Max                       : 16x
                Current                   : 4x
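
The measured transfer rates can be checked against the theoretical link rate. PCIe Gen 1 signals at 2.5 GT/s per lane with 8b/10b encoding, which works out to 250 MB/s per lane per direction before protocol overhead. The helper below is a hypothetical illustration, not part of any tool used in this thread.

```python
def pcie_gen1_gbs(lanes):
    """Theoretical PCIe Gen 1 throughput per direction in GB/s, before protocol overhead."""
    transfers_per_s = 2.5e9  # 2.5 GT/s per lane
    encoding = 8 / 10        # 8b/10b line encoding: 8 payload bits per 10 bits on the wire
    return transfers_per_s * encoding / 8 * lanes / 1e9  # bits to bytes, then to GB


x4 = pcie_gen1_gbs(4)
x16 = pcie_gen1_gbs(16)
measured = 0.82  # bidirectional figure from the benchmark above, in GB/s
print(f"Gen1 x4: {x4:.2f} GB/s, Gen1 x16: {x16:.2f} GB/s, measured: {measured} GB/s")
```

The measured 0.81 to 0.84 GB/s fits just under the 1 GB/s ceiling of a Gen1 x4 link and is far below the 4 GB/s a Gen1 x16 link could carry.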

FP32/FP32

|----------------.------------------------------------------------------------|
| Device ID    0 | NVIDIA Graphics Device                                     |
|----------------'------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID      | 0                                                          |
| Device Name    | NVIDIA Graphics Device                                     |
| Device Vendor  | NVIDIA Corporation                                         |
| Device Driver  | 535.104.05                                                 |
| OpenCL Version | OpenCL C 1.2                                               |
| Compute Units  | 70 at 1410 MHz (8960 cores, 25.267 TFLOPs/s)               |
| Memory, Cache  | 7961 MB, 1960 KB global / 48 KB local                      |
| Buffer Limits  | 1990 MB global, 64 KB constant                             |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled.                                  |
| Info: Allocating memory. This may take a few seconds.                       |
|-----------------.-----------------------------------------------------------|
| Grid Resolution |                                256 x 256 x 256 = 16777216 |
| Grid Domains    |                                             1 x 1 x 1 = 1 |
| LBM Type        |                                     D3Q19 SRT (FP32/FP32) |
| Memory Usage    |                                CPU 272 MB, GPU 1x 1488 MB |
| Max Alloc Size  |                                                   1216 MB |
| Time Steps      |                                                        10 |
| Kin. Viscosity  |                                                1.00000000 |
| Relaxation Time |                                                3.50000000 |
| Reynolds Number |                                                  Re < 148 |
|---------.-------'-----.-----------.-------------------.---------------------|
| MLUPs   | Bandwidth   | Steps/s   | Current Step      | Time Remaining      |
|    2276 |    348 GB/s |       136 |         9999  90% |                  0s |
|---------'-------------'-----------'-------------------'---------------------|
| Info: Peak MLUPs/s = 2276                                                   |

FP32/FP16S

|----------------.------------------------------------------------------------|
| Device ID    0 | NVIDIA Graphics Device                                     |
|----------------'------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID      | 0                                                          |
| Device Name    | NVIDIA Graphics Device                                     |
| Device Vendor  | NVIDIA Corporation                                         |
| Device Driver  | 535.104.05                                                 |
| OpenCL Version | OpenCL C 1.2                                               |
| Compute Units  | 70 at 1410 MHz (8960 cores, 25.267 TFLOPs/s)               |
| Memory, Cache  | 7961 MB, 1960 KB global / 48 KB local                      |
| Buffer Limits  | 1990 MB global, 64 KB constant                             |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled.                                  |
| Info: Allocating memory. This may take a few seconds.                       |
|-----------------.-----------------------------------------------------------|
| Grid Resolution |                                256 x 256 x 256 = 16777216 |
| Grid Domains    |                                             1 x 1 x 1 = 1 |
| LBM Type        |                                    D3Q19 SRT (FP32/FP16S) |
| Memory Usage    |                                 CPU 272 MB, GPU 1x 880 MB |
| Max Alloc Size  |                                                    608 MB |
| Time Steps      |                                                        10 |
| Kin. Viscosity  |                                                1.00000000 |
| Relaxation Time |                                                3.50000000 |
| Reynolds Number |                                                  Re < 148 |
|---------.-------'-----.-----------.-------------------.---------------------|
| MLUPs   | Bandwidth   | Steps/s   | Current Step      | Time Remaining      |
|    2250 |    173 GB/s |       134 |         9996  60% |                  0s |
|---------'-------------'-----------'-------------------'---------------------|
| Info: Peak MLUPs/s = 2250

FP32/FP16C

|----------------.------------------------------------------------------------|
| Device ID      | 0                                                          |
| Device Name    | NVIDIA Graphics Device                                     |
| Device Vendor  | NVIDIA Corporation                                         |
| Device Driver  | 535.104.05                                                 |
| OpenCL Version | OpenCL C 1.2                                               |
| Compute Units  | 70 at 1410 MHz (8960 cores, 25.267 TFLOPs/s)               |
| Memory, Cache  | 7961 MB, 1960 KB global / 48 KB local                      |
| Buffer Limits  | 1990 MB global, 64 KB constant                             |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled.                                  |
| Info: Allocating memory. This may take a few seconds.                       |
|-----------------.-----------------------------------------------------------|
| Grid Resolution |                                256 x 256 x 256 = 16777216 |
| Grid Domains    |                                             1 x 1 x 1 = 1 |
| LBM Type        |                                    D3Q19 SRT (FP32/FP16C) |
| Memory Usage    |                                 CPU 272 MB, GPU 1x 880 MB |
| Max Alloc Size  |                                                    608 MB |
| Time Steps      |                                                        10 |
| Kin. Viscosity  |                                                1.00000000 |
| Relaxation Time |                                                3.50000000 |
| Reynolds Number |                                                  Re < 148 |
|---------.-------'-----.-----------.-------------------.---------------------|
| MLUPs   | Bandwidth   | Steps/s   | Current Step      | Time Remaining      |
|    2266 |    174 GB/s |       135 |         9999  90% |                  0s |
|---------'-------------'-----------'-------------------'---------------------|
| Info: Peak MLUPs/s = 2266
ProjectPhysX commented 10 months ago

@biergaizi omg, the 170HX is literally e-waste. Thanks a lot for the benchmarks! Is it possible to remove some of the crippling by modifying the VBIOS?

biergaizi commented 10 months ago

omg, the 170HX is literally e-waste.

I purchased this GPU specifically for developing and experimenting with GPU-accelerated FDTD electromagnetic field simulations. A naive implementation of this algorithm has an arithmetic intensity of 0.25 FLOPs/byte, which puts it on the far-left, memory-bound side of the roofline curve. Thus, my test using the FDTD kernel shows the GPU still gives me a huge acceleration even with this level of performance lockdown. This is 1.5x faster than the Radeon VII / Instinct MI50, on par with the actual A100. This is currently unmatched by everything else (if rumors are to be believed, RTX 5090 is eventually going to match it at 1.5 TB/s in 2024).
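As a rough illustration of the roofline argument above: attainable throughput is the smaller of the compute ceiling and memory bandwidth times arithmetic intensity. The sketch below is a minimal host-side C++ helper. The function names are hypothetical, and the inputs in the usage note (1500 GB/s, 395 GFLOPS) are illustrative figures quoted in this thread, not measurements of the FDTD kernel.

```cpp
#include <algorithm>

// Basic roofline model: attainable throughput in GFLOPs/s.
//   peak_gflops   - compute ceiling of the device
//   bandwidth_gbs - memory bandwidth in GB/s
//   intensity     - arithmetic intensity of the kernel in FLOPs/byte
double roofline_gflops(double peak_gflops, double bandwidth_gbs, double intensity) {
    return std::min(peak_gflops, bandwidth_gbs * intensity);
}

// Arithmetic intensity at which a kernel stops being memory-bound
// (the "ridge point" of the roofline).
double ridge_intensity(double peak_gflops, double bandwidth_gbs) {
    return peak_gflops / bandwidth_gbs;
}
```

For example, `roofline_gflops(395.0, 1500.0, 0.25)` evaluates to 375: at 0.25 FLOPs/byte the memory roof sits below even a heavily restricted compute ceiling, so such a kernel stays bandwidth-bound.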

Real-world simulations won't be as fast, VRAM and PCIe are going to be serious bottlenecks... Still, not the worst $500 I've spent... (doesn't change the fact that it's still a case of caveat emptor, do not buy unless you know exactly what you're buying...)

Is it possible to remove some of the crippling by modifying the VBIOS?

I'm not well-versed in the GPU modding scene, so this is something I would also like to know... According to public information, Nvidia has enforced VBIOS digital signature checks in recent years, making VBIOS modification impossible. Last month, a bypass was found for RTX 20-series "Turing" based GPUs, but it's not applicable to Ampere GPUs. I'm afraid that VBIOS modding is not possible.

biergaizi commented 10 months ago

For the purpose of writing a review article, I'm now trying to plot a roofline model of the CMP 170HX. According to the documentation,

363 (FP32/FP32) or 406 (FP32/FP16S) or 1275 (FP32/FP16C) FLOPs per time step (FP32+INT32 operations counted combined)

Thus, 1 MLUPs/s is equivalent to 363 MFLOPS, and 2276 MLUPs/s implies a performance of 826.188 GFLOPS? But this is higher than the CMP 170HX's FP32 performance of 395 GFLOPS as measured by benchmarks. How can FluidX3D run at a speed above the hardware FP32 limit? The documentation says the FLOPs count is in fact "FP32+INT32 operations counted combined". Does it mean that the quoted number "363 FLOPs" does not really represent pure-FP32 operations, but also includes INT32 operations? If so, it would explain my result, since the CMP 170HX's integer performance is an uncapped 12 TIOPS. Do you have a detailed breakdown between FP32 and INT32 operations?

ProjectPhysX commented 10 months ago

Hi @biergaizi,

in the FP32/FP32 setting, 1 cell update takes 363 arithmetic operations consisting of 261 FP32 flops + 102 INT32 ops. 1 MLUPs/s means 1 million cell updates per second, so 2276 MLUPs/s means 2276E6 LUPs/s * 363 Ops/LUP = 826 GOps/s, consisting of 594 GFlops/s (FP32) and 232 GIops/s (INT32).

The OpenCL-Benchmark measures Flops only with fused-multiply-add (FMA) instructions that can do 2 Flops/cycle; all other floating-point operations do 1 Flop/cycle or less. It's possible that Nvidia's artificial restrictions on ALUs extend differently across different types of Flops, like FMA, addition, multiplication, division, rsqrt, fmin/fmax, trigonometry etc. Not all Flops are created equal. Likely FMA is crippled the most, which would explain the very low benchmark performance. FluidX3D uses not only FMA but some other operations too, which might not be crippled as much.

Thanks for sharing your super interesting findings here and on Mastodon!!

Kind regards, Moritz

biergaizi commented 10 months ago

Not all Flops are created equal. Likely FMA is crippled the most, which would explain the very low benchmark performance.

Thanks for the hint. I just tried a self-written benchmark in SYCL with and without FMA enabled (disabled via LLVM's -ffp-contract=off). I found that the CMP 170HX's non-FMA single-precision performance is ~6200 GFLOPS, the same number that gpu-burn reports in mixed-precision mode.

Hence, FMA on this card is basically disabled, while regular FP multiply and add are limited but still functional, with low but acceptable performance. 6200 GFLOPS is around the speed of a Titan X (Maxwell) or an RTX 2060. A hypothetical RTX 2060 with extremely fast memory is still useful for many HPC simulations. Unfortunately, because nearly all GPU code assumes FMA is functional, there's no way to explicitly avoid its use. Even if it can be manually disabled by modifying the OpenCL runtime, if the original code was written with FMA (single rounding) rather than MAD (two roundings) in mind - which is almost always the case - doing so would reduce numerical precision and make the results untrustworthy.
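The difference between one rounding and two can be seen in a small host-side C++ sketch. It is an illustration only, not OpenCL device code and not taken from any project in this thread; `mul_then_add` and `fused` are hypothetical names. The `volatile` store is there to force the product to be rounded to `float` before the addition, so the compiler cannot contract the expression into an FMA.

```cpp
#include <cmath>

// Multiply, round the product to float, then add and round again
// (two roundings). The volatile store prevents FMA contraction.
float mul_then_add(float a, float b, float c) {
    volatile float product = a * b;
    return product + c;
}

// Fused multiply-add: the exact product a*b is added to c and the
// result is rounded only once.
float fused(float a, float b, float c) {
    return std::fma(a, b, c);
}
```

With `a = 1 + 2^-12` and `c = -(1 + 2^-11)`, the exact value of `a*a + c` is `2^-24`. `fused(a, a, c)` returns that value, while `mul_then_add(a, a, c)` returns 0, because the product is first rounded to `1 + 2^-11` and the small term is lost before the addition.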

In conclusion, it has become clear that Nvidia's tactic to prevent the CMP 170HX from being used for HPC tasks is a simple but effective trick - disabling FMA to break software compatibility.

Now it makes me wonder if the use of single-rounding FMA is mandatory for numerical integrity in FluidX3D.

ProjectPhysX commented 10 months ago

Hi @biergaizi,

some additional rounding should not affect results too much; the main source of error is discretization and not round-off, although round-off is very much optimized for in the current implementation. You can replace-all fma( with mad( in kernel.cpp and/or remove the -cl-fast-relaxed-math in opencl.hpp. Alternatively, replace fma( with fake_fma( in kernel.cpp and add a macro "\n #define fake_fma(a,b,c) ((a)*(b)+(c))" here.
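The second suggestion is a manual replace-all in kernel.cpp. For illustration, the same rewrite could be sketched as a runtime transformation of a kernel source string. `use_fake_fma` below is a hypothetical helper, not FluidX3D code; it only assumes that the OpenCL C source is available as a `std::string`.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: prepends the fake_fma macro and rewrites every call
// to fma( as fake_fma(. Identifiers that merely end in "fma", such as
// my_fma(, are left untouched.
std::string use_fake_fma(const std::string& kernel_source) {
    const std::string needle = "fma(";
    std::string out = "\n #define fake_fma(a,b,c) ((a)*(b)+(c))\n";
    std::size_t pos = 0;
    while (pos < kernel_source.size()) {
        const std::size_t hit = kernel_source.find(needle, pos);
        if (hit == std::string::npos) {
            out += kernel_source.substr(pos);  // no more matches: copy the rest
            break;
        }
        out.append(kernel_source, pos, hit - pos);  // copy text before the match
        // Rewrite only when "fma" is a whole identifier, i.e. the preceding
        // character is not a letter, digit, or underscore.
        const bool whole_identifier = hit == 0 ||
            !(std::isalnum(static_cast<unsigned char>(kernel_source[hit - 1])) ||
              kernel_source[hit - 1] == '_');
        out += whole_identifier ? "fake_fma(" : needle;
        pos = hit + needle.size();
    }
    return out;
}
```

For instance, `use_fake_fma("x = fma(a, b, c);")` returns the macro definition followed by `x = fake_fma(a, b, c);`.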

Kind regards, Moritz

biergaizi commented 10 months ago

You can replace-all fma( with mad( in kernel.cpp and/or remove the -cl-fast-relaxed-math in opencl.hpp.

Both FMA and MAD are restricted on this GPU. Furthermore, my experiment with OpenCL showed that if mad() or fma() has been used explicitly, there's no way to prevent the compiler from generating the corresponding instructions apart from modifying the OpenCL runtime. This is likely true across platforms.

Alternatively, replace fma( with fake_fma( in kernel.cpp and add a macro "\n #define fake_fma(a,b,c) ((a)*(b)+(c))" here.

After this modification, FP32/FP32 performance improved to 2919 MLUPs/s and finally broke the 1 TFLOPS barrier. Unfortunately, the compiler still performs some FMA/MAD transformations by default and there's no way to prevent the compiler from doing that, again, apart from modifying the OpenCL runtime or the compiler. Since the Nvidia OpenCL compiler is proprietary inside libnvidia-opencl.so.1, it's not possible without extensive reverse engineering.

Thus, I started wondering, "is it possible to replace the Nvidia OpenCL compiler with LLVM/clang?", and suddenly I remembered POCL - a free and portable OpenCL runtime based on LLVM/clang with the ability to target Nvidia PTX. I immediately installed POCL and modified the function pocl_llvm_build_program() in its source code to disable FMA/MAD transformations using the old trick of -ffp-contract=off.

And... Success! A modified FluidX3D with all FMA usage removed and a modified POCL with FMA/MAD disabled allowed FluidX3D to unleash the full power of Nvidia's GA100 silicon, at least in FP32/FP32 mode! :rocket:

|----------------.------------------------------------------------------------|
| Device ID      | 2                                                          |
| Device Name    | NVIDIA Graphics Device                                     |
| Device Vendor  | NVIDIA Corporation                                         |
| Device Driver  | 4.0                                                        |
| OpenCL Version | OpenCL C 1.2 PoCL                                          |
| Compute Units  | 70 at 1410 MHz (8960 cores, 25.267 TFLOPs/s)               |
| Memory, Cache  | 7961 MB, 0 KB global / 48 KB local                         |
| Buffer Limits  | 1990 MB global, 64 KB constant                             |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled.                                  |
| Info: Allocating memory. This may take a few seconds.                       |
|-----------------.-----------------------------------------------------------|
| Grid Resolution |                                256 x 256 x 256 = 16777216 |
| Grid Domains    |                                             1 x 1 x 1 = 1 |
| LBM Type        |                                     D3Q19 SRT (FP32/FP32) |
| Memory Usage    |                                CPU 272 MB, GPU 1x 1488 MB |
| Max Alloc Size  |                                                   1216 MB |
| Time Steps      |                                                        10 |
| Kin. Viscosity  |                                                1.00000000 |
| Relaxation Time |                                                3.50000000 |
| Reynolds Number |                                                  Re < 148 |
|---------.-------'-----.-----------.-------------------.---------------------|
| MLUPs   | Bandwidth   | Steps/s   | Current Step      | Time Remaining      |
|    7681 |   1175 GB/s |       458 |         9985  50% |                  0s |
|---------'-------------'-----------'-------------------'---------------------|
| Info: Peak MLUPs/s = 7684 

But FP32/FP16S and FP32/FP16C modes are not faster than FP32/FP32 (although faster than their unmodified versions), possibly because of similar problems in the floating-point operations involved. I wonder if they too can be worked around.

FP32/FP16S:

|-----------------.-----------------------------------------------------------|
| Grid Resolution |                                256 x 256 x 256 = 16777216 |
| Grid Domains    |                                             1 x 1 x 1 = 1 |
| LBM Type        |                                    D3Q19 SRT (FP32/FP16S) |
| Memory Usage    |                                 CPU 272 MB, GPU 1x 880 MB |
| Max Alloc Size  |                                                    608 MB |
| Time Steps      |                                                        10 |
| Kin. Viscosity  |                                                1.00000000 |
| Relaxation Time |                                                3.50000000 |
| Reynolds Number |                                                  Re < 148 |
|---------.-------'-----.-----------.-------------------.---------------------|
| MLUPs   | Bandwidth   | Steps/s   | Current Step      | Time Remaining      |
|    7315 |    563 GB/s |       436 |         9983  30% |                  0s |
|---------'-------------'-----------'-------------------'---------------------|
| Info: Peak MLUPs/s = 7316                                                   |

FP32/FP16C:

|-----------------.-----------------------------------------------------------|
| Grid Resolution |                                256 x 256 x 256 = 16777216 |
| Grid Domains    |                                             1 x 1 x 1 = 1 |
| LBM Type        |                                    D3Q19 SRT (FP32/FP16C) |
| Memory Usage    |                                 CPU 272 MB, GPU 1x 880 MB |
| Max Alloc Size  |                                                    608 MB |
| Time Steps      |                                                        10 |
| Kin. Viscosity  |                                                1.00000000 |
| Relaxation Time |                                                3.50000000 |
| Reynolds Number |                                                  Re < 148 |
|---------.-------'-----.-----------.-------------------.---------------------|
| MLUPs   | Bandwidth   | Steps/s   | Current Step      | Time Remaining      |
|    4882 |    376 GB/s |       291 |         9990   0% |                  0s |
|---------'-------------'-----------'-------------------'---------------------|
| Info: Peak MLUPs/s = 4883                                                   |
ProjectPhysX commented 10 months ago

Hi @biergaizi,

amazing that it worked! Not the first instance where PoCL beat Nvidia's own runtime. 🖖😛 You can try inserting "\n #pragma OPENCL FP_CONTRACT OFF" here and see if this fixes the bad performance on the Nvidia compiler.

Kind regards, Moritz

biergaizi commented 10 months ago

You can try inserting "\n #pragma OPENCL FP_CONTRACT OFF" here and see if this fixes the bad performance on the Nvidia compiler.

This worked perfectly. It even fixed the performance problem of FP32/FP16S on the Nvidia compiler (PoCL has low performance probably because of a code generation problem). Now the performance in both cases is close to the Nvidia A100! The only exception is FP32/FP16C - the custom floating-point format probably either increased the arithmetic intensity beyond the FP32 non-FMA limit, or hit other restrictions.

The CMP 170HX suddenly has its killer app now.

FP32/FP32

|----------------.------------------------------------------------------------|
| Device ID      | 0                                                          |
| Device Name    | NVIDIA Graphics Device                                     |
| Device Vendor  | NVIDIA Corporation                                         |
| Device Driver  | 535.104.05                                                 |
| OpenCL Version | OpenCL C 1.2                                               |
| Compute Units  | 70 at 1410 MHz (8960 cores, 25.267 TFLOPs/s)               |
| Memory, Cache  | 7961 MB, 1960 KB global / 48 KB local                      |
| Buffer Limits  | 1990 MB global, 64 KB constant                             |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled.                                  |
| Info: Allocating memory. This may take a few seconds.                       |
|-----------------.-----------------------------------------------------------|
| Grid Resolution |                                256 x 256 x 256 = 16777216 |
| Grid Domains    |                                             1 x 1 x 1 = 1 |
| LBM Type        |                                     D3Q19 SRT (FP32/FP32) |
| Memory Usage    |                                CPU 272 MB, GPU 1x 1488 MB |
| Max Alloc Size  |                                                   1216 MB |
| Time Steps      |                                                        10 |
| Kin. Viscosity  |                                                1.00000000 |
| Relaxation Time |                                                3.50000000 |
| Reynolds Number |                                                  Re < 148 |
|---------.-------'-----.-----------.-------------------.---------------------|
| MLUPs   | Bandwidth   | Steps/s   | Current Step      | Time Remaining      |
|    7583 |   1160 GB/s |       452 |         9994  40% |                  0s |
|---------'-------------'-----------'-------------------'---------------------|
| Info: Peak MLUPs/s = 7585                                                   |

FP32/FP16S

|---------.-------'-----.-----------.-------------------.---------------------|
| MLUPs   | Bandwidth   | Steps/s   | Current Step      | Time Remaining      |
|   12386 |    954 GB/s |       738 |         9997  70% |                  0s |
|---------'-------------'-----------'-------------------'---------------------|
| Info: Peak MLUPs/s = 12392                                                  |

FP32/FP16C

|---------.-------'-----.-----------.-------------------.---------------------|
| MLUPs   | Bandwidth   | Steps/s   | Current Step      | Time Remaining      |
|    6853 |    528 GB/s |       408 |         9985  50% |                  0s |
|---------'-------------'-----------'-------------------'---------------------|
| Info: Peak MLUPs/s = 6859                                                   |

Patch

Even better, the minimal change needed for this workaround is just a two-line patch:

diff --git a/src/lbm.cpp b/src/lbm.cpp
index d99202f..28aeb25 100644
--- a/src/lbm.cpp
+++ b/src/lbm.cpp
@@ -286,6 +286,8 @@ void LBM_Domain::enqueue_unvoxelize_mesh_on_device(const Mesh* mesh, const uchar
 }

 string LBM_Domain::device_defines() const { return
+       "\n     #pragma OPENCL FP_CONTRACT OFF"  // prevents implicit FMA optimizations
+       "\n     #define fma(a, b, c) ((a) * (b) + (c))"  // shadows OpenCL explicit function fma()
        "\n     #define def_Nx "+to_string(Nx)+"u"
        "\n     #define def_Ny "+to_string(Ny)+"u"
        "\n     #define def_Nz "+to_string(Nz)+"u"
SphaeroX commented 10 months ago

D3Q19 SRT (FP32/FP32)

|----------------.------------------------------------------------------------|
| Device ID      | 0                                                          |
| Device Name    | NVIDIA GeForce RTX 4080 Laptop GPU                         |
| Device Vendor  | NVIDIA Corporation                                         |
| Device Driver  | 537.42                                                     |
| OpenCL Version | OpenCL C 1.2                                               |
| Compute Units  | 58 at 2280 MHz (7424 cores, 33.853 TFLOPs/s)               |
| Memory, Cache  | 12281 MB, 1624 KB global / 48 KB local                     |
| Buffer Limits  | 3070 MB global, 64 KB constant                             |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled.                                  |
| Info: Allocating memory. This may take a few seconds.                       |
|-----------------.-----------------------------------------------------------|
| Grid Resolution |                                256 x 256 x 256 = 16777216 |
| Grid Domains    |                                             1 x 1 x 1 = 1 |
| LBM Type        |                                     D3Q19 SRT (FP32/FP32) |
| Memory Usage    |                                CPU 272 MB, GPU 1x 1488 MB |
| Max Alloc Size  |                                                   1216 MB |
| Time Steps      |                                                        10 |
| Kin. Viscosity  |                                                1.00000000 |
| Relaxation Time |                                                3.50000000 |
| Reynolds Number |                                                  Re < 148 |
|---------.-------'-----.-----------.-------------------.---------------------|
| MLUPs   | Bandwidth   | Steps/s   | Current Step      | Time Remaining      |
|    2544 |    389 GB/s |       152 |         9992  20% |                  0s |
|---------'-------------'-----------'-------------------'---------------------|
| Info: Peak MLUPs/s = 2577                                                   |

D3Q19 SRT (FP32/FP16S)

| Info: Peak MLUPs/s = 5086

D3Q19 SRT (FP32/FP16C)

| Info: Peak MLUPs/s = 5114

fiftyfathoms commented 9 months ago

Haven't seen results for Nvidia A30.

OpenCL Benchmark

.-----------------------------------------------------------------------------.
|----------------.------------------------------------------------------------|
| Device ID    0 | NVIDIA A30                                                 |
|----------------'------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID      | 0                                                          |
| Device Name    | NVIDIA A30                                                 |
| Device Vendor  | NVIDIA Corporation                                         |
| Device Driver  | 535.129.03                                                 |
| OpenCL Version | OpenCL C 1.2                                               |
| Compute Units  | 56 at 1440 MHz (3584 cores, 10.322 TFLOPs/s)               |
| Memory, Cache  | 24062 MB, 1568 KB global / 48 KB local                     |
| Buffer Limits  | 6015 MB global, 64 KB constant                             |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled.                                  |
| FP64  compute                                         5.053 TFLOPs/s (1/2 ) |
| FP32  compute                                        10.215 TFLOPs/s ( 1x ) |
| FP16  compute                                          not supported        |
| INT64 compute                                         1.990  TIOPs/s (1/4 ) |
| INT32 compute                                        10.285  TIOPs/s ( 1x ) |
| INT16 compute                                         8.158  TIOPs/s (2/3 ) |
| INT8  compute                                         8.316  TIOPs/s (2/3 ) |
| Memory Bandwidth ( coalesced read      )                        806.94 GB/s |
| Memory Bandwidth ( coalesced      write)                        900.40 GB/s |
| Memory Bandwidth (misaligned read      )                        651.78 GB/s |
| Memory Bandwidth (misaligned      write)                         80.94 GB/s |
| PCIe   Bandwidth (send                 )                         19.16 GB/s |
| PCIe   Bandwidth (   receive           )                         13.22 GB/s |
| PCIe   Bandwidth (        bidirectional)            (Gen4 x16)   12.30 GB/s |
|-----------------------------------------------------------------------------|
|-----------------------------------------------------------------------------|
| Done. Press Enter to exit.                                                  |
'-----------------------------------------------------------------------------'

FP32/FP16C

.-----------------------------------------------------------------------------.
|                       ______________   ______________                       |
|                       \   ________  | |  ________   /                       |
|                        \  \       | | | |       /  /                        |
|                         \  \      | | | |      /  /                         |
|                          \  \     | | | |     /  /                          |
|                           \  \_.-"  | |  "-._/  /                           |
|                            \    _.-" _ "-._    /                            |
|                             \.-" _.-" "-._ "-./                             |
|                               .-"  .-"-.  "-.                               |
|                               \  v"     "v  /                               |
|                                \  \     /  /                                |
|                                 \  \   /  /                                 |
|                                  \  \ /  /                                  |
|                                   \  '  /                                   |
|                                    \   /                                    |
|                                     \ /               FluidX3D Version 2.10 |
|                                      '     Copyright (c) Dr. Moritz Lehmann |
|-----------------------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID    0 | NVIDIA A30                                                 |
|----------------'------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID      | 0                                                          |
| Device Name    | NVIDIA A30                                                 |
| Device Vendor  | NVIDIA Corporation                                         |
| Device Driver  | 535.129.03                                                 |
| OpenCL Version | OpenCL C 1.2                                               |
| Compute Units  | 56 at 1440 MHz (3584 cores, 10.322 TFLOPs/s)               |
| Memory, Cache  | 24062 MB, 1568 KB global / 48 KB local                     |
| Buffer Limits  | 6015 MB global, 64 KB constant                             |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled.                                  |
| Info: Allocating memory. This may take a few seconds.                       |
|-----------------.-----------------------------------------------------------|
| Grid Resolution |                                256 x 256 x 256 = 16777216 |
| Grid Domains    |                                             1 x 1 x 1 = 1 |
| LBM Type        |                                    D3Q19 SRT (FP32/FP16C) |
| Memory Usage    |                                 CPU 272 MB, GPU 1x 880 MB |
| Max Alloc Size  |                                                    608 MB |
| Time Steps      |                                                        10 |
| Kin. Viscosity  |                                                1.00000000 |
| Relaxation Time |                                                3.50000000 |
| Reynolds Number |                                                  Re < 148 |
|---------.-------'-----.-----------.-------------------.---------------------|
| MLUPs   | Bandwidth   | Steps/s   | Current Step      | Time Remaining      |
|    5712 |    440 GB/s |       340 |         9994  40% |                  0s |
|---------'-------------'-----------'-------------------'---------------------|
| Info: Peak MLUPs/s = 5726

FP32/FP16S

|----------------.------------------------------------------------------------|
| Device ID    0 | NVIDIA A30                                                 |
|----------------'------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID      | 0                                                          |
| Device Name    | NVIDIA A30                                                 |
| Device Vendor  | NVIDIA Corporation                                         |
| Device Driver  | 535.129.03                                                 |
| OpenCL Version | OpenCL C 1.2                                               |
| Compute Units  | 56 at 1440 MHz (3584 cores, 10.322 TFLOPs/s)               |
| Memory, Cache  | 24062 MB, 1568 KB global / 48 KB local                     |
| Buffer Limits  | 6015 MB global, 64 KB constant                             |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled.                                  |
| Info: Allocating memory. This may take a few seconds.                       |
|-----------------.-----------------------------------------------------------|
| Grid Resolution |                                256 x 256 x 256 = 16777216 |
| Grid Domains    |                                             1 x 1 x 1 = 1 |
| LBM Type        |                                    D3Q19 SRT (FP32/FP16S) |
| Memory Usage    |                                 CPU 272 MB, GPU 1x 880 MB |
| Max Alloc Size  |                                                    608 MB |
| Time Steps      |                                                        10 |
| Kin. Viscosity  |                                                1.00000000 |
| Relaxation Time |                                                3.50000000 |
| Reynolds Number |                                                  Re < 148 |
|---------.-------'-----.-----------.-------------------.---------------------|
| MLUPs   | Bandwidth   | Steps/s   | Current Step      | Time Remaining      |
|    9718 |    748 GB/s |       579 |         9993  30% |                  0s |
|---------'-------------'-----------'-------------------'---------------------|
| Info: Peak MLUPs/s = 9721                                                   |

FP32/FP32

|----------------.------------------------------------------------------------|
| Device ID    0 | NVIDIA A30                                                 |
|----------------'------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID      | 0                                                          |
| Device Name    | NVIDIA A30                                                 |
| Device Vendor  | NVIDIA Corporation                                         |
| Device Driver  | 535.129.03                                                 |
| OpenCL Version | OpenCL C 1.2                                               |
| Compute Units  | 56 at 1440 MHz (3584 cores, 10.322 TFLOPs/s)               |
| Memory, Cache  | 24062 MB, 1568 KB global / 48 KB local                     |
| Buffer Limits  | 6015 MB global, 64 KB constant                             |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled.                                  |
| Info: Allocating memory. This may take a few seconds.                       |
|-----------------.-----------------------------------------------------------|
| Grid Resolution |                                256 x 256 x 256 = 16777216 |
| Grid Domains    |                                             1 x 1 x 1 = 1 |
| LBM Type        |                                     D3Q19 SRT (FP32/FP32) |
| Memory Usage    |                                CPU 272 MB, GPU 1x 1488 MB |
| Max Alloc Size  |                                                   1216 MB |
| Time Steps      |                                                        10 |
| Kin. Viscosity  |                                                1.00000000 |
| Relaxation Time |                                                3.50000000 |
| Reynolds Number |                                                  Re < 148 |
|---------.-------'-----.-----------.-------------------.---------------------|
| MLUPs   | Bandwidth   | Steps/s   | Current Step      | Time Remaining      |
|    5002 |    765 GB/s |       298 |         9997  70% |                  0s |
|---------'-------------'-----------'-------------------'---------------------|
| Info: Peak MLUPs/s = 5004                                                   |

Willian-Zhang commented 9 months ago

Apple M1 Ultra 128G FP32

.-----------------------------------------------------------------------------.
|                       ______________   ______________                       |
|                       \   ________  | |  ________   /                       |
|                        \  \       | | | |       /  /                        |
|                         \  \      | | | |      /  /                         |
|                          \  \     | | | |     /  /                          |
|                           \  \_.-"  | |  "-._/  /                           |
|                            \    _.-" _ "-._    /                            |
|                             \.-" _.-" "-._ "-./                             |
|                               .-"  .-"-.  "-.                               |
|                               \  v"     "v  /                               |
|                                \  \     /  /                                |
|                                 \  \   /  /                                 |
|                                  \  \ /  /                                  |
|                                   \  '  /                                   |
|                                    \   /                                    |
|                                     \ /               FluidX3D Version 2.10 |
|                                      '     Copyright (c) Dr. Moritz Lehmann |
|-----------------------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID    0 | Apple M1 Ultra                                             |
|----------------'------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID      | 0                                                          |
| Device Name    | Apple M1 Ultra                                             |
| Device Vendor  | Apple                                                      |
| Device Driver  | 1.2 1.0                                                    |
| OpenCL Version | OpenCL C 1.2                                               |
| Compute Units  | 64 at 1000 MHz (8192 cores, 16.384 TFLOPs/s)               |
| Memory, Cache  | 98304 MB, 0 KB global / 32 KB local                        |
| Buffer Limits  | 18432 MB global, 1048576 KB constant                       |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled.                                  |
| Info: Allocating memory. This may take a few seconds.                       |
|-----------------.-----------------------------------------------------------|
| Grid Resolution |                                256 x 256 x 256 = 16777216 |
| Grid Domains    |                                             1 x 1 x 1 = 1 |
| LBM Type        |                                     D3Q19 SRT (FP32/FP32) |
| Memory Usage    |                                CPU 272 MB, GPU 1x 1488 MB |
| Max Alloc Size  |                                                   1216 MB |
| Time Steps      |                                                        10 |
| Kin. Viscosity  |                                                1.00000000 |
| Relaxation Time |                                                3.50000000 |
| Reynolds Number |                                                  Re < 148 |
|---------.-------'-----.-----------.-------------------.---------------------|
| MLUPs   | Bandwidth   | Steps/s   | Current Step      | Time Remaining      |
|    4448 |    681 GB/s |       265 |         9987  70% |                  0s |
|---------'-------------'-----------'-------------------'---------------------|
| Info: Peak MLUPs/s = 4519                                                   |

FP16S


|----------------.------------------------------------------------------------|
| Device ID    0 | Apple M1 Ultra                                             |
|----------------'------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID      | 0                                                          |
| Device Name    | Apple M1 Ultra                                             |
| Device Vendor  | Apple                                                      |
| Device Driver  | 1.2 1.0                                                    |
| OpenCL Version | OpenCL C 1.2                                               |
| Compute Units  | 64 at 1000 MHz (8192 cores, 16.384 TFLOPs/s)               |
| Memory, Cache  | 98304 MB, 0 KB global / 32 KB local                        |
| Buffer Limits  | 18432 MB global, 1048576 KB constant                       |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled.                                  |
| Info: Allocating memory. This may take a few seconds.                       |
|-----------------.-----------------------------------------------------------|
| Grid Resolution |                                256 x 256 x 256 = 16777216 |
| Grid Domains    |                                             1 x 1 x 1 = 1 |
| LBM Type        |                                    D3Q19 SRT (FP32/FP16S) |
| Memory Usage    |                                 CPU 272 MB, GPU 1x 880 MB |
| Max Alloc Size  |                                                    608 MB |
| Time Steps      |                                                        10 |
| Kin. Viscosity  |                                                1.00000000 |
| Relaxation Time |                                                3.50000000 |
| Reynolds Number |                                                  Re < 148 |
|---------.-------'-----.-----------.-------------------.---------------------|
| MLUPs   | Bandwidth   | Steps/s   | Current Step      | Time Remaining      |
|    8286 |    638 GB/s |       494 |         9995  50% |                  0s |
|---------'-------------'-----------'-------------------'---------------------|
| Info: Peak MLUPs/s = 8418                                                   |

FP16C

|----------------.------------------------------------------------------------|
| Device ID    0 | Apple M1 Ultra                                             |
|----------------'------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID      | 0                                                          |
| Device Name    | Apple M1 Ultra                                             |
| Device Vendor  | Apple                                                      |
| Device Driver  | 1.2 1.0                                                    |
| OpenCL Version | OpenCL C 1.2                                               |
| Compute Units  | 64 at 1000 MHz (8192 cores, 16.384 TFLOPs/s)               |
| Memory, Cache  | 98304 MB, 0 KB global / 32 KB local                        |
| Buffer Limits  | 18432 MB global, 1048576 KB constant                       |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled.                                  |
| Info: Allocating memory. This may take a few seconds.                       |
|-----------------.-----------------------------------------------------------|
| Grid Resolution |                                256 x 256 x 256 = 16777216 |
| Grid Domains    |                                             1 x 1 x 1 = 1 |
| LBM Type        |                                    D3Q19 SRT (FP32/FP16C) |
| Memory Usage    |                                 CPU 272 MB, GPU 1x 880 MB |
| Max Alloc Size  |                                                    608 MB |
| Time Steps      |                                                        10 |
| Kin. Viscosity  |                                                1.00000000 |
| Relaxation Time |                                                3.50000000 |
| Reynolds Number |                                                  Re < 148 |
|---------.-------'-----.-----------.-------------------.---------------------|
| MLUPs   | Bandwidth   | Steps/s   | Current Step      | Time Remaining      |
|    6794 |    523 GB/s |       405 |         9979  90% |                  0s |
|---------'-------------'-----------'-------------------'---------------------|
| Info: Peak MLUPs/s = 6915                                                   |
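
The columns in these tables are tied together by simple arithmetic, which makes pasted results easy to cross-check. A minimal sketch, assuming 153 bytes of memory traffic per cell update for FP32/FP32 and 77 bytes for the FP16 modes; both figures are inferred here from the bandwidth-to-MLUPs ratio in the logs above, and the function names are illustrative rather than anything from FluidX3D itself:

```python
# Bytes moved per cell update, inferred from the logs (bandwidth / MLUPs). An assumption.
BYTES_PER_UPDATE = {"FP32": 153, "FP16S": 77, "FP16C": 77}

def bandwidth_gb_s(mlups, precision):
    """Implied memory bandwidth in GB/s (1 GB = 1e9 bytes) for a given MLUPs/s value."""
    return mlups * 1e6 * BYTES_PER_UPDATE[precision] / 1e9

def steps_per_second(mlups, cells):
    """Time steps per second for a grid with `cells` lattice points."""
    return mlups * 1e6 / cells

cells = 256 ** 3  # 16777216, the benchmark grid in the logs

print(round(bandwidth_gb_s(4448, "FP32")))   # 681, as in the M1 Ultra FP32 row
print(round(bandwidth_gb_s(8286, "FP16S")))  # 638, as in the M1 Ultra FP16S row
print(round(steps_per_second(4448, cells)))  # 265, as in the M1 Ultra FP32 row
```

If a reported row does not satisfy these relations, the numbers were probably copied from different runs.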

marcc1229 commented 9 months ago

How is everyone running the benchmarks for multi-GPU configurations? I'm playing around with MI25s and not seeing anywhere near the performance the specs suggest I should. I'm wondering if I have a hardware bottleneck or if I missed something in the setup.

ProjectPhysX commented 9 months ago

@marcc1229 use the "2/4/8 GPUs" lines in the benchmark setup, and for memory use a value close to the VRAM capacity of one GPU, like 15800u. For fine-tuning you can also set the resolution directly, for example const uint3 lbm_N = uint3(464u);
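
To get a feeling for which resolution a given memory value corresponds to, here is a rough sketch. It assumes 93 bytes of VRAM per cell for FP32 and 55 bytes for the FP16 modes, which is what the "Memory Usage" lines in this thread work out to for a 256³ grid (1488 MB and 880 MB, with 1 MB = 2^20 bytes). The function name is made up for illustration, the estimate ignores any extra buffers for multi-GPU domain boundaries, and the resolution the benchmark setup actually picks may differ:

```python
MIB = 1024 ** 2  # the logs' "MB" matches 2**20 bytes (1488 MB / 256**3 cells = 93 B)

# Bytes of GPU memory per cell, derived from the "Memory Usage" log lines. An assumption.
BYTES_PER_CELL = {"FP32": 93, "FP16": 55}

def max_cubic_resolution(memory_mb, precision, gpus=1):
    """Largest N such that an N x N x N grid fits into `gpus` devices with `memory_mb` each."""
    budget = memory_mb * MIB * gpus
    per_cell = BYTES_PER_CELL[precision]
    n = round((budget / per_cell) ** (1 / 3))
    # Correct for floating-point rounding around the cube root.
    while n ** 3 * per_cell > budget:
        n -= 1
    while (n + 1) ** 3 * per_cell <= budget:
        n += 1
    return n

print(max_cubic_resolution(1488, "FP32"))   # 256, the default benchmark grid
print(max_cubic_resolution(15800, "FP32"))  # single GPU with 15800 MB budget
print(max_cubic_resolution(15800, "FP16", gpus=2))
```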

The multi-GPU communication has some performance overhead, which shrinks relative to domain compute time the larger the resolution is. The highest possible resolution is the best performing and also the most interesting case for multi-GPU, as at lower resolution a single GPU would be sufficient. But performance at similarly large resolutions should not be too different.
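
The shrinking overhead is a surface-to-volume effect, which a small sketch can illustrate. It assumes a cubic grid split into equal slabs along one axis, with each domain exchanging one layer of N x N cells per neighboring domain; that is a simplification for illustration, not a description of how FluidX3D actually organizes its transfers:

```python
def halo_fraction(n, gpus=2, neighbors=1):
    """Ratio of exchanged boundary cells to computed cells per domain.

    Assumes an n^3 grid cut into `gpus` slabs, each exchanging one layer
    of n*n cells with each of its `neighbors` adjacent domains.
    """
    cells_per_domain = n ** 3 / gpus
    boundary_cells = neighbors * n * n
    return boundary_cells / cells_per_domain

for n in (256, 464, 640):
    print(n, f"{halo_fraction(n):.4f}")
```

For two GPUs the ratio is 2/N, so doubling the resolution halves the relative amount of data that has to cross the PCIe link per step.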

For the single-GPU benchmark the resolution should not matter at all as long as it's sufficiently large for full hardware saturation.

However, the older GCN/Vega GPUs can have vastly different performance for slightly different grid resolutions / workgroup counts. This is the cursed memory bandwidth anomaly, a problem of the hardware architecture. Try a few different large resolutions.

A potential bottleneck is PCIe communication. If you have a server where each GPU is connected by PCIe 3.0 x16 or x8, this should not be an issue. But cheap crypto mining hardware with these USB 3 / PCIe 3.0 x1 connections, for example, is problematic.

marcc1229 commented 9 months ago

image This is what I'm getting with 2 MI25s flashed with the WX9100 BIOS, running at PCIe 3.0 x16. I couldn't let them run all the way through because I don't have proper cooling set up yet. I just wanted to test these before committing to buying more and designing a proper cooling setup. I'm a mechanic by trade and I'm trying to use this to help design an aero/cooling setup for a long-running car project, so my apologies if I end up asking incredibly stupid questions; I'm learning as I go.

Alex-Vasile commented 9 months ago

The small but apparently decently mighty original M1 (2020 MBP).

.-----------------------------------------------------------------------------.
|                       ______________   ______________                       |
|                       \   ________  | |  ________   /                       |
|                        \  \       | | | |       /  /                        |
|                         \  \      | | | |      /  /                         |
|                          \  \     | | | |     /  /                          |
|                           \  \_.-"  | |  "-._/  /                           |
|                            \    _.-" _ "-._    /                            |
|                             \.-" _.-" "-._ "-./                             |
|                               .-"  .-"-.  "-.                               |
|                               \  v"     "v  /                               |
|                                \  \     /  /                                |
|                                 \  \   /  /                                 |
|                                  \  \ /  /                                  |
|                                   \  '  /                                   |
|                                    \   /                                    |
|                                     \ /               FluidX3D Version 2.11 |
|                                      '     Copyright (c) Dr. Moritz Lehmann |
|-----------------------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID    0 | Apple M1                                                   |
|----------------'------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID      | 0                                                          |
| Device Name    | Apple M1                                                   |
| Device Vendor  | Apple                                                      |
| Device Driver  | 1.2 1.0 (macOS)                                            |
| OpenCL Version | OpenCL C 1.2                                               |
| Compute Units  | 8 at 1000 MHz (1024 cores, 2.048 TFLOPs/s)                 |
| Memory, Cache  | 10922 MB, 0 KB global / 32 KB local                        |
| Buffer Limits  | 2048 MB global, 1048576 KB constant                        |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled.                                  |
| Info: Allocating memory. This may take a few seconds.                       |
|-----------------.-----------------------------------------------------------|
| Grid Resolution |                                256 x 256 x 256 = 16777216 |
| Grid Domains    |                                             1 x 1 x 1 = 1 |
| LBM Type        |                                     D3Q19 SRT (FP32/FP32) |
| Memory Usage    |                                CPU 272 MB, GPU 1x 1488 MB |
| Max Alloc Size  |                                                   1216 MB |
| Time Steps      |                                                        10 |
| Kin. Viscosity  |                                                1.00000000 |
| Relaxation Time |                                                3.50000000 |
| Reynolds Number |                                                  Re < 148 |
|---------.-------'-----.-----------.-------------------.---------------------|
| MLUPs   | Bandwidth   | Steps/s   | Current Step      | Time Remaining      |
|     377 |     58 GB/s |        22 |         9998  80% |                  0s |
|---------'-------------'-----------'-------------------'---------------------|
| Info: Peak MLUPs/s = 384                                                    |
.-----------------------------------------------------------------------------.
|                       ______________   ______________                       |
|                       \   ________  | |  ________   /                       |
|                        \  \       | | | |       /  /                        |
|                         \  \      | | | |      /  /                         |
|                          \  \     | | | |     /  /                          |
|                           \  \_.-"  | |  "-._/  /                           |
|                            \    _.-" _ "-._    /                            |
|                             \.-" _.-" "-._ "-./                             |
|                               .-"  .-"-.  "-.                               |
|                               \  v"     "v  /                               |
|                                \  \     /  /                                |
|                                 \  \   /  /                                 |
|                                  \  \ /  /                                  |
|                                   \  '  /                                   |
|                                    \   /                                    |
|                                     \ /               FluidX3D Version 2.11 |
|                                      '     Copyright (c) Dr. Moritz Lehmann |
|-----------------------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID    0 | Apple M1                                                   |
|----------------'------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID      | 0                                                          |
| Device Name    | Apple M1                                                   |
| Device Vendor  | Apple                                                      |
| Device Driver  | 1.2 1.0 (macOS)                                            |
| OpenCL Version | OpenCL C 1.2                                               |
| Compute Units  | 8 at 1000 MHz (1024 cores, 2.048 TFLOPs/s)                 |
| Memory, Cache  | 10922 MB, 0 KB global / 32 KB local                        |
| Buffer Limits  | 2048 MB global, 1048576 KB constant                        |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled.                                  |
| Info: Allocating memory. This may take a few seconds.                       |
|-----------------.-----------------------------------------------------------|
| Grid Resolution |                                256 x 256 x 256 = 16777216 |
| Grid Domains    |                                             1 x 1 x 1 = 1 |
| LBM Type        |                                    D3Q19 SRT (FP32/FP16S) |
| Memory Usage    |                                 CPU 272 MB, GPU 1x 880 MB |
| Max Alloc Size  |                                                    608 MB |
| Time Steps      |                                                        10 |
| Kin. Viscosity  |                                                1.00000000 |
| Relaxation Time |                                                3.50000000 |
| Reynolds Number |                                                  Re < 148 |
|---------.-------'-----.-----------.-------------------.---------------------|
| MLUPs   | Bandwidth   | Steps/s   | Current Step      | Time Remaining      |
|     752 |     58 GB/s |        45 |         9999  90% |                  0s |
|---------'-------------'-----------'-------------------'---------------------|
| Info: Peak MLUPs/s = 758                                                    |
.-----------------------------------------------------------------------------.
|                       ______________   ______________                       |
|                       \   ________  | |  ________   /                       |
|                        \  \       | | | |       /  /                        |
|                         \  \      | | | |      /  /                         |
|                          \  \     | | | |     /  /                          |
|                           \  \_.-"  | |  "-._/  /                           |
|                            \    _.-" _ "-._    /                            |
|                             \.-" _.-" "-._ "-./                             |
|                               .-"  .-"-.  "-.                               |
|                               \  v"     "v  /                               |
|                                \  \     /  /                                |
|                                 \  \   /  /                                 |
|                                  \  \ /  /                                  |
|                                   \  '  /                                   |
|                                    \   /                                    |
|                                     \ /               FluidX3D Version 2.11 |
|                                      '     Copyright (c) Dr. Moritz Lehmann |
|-----------------------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID    0 | Apple M1                                                   |
|----------------'------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID      | 0                                                          |
| Device Name    | Apple M1                                                   |
| Device Vendor  | Apple                                                      |
| Device Driver  | 1.2 1.0 (macOS)                                            |
| OpenCL Version | OpenCL C 1.2                                               |
| Compute Units  | 8 at 1000 MHz (1024 cores, 2.048 TFLOPs/s)                 |
| Memory, Cache  | 10922 MB, 0 KB global / 32 KB local                        |
| Buffer Limits  | 2048 MB global, 1048576 KB constant                        |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled.                                  |
| Info: Allocating memory. This may take a few seconds.                       |
|-----------------.-----------------------------------------------------------|
| Grid Resolution |                                256 x 256 x 256 = 16777216 |
| Grid Domains    |                                             1 x 1 x 1 = 1 |
| LBM Type        |                                    D3Q19 SRT (FP32/FP16C) |
| Memory Usage    |                                 CPU 272 MB, GPU 1x 880 MB |
| Max Alloc Size  |                                                    608 MB |
| Time Steps      |                                                        10 |
| Kin. Viscosity  |                                                1.00000000 |
| Relaxation Time |                                                3.50000000 |
| Reynolds Number |                                                  Re < 148 |
|---------.-------'-----.-----------.-------------------.---------------------|
| MLUPs   | Bandwidth   | Steps/s   | Current Step      | Time Remaining      |
|     755 |     58 GB/s |        45 |         9998  80% |                  0s |
|---------'-------------'-----------'-------------------'---------------------|
| Info: Peak MLUPs/s = 759                                                    |
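
For a bandwidth-bound run, the FP16 modes should be faster than FP32 by roughly the ratio of memory traffic per cell update. A small sketch comparing the peak values posted in this thread against that ratio, assuming 153 bytes (FP32) and 77 bytes (FP16) per update as inferred from the bandwidth columns in the logs; the helper name is illustrative:

```python
IDEAL = 153 / 77  # FP32 vs FP16 traffic per cell update, inferred from the logs

# Peak MLUPs/s copied from the logs in this thread.
peaks = {
    "Apple M1":       {"FP32": 384,  "FP16S": 758,  "FP16C": 759},
    "Apple M1 Ultra": {"FP32": 4519, "FP16S": 8418, "FP16C": 6915},
}

def speedups(result):
    """Speedup of each FP16 mode relative to the FP32 run of the same device."""
    return {mode: result[mode] / result["FP32"] for mode in ("FP16S", "FP16C")}

print(f"ideal: {IDEAL:.2f}x")
for device, result in peaks.items():
    for mode, factor in speedups(result).items():
        print(f"{device} {mode}: {factor:.2f}x")
```

On the M1 both FP16 modes land close to the ideal ratio, while on the M1 Ultra FP16C stays well below it. That may mean the extra conversion arithmetic matters there, but this reading is a guess and not something stated in the thread.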

jbruck commented 8 months ago

image Windows 11 NVIDIA GeForce MX450 MLUPs/s 185

dboswell-marigoldsystems commented 8 months ago

Howdy! Benchmark Results below for the new Nvidia L40S being tested in the Marigold Systems Lab, requested from the /r/Nvidia Subreddit.

FP32-16C FP32-16C

FP32-16S FP32-16S

FP32-FP32 FP32-FP32

FluidX3D Benchmark.docx

Nvidia L40S, Dell PowerEdge R760, Ubuntu Server 22.04.3 LTS, Nvidia 535.129 driver

marigoldsystems.com

ProjectPhysX commented 8 months ago

@dboswell-marigoldsystems thank you!!

Jake1402 commented 8 months ago

RTX3050 image

lslowmotion commented 7 months ago

image Wait. Is this right for 3090 FP32/FP16S? I got over 658k MLUPs/s just by changing uint memory to 24000u. image Also, for 2 3090s I got 167k MLUPs/s

Is it required to keep the memory size at 1488u? The 1488u results look normal to me compared to those in the benchmark sheet. image image

Also, here are the results using FP32/FP32 on 1488u memory image image

ProjectPhysX commented 7 months ago

@lslowmotion for single-GPU, performance is mostly independent of grid size / memory occupation, so use the default 256³ / 1488u MB here. For multi-GPU benchmarking, a larger grid size is a bit faster, because domain communication time becomes smaller relative to domain compute time. Since the OS itself needs a few hundred MB of VRAM, a memory allocation of 24000 MB will fail (without an error message, unfortunately); the kernels don't actually execute and you get unphysically high scores. Use a bit less than the maximum VRAM capacity, like 23500u. Thanks!
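
A quick way to spot such unphysical scores is to convert the reported MLUPs/s into the memory bandwidth it would imply and compare that with the GPU's specified bandwidth. A hedged sketch: the 77 bytes per cell update for FP16 is inferred from the logs in this thread, the 936 GB/s for the RTX 3090 is taken from the vendor spec sheet and is not mentioned in this thread, and the function names and the 10% tolerance are arbitrary choices for illustration:

```python
def implied_bandwidth_gb_s(mlups, bytes_per_update):
    """Memory bandwidth in GB/s that a given MLUPs/s score would require."""
    return mlups * 1e6 * bytes_per_update / 1e9

def is_plausible(mlups, bytes_per_update, spec_bandwidth_gb_s, tolerance=1.1):
    """True if the implied memory traffic stays near or below the specified bandwidth."""
    return implied_bandwidth_gb_s(mlups, bytes_per_update) <= tolerance * spec_bandwidth_gb_s

RTX_3090_GB_S = 936  # assumption: vendor-specified bandwidth, not from this thread

# 658k MLUPs/s was reported above for FP32/FP16S after setting memory to 24000u.
print(round(implied_bandwidth_gb_s(658_000, 77)))   # far beyond any current GPU
print(is_plausible(658_000, 77, RTX_3090_GB_S))     # False
```

GPUs with very large caches can exceed their nominal VRAM bandwidth somewhat at small grid sizes, so a failed check is a hint to look closer rather than proof of a broken run.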

lslowmotion commented 7 months ago

@ProjectPhysX yea with 23000u now it looks more in line with how it should be. Thanks. image

Also to complete the ones above, here are single and dual 3090s in FP32/FP16C to add to the benchmark table. Hope these help! image image

ProjectPhysX commented 7 months ago

Hi @lslowmotion,

today I realized that with an optimization in update v2.11, I accidentally stepped on a bug in Nvidia's OpenCL driver, which caused failure of memory allocation for larger simulations, including your benchmark runs at larger resolution. This is now fixed in the master branch! Large resolutions up to 2x ~23000 MB are now working again also with the FP16 types. Apologies for the trouble!

Kind regards, Moritz

marcc1229 commented 7 months ago

These are MI25s flashed with the WX9100 BIOS, mounted directly to the board. x3dbench

gryoung4727 commented 7 months ago

Results for the ASUS 4070 Ti Super 16GB card, not overclocked.

cmd_pwPtwWGKbE cmd_qIN1aBHeNd cmd_vlamptkJpO

mckirkus commented 6 months ago

RTX 3080 12GB edition - FP16S image

RTX 3080 12GB edition - FP16C image

RTX 3080 12GB edition - FP32 image

SLGY commented 6 months ago

Here's a (technically) multi-GPU result for a Tesla K80 (2-core) GPU. There's a single-core K80 (12GB) result in the benchmarks, but now that we have multi-GPU functionality, here's the 2-core K80 (24GB) result! FP32-FP16C FP32-FP16S FP32-FP32

chconnor commented 6 months ago

|                                     \ /               FluidX3D Version 2.14 |
|                                      '     Copyright (c) Dr. Moritz Lehmann |
|-----------------------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID    0 | NVIDIA GeForce GTX 1060 6GB                                |
|----------------'------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID      | 0                                                          |
| Device Name    | NVIDIA GeForce GTX 1060 6GB                                |
| Device Vendor  | NVIDIA Corporation                                         |
| Device Driver  | 535.161.07 (Linux)                                         |
| OpenCL Version | OpenCL C 1.2                                               |
| Compute Units  | 10 at 1784 MHz (1280 cores, 4.567 TFLOPs/s)                |
| Memory, Cache  | 6064 MB, 480 KB global / 48 KB local                       |
| Buffer Limits  | 1516 MB global, 64 KB constant                             |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled.                                  |
| Info: Allocating memory. This may take a few seconds.                       |
|-----------------.-----------------------------------------------------------|
| Grid Resolution |                                256 x 256 x 256 = 16777216 |
| Grid Domains    |                                             1 x 1 x 1 = 1 |
| LBM Type        |                                     D3Q19 SRT (FP32/FP32) |
| Memory Usage    |                                CPU 272 MB, GPU 1x 1488 MB |
| Max Alloc Size  |                                                   1216 MB |
| Time Steps      |                                                        10 |
| Kin. Viscosity  |                                                1.00000000 |
| Relaxation Time |                                                3.50000000 |
| Reynolds Number |                                                  Re < 148 |
|---------.-------'-----.-----------.-------------------.---------------------|
| MLUPs   | Bandwidth   | Steps/s   | Current Step      | Time Remaining      |
|     995 |    152 GB/s |        59 |         9997  70% |                  0s |
|---------'-------------'-----------'-------------------'---------------------|
| Info: Peak MLUPs/s = 997                                                    |

|                                     \ /               FluidX3D Version 2.14 |
|                                      '     Copyright (c) Dr. Moritz Lehmann |
|-----------------------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID    0 | NVIDIA GeForce GTX 1060 6GB                                |
|----------------'------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID      | 0                                                          |
| Device Name    | NVIDIA GeForce GTX 1060 6GB                                |
| Device Vendor  | NVIDIA Corporation                                         |
| Device Driver  | 535.161.07 (Linux)                                         |
| OpenCL Version | OpenCL C 1.2                                               |
| Compute Units  | 10 at 1784 MHz (1280 cores, 4.567 TFLOPs/s)                |
| Memory, Cache  | 6064 MB, 480 KB global / 48 KB local                       |
| Buffer Limits  | 1516 MB global, 64 KB constant                             |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled.                                  |
| Info: Allocating memory. This may take a few seconds.                       |
|-----------------.-----------------------------------------------------------|
| Grid Resolution |                                256 x 256 x 256 = 16777216 |
| Grid Domains    |                                             1 x 1 x 1 = 1 |
| LBM Type        |                                    D3Q19 SRT (FP32/FP16S) |
| Memory Usage    |                                 CPU 272 MB, GPU 1x 880 MB |
| Max Alloc Size  |                                                    608 MB |
| Time Steps      |                                                        10 |
| Kin. Viscosity  |                                                1.00000000 |
| Relaxation Time |                                                3.50000000 |
| Reynolds Number |                                                  Re < 148 |
|---------.-------'-----.-----------.-------------------.---------------------|
| MLUPs   | Bandwidth   | Steps/s   | Current Step      | Time Remaining      |
|    1924 |    148 GB/s |       115 |         9994  40% |                  0s |
|---------'-------------'-----------'-------------------'---------------------|
| Info: Peak MLUPs/s = 1925                                                   |
|                                     \ /               FluidX3D Version 2.14 |
|                                      '     Copyright (c) Dr. Moritz Lehmann |
|-----------------------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID    0 | NVIDIA GeForce GTX 1060 6GB                                |
|----------------'------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID      | 0                                                          |
| Device Name    | NVIDIA GeForce GTX 1060 6GB                                |
| Device Vendor  | NVIDIA Corporation                                         |
| Device Driver  | 535.161.07 (Linux)                                         |
| OpenCL Version | OpenCL C 1.2                                               |
| Compute Units  | 10 at 1784 MHz (1280 cores, 4.567 TFLOPs/s)                |
| Memory, Cache  | 6064 MB, 480 KB global / 48 KB local                       |
| Buffer Limits  | 1516 MB global, 64 KB constant                             |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled.                                  |
| Info: Allocating memory. This may take a few seconds.                       |
|-----------------.-----------------------------------------------------------|
| Grid Resolution |                                256 x 256 x 256 = 16777216 |
| Grid Domains    |                                             1 x 1 x 1 = 1 |
| LBM Type        |                                    D3Q19 SRT (FP32/FP16C) |
| Memory Usage    |                                 CPU 272 MB, GPU 1x 880 MB |
| Max Alloc Size  |                                                    608 MB |
| Time Steps      |                                                        10 |
| Kin. Viscosity  |                                                1.00000000 |
| Relaxation Time |                                                3.50000000 |
| Reynolds Number |                                                  Re < 148 |
|---------.-------'-----.-----------.-------------------.---------------------|
| MLUPs   | Bandwidth   | Steps/s   | Current Step      | Time Remaining      |
|    1772 |    136 GB/s |       106 |         9994  40% |                  0s |
|---------'-------------'-----------'-------------------'---------------------|
| Info: Peak MLUPs/s = 1785                                                   |

matteocavestri commented 4 months ago

Results on AMD Radeon RX590 8GB (Running on Clover-Mesa OpenCL 1.2)

FP32 FP32

FP16C FP16C

FP16S FP16S

matteocavestri commented 4 months ago

Results on AMD Radeon RX590 8GB (Running on Rusticl-Mesa OpenCL 1.2)

FP32 FP32-rusticl

FP16C FP16C-rusticl

FP16S FP16S-rusticl

So if you want to use an open-source OpenCL implementation (Clover or Rusticl), use Clover until Rusticl gets better.

Clover by default is OpenCL 1.1 conformant, but you can export:

to use OpenCL 1.2

gitcnd commented 3 months ago

RoG Strix Laptop:

|                                     \ /               FluidX3D Version 2.16 |
|                                      '     Copyright (c) Dr. Moritz Lehmann |
|-----------------------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID    0 | NVIDIA GeForce RTX 3080 Ti Laptop GPU                      |
| Device ID    1 | Intel(R) UHD Graphics 770                                  |
|----------------'------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID      | 0                                                          |
| Device Name    | NVIDIA GeForce RTX 3080 Ti Laptop GPU                      |
| Device Vendor  | NVIDIA Corporation                                         |
| Device Driver  | 516.40 (Windows)                                           |
| OpenCL Version | OpenCL C 1.2                                               |
| Compute Units  | 58 at 1590 MHz (7424 cores, 23.608 TFLOPs/s)               |
| Memory, Cache  | 16383 MB, 1624 KB global / 48 KB local                     |
| Buffer Limits  | 4095 MB global, 64 KB constant                             |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled.                                  |
| Info: Allocating memory. This may take a few seconds.                       |
|-----------------.-----------------------------------------------------------|
| Grid Resolution |                                256 x 256 x 256 = 16777216 |
| Grid Domains    |                                             1 x 1 x 1 = 1 |
| LBM Type        |                                     D3Q19 SRT (FP32/FP32) |
| Memory Usage    |                                CPU 272 MB, GPU 1x 1488 MB |
| Max Alloc Size  |                                                   1216 MB |
| Time Steps      |                                                        10 |
| Kin. Viscosity  |                                                1.00000000 |
| Relaxation Time |                                                3.50000000 |
| Reynolds Number |                                                  Re < 148 |
|---------.-------'-----.-----------.-------------------.---------------------|
| MLUPs   | Bandwidth   | Steps/s   | Current Step      | Time Remaining      |
|    2972 |    455 GB/s |       177 |         9992  20% |                  0s |
|---------'-------------'-----------'-------------------'---------------------|
| Info: Peak MLUPs/s = 2985                                                   |

pic_2024-05-20_22 00 23_569

Interesting how my laptop's 3080 Ti beats the RTX 4080 in the other laptops!

ProjectPhysX commented 3 months ago

Hi @gitcnd, thanks a lot! Can you please add the FP16S and FP16C benchmarks too? Almost all RTX 40 series GPUs have severely reduced memory bus width and memory bandwidth as compared to their RTX 30 predecessors, making them slower in compute applications.
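To illustrate why memory bandwidth matters so much here, the MLUPs and Bandwidth columns in the tables above can be related by a fixed number of bytes moved per cell update. The sketch below is only an illustration: the constants 153 bytes (FP32) and 77 bytes (FP16S/FP16C) are an assumption inferred from the rows posted in this thread, and `bandwidth_gbs` is a hypothetical helper, not part of FluidX3D.

```python
# Bytes moved per cell update. ASSUMPTION: inferred from the MLUPs and
# Bandwidth columns posted in this thread, not read from the FluidX3D source.
BYTES_PER_CELL = {"FP32": 153, "FP16S": 77, "FP16C": 77}


def bandwidth_gbs(mlups, mode):
    """Effective memory bandwidth in GB/s for a given MLUPs/s figure."""
    # MLUPs/s * bytes/cell gives MB/s; divide by 1000 for GB/s.
    return mlups * BYTES_PER_CELL[mode] / 1000.0


# (mode, MLUPs/s, reported GB/s) rows copied from the tables above:
# RTX 3080 Ti Laptop FP32, GTX 1060 FP16S, GTX 1060 FP16C.
rows = [("FP32", 2972, 455), ("FP16S", 1924, 148), ("FP16C", 1772, 136)]

for mode, mlups, reported in rows:
    computed = round(bandwidth_gbs(mlups, mode))
    print(f"{mode}: computed {computed} GB/s, reported {reported} GB/s")
```

Under that assumption, throughput scales almost directly with the bandwidth a GPU can sustain, so a narrower memory bus translates into fewer MLUPs/s.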

gitcnd commented 3 months ago

Sorry about that - here they are:

|                                     \ /               FluidX3D Version 2.16 |
|                                      '     Copyright (c) Dr. Moritz Lehmann |
|-----------------------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID    0 | NVIDIA GeForce RTX 3080 Ti Laptop GPU                      |
| Device ID    1 | Intel(R) UHD Graphics 770                                  |
|----------------'------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID      | 0                                                          |
| Device Name    | NVIDIA GeForce RTX 3080 Ti Laptop GPU                      |
| Device Vendor  | NVIDIA Corporation                                         |
| Device Driver  | 516.40 (Windows)                                           |
| OpenCL Version | OpenCL C 1.2                                               |
| Compute Units  | 58 at 1590 MHz (7424 cores, 23.608 TFLOPs/s)               |
| Memory, Cache  | 16383 MB, 1624 KB global / 48 KB local                     |
| Buffer Limits  | 4095 MB global, 64 KB constant                             |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled.                                  |
| Info: Allocating memory. This may take a few seconds.                       |
|-----------------.-----------------------------------------------------------|
| Grid Resolution |                                256 x 256 x 256 = 16777216 |
| Grid Domains    |                                             1 x 1 x 1 = 1 |
| LBM Type        |                                    D3Q19 SRT (FP32/FP16S) |
| Memory Usage    |                                 CPU 272 MB, GPU 1x 880 MB |
| Max Alloc Size  |                                                    608 MB |
| Time Steps      |                                                        10 |
| Kin. Viscosity  |                                                1.00000000 |
| Relaxation Time |                                                3.50000000 |
| Reynolds Number |                                                  Re < 148 |
|---------.-------'-----.-----------.-------------------.---------------------|
| MLUPs   | Bandwidth   | Steps/s   | Current Step      | Time Remaining      |
|    5832 |    449 GB/s |       348 |         9993  30% |                  0s |
|---------'-------------'-----------'-------------------'---------------------|
| Info: Peak MLUPs/s = 5908                                                   |

|                                     \ /               FluidX3D Version 2.16 |
|                                      '     Copyright (c) Dr. Moritz Lehmann |
|-----------------------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID    0 | NVIDIA GeForce RTX 3080 Ti Laptop GPU                      |
| Device ID    1 | Intel(R) UHD Graphics 770                                  |
|----------------'------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID      | 0                                                          |
| Device Name    | NVIDIA GeForce RTX 3080 Ti Laptop GPU                      |
| Device Vendor  | NVIDIA Corporation                                         |
| Device Driver  | 516.40 (Windows)                                           |
| OpenCL Version | OpenCL C 1.2                                               |
| Compute Units  | 58 at 1590 MHz (7424 cores, 23.608 TFLOPs/s)               |
| Memory, Cache  | 16383 MB, 1624 KB global / 48 KB local                     |
| Buffer Limits  | 4095 MB global, 64 KB constant                             |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled.                                  |
| Info: Allocating memory. This may take a few seconds.                       |
|-----------------.-----------------------------------------------------------|
| Grid Resolution |                                256 x 256 x 256 = 16777216 |
| Grid Domains    |                                             1 x 1 x 1 = 1 |
| LBM Type        |                                    D3Q19 SRT (FP32/FP16C) |
| Memory Usage    |                                 CPU 272 MB, GPU 1x 880 MB |
| Max Alloc Size  |                                                    608 MB |
| Time Steps      |                                                        10 |
| Kin. Viscosity  |                                                1.00000000 |
| Relaxation Time |                                                3.50000000 |
| Reynolds Number |                                                  Re < 148 |
|---------.-------'-----.-----------.-------------------.---------------------|
| MLUPs   | Bandwidth   | Steps/s   | Current Step      | Time Remaining      |
|    5759 |    443 GB/s |       343 |         9983  30% |                  0s |
|---------'-------------'-----------'-------------------'---------------------|
| Info: Peak MLUPs/s = 5780                                                   |
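For anyone who wants to compare the three runs without reading the tables by eye, here is a minimal sketch that pulls the storage mode and the peak value out of pasted output and prints the speedup over FP32. The function name `parse_peaks` is hypothetical, and the regular expressions assume the line layout shown in the output above.

```python
import re

# Excerpts of the three RTX 3080 Ti Laptop runs posted above; only the
# lines the parser needs are kept.
LOG = """
| LBM Type        |                                     D3Q19 SRT (FP32/FP32) |
| Info: Peak MLUPs/s = 2985                                                   |
| LBM Type        |                                    D3Q19 SRT (FP32/FP16S) |
| Info: Peak MLUPs/s = 5908                                                   |
| LBM Type        |                                    D3Q19 SRT (FP32/FP16C) |
| Info: Peak MLUPs/s = 5780                                                   |
"""


def parse_peaks(text):
    """Map each storage mode (FP32, FP16S, FP16C) to its peak MLUPs/s.

    Assumes every run prints its 'LBM Type' line before its 'Peak' line,
    so the two lists pair up in order.
    """
    modes = re.findall(r"LBM Type\s*\|.*\(FP32/(\w+)\)", text)
    peaks = re.findall(r"Peak MLUPs/s = (\d+)", text)
    return dict(zip(modes, map(int, peaks)))


peaks = parse_peaks(LOG)
for mode in ("FP16S", "FP16C"):
    speedup = peaks[mode] / peaks["FP32"]
    print(f"{mode}: {speedup:.2f}x the FP32 peak")
```

Both FP16 modes come out at a little under 2x the FP32 peak, which is consistent with the benchmark being limited by memory bandwidth rather than by arithmetic.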

gitcnd commented 3 months ago

And just for giggles... (the slowest benchmark here so far :-)

|                                     \ /               FluidX3D Version 2.16 |
|                                      '     Copyright (c) Dr. Moritz Lehmann |
|-----------------------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID    0 | NVIDIA GeForce RTX 3080 Ti Laptop GPU                      |
| Device ID    1 | Intel(R) UHD Graphics 770                                  |
|----------------'------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID      | 1                                                          |
| Device Name    | Intel(R) UHD Graphics 770                                  |
| Device Vendor  | Intel(R) Corporation                                       |
| Device Driver  | 31.0.101.3962 (Windows)                                    |
| OpenCL Version | OpenCL C 1.2                                               |
| Compute Units  | 32 at 1550 MHz (256 cores, 0.794 TFLOPs/s)                 |
| Memory, Cache  | 12955 MB, 1920 KB global / 64 KB local                     |
| Buffer Limits  | 4095 MB global, 4194296 KB constant                        |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled.                                  |
| Info: Allocating memory. This may take a few seconds.                       |
|-----------------.-----------------------------------------------------------|
| Grid Resolution |                                256 x 256 x 256 = 16777216 |
| Grid Domains    |                                             1 x 1 x 1 = 1 |
| LBM Type        |                                    D3Q19 SRT (FP32/FP16C) |
| Memory Usage    |                                 CPU 272 MB, GPU 1x 880 MB |
| Max Alloc Size  |                                                    608 MB |
| Time Steps      |                                                        10 |
| Kin. Viscosity  |                                                1.00000000 |
| Relaxation Time |                                                3.50000000 |
| Reynolds Number |                                                  Re < 148 |
|---------.-------'-----.-----------.-------------------.---------------------|
| MLUPs   | Bandwidth   | Steps/s   | Current Step      | Time Remaining      |
|     243 |     19 GB/s |        14 |         9999  90% |                  0s |
|---------'-------------'-----------'-------------------'---------------------|
| Info: Peak MLUPs/s = 246                                                    |

C:\Users\cnd\Downloads\FluidX3D>bin\FluidX3D.exe -h
Lattice Boltzmann CFD software by Dr. Moritz Lehmann
Usage:
  bin\FluidX3D.exe [OPTION...]

  -h, --help            Print help
  -x arg                X proportion factor (default: 1.0)
  -y arg                Y proportion factor (default: 1.0)
  -z arg                Z proportion factor (default: 1.0)
  -r, --resolution arg  Resolution (default: 4096)
      --re arg          Reynolds number (default: 100000.0)
  -u arg                Velocity (default: 0.1)
  -t, --time arg        Time (default: 10000)
      --scale arg       Scale (default: 0.9)
  -f, --file arg        Filename (default: input.stl)
  -a, --aoa arg         Angle of attack (default: -5.0)
      --camx arg        Camera X (default: 19.0)
      --camy arg        Camera Y (default: 19.1)
      --camz arg        Camera Z (default: 19.2)
      --camzoom arg     Camera Zoom (default: 1.0)
      --camrx arg       Camera Rotation X (default: 33.0)
      --camry arg       Camera Rotation Y (default: 42.0)
      --camfov arg      Camera Field of View (default: 68.0)
  -s, --secs arg        Seconds (default: 10.0)
  -w, --window          Enable window instead of fullscreen mode
      --wait            Wait for keypress befor ending
      --pause           Do not auto-start the simulation
  -d, --display arg     Display (default: 0,1)

biergaizi commented 3 months ago

@gitcnd Are both DIMM slots on the laptop populated for the Intel iGPU benchmark? If not, the results would be even slower... :smile:

gitcnd commented 3 months ago

Yes - everything is populated and replaced for max performance (including special low-latency RAM: I replaced the originals).

RoG Benchmark 2022-08-25

This was the fastest laptop in the world when I finished upgrading it :-)

GiyuuTH commented 3 months ago

RTX6000ADA // Without-ECC

GPUFP16C GPUFP16S GPUFP32

and Threadripper pro 7995WX// Not-OC

CPUFP16C CPUFP16S CPUFP32

roktmansean commented 2 months ago

Ryzen 7 7800X3D, FP16S pic0

Ryzen 7 7800X3D, FP16C image

Ryzen 7 7800X3D, FP32 image

ProjectPhysX commented 2 months ago

Hi @roktmanskip, thanks a lot! That's the AMD Radeon Graphics iGPU. What memory speed are you running, and is it 2x 8GB dual channel?

Can you please test the CPU itself as well? I'm curious how it performs. For this, install the Intel CPU Runtime for OpenCL, and then start the executables from within CMD with the device ID:

Thanks!

roktmansean commented 2 months ago

2x 16 GB, 6400 MHz
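As a rough sanity check on why the memory configuration matters for iGPU and CPU runs, the theoretical DRAM bandwidth can be estimated from the transfer rate and the channel count. This is a sketch under stated assumptions: the two modules run in dual-channel mode, "6400 MHz" means 6400 MT/s, each channel is 64 bits (8 bytes) wide, and a cell update moves 153 bytes in FP32 or 77 bytes in the FP16 modes. Both function names are hypothetical.

```python
def ddr_bandwidth_gbs(transfers_mt_s, channels=2, bus_bytes=8):
    """Theoretical peak DRAM bandwidth in GB/s.

    ASSUMPTION: each channel is 64 bits (8 bytes) wide.
    """
    return transfers_mt_s * channels * bus_bytes / 1000.0


def mlups_ceiling(bandwidth_gbs, bytes_per_cell):
    """Upper bound on MLUPs/s if every byte of bandwidth were usable."""
    return bandwidth_gbs * 1000.0 / bytes_per_cell


dual = ddr_bandwidth_gbs(6400)                # both channels populated
single = ddr_bandwidth_gbs(6400, channels=1)  # one module only

print(f"dual channel:   {dual:.1f} GB/s")
print(f"single channel: {single:.1f} GB/s")
# ASSUMPTION: 153 bytes per cell update (FP32), 77 bytes (FP16 modes).
print(f"FP32 ceiling:   {mlups_ceiling(dual, 153):.0f} MLUPs/s")
print(f"FP16 ceiling:   {mlups_ceiling(dual, 77):.0f} MLUPs/s")
```

Measured results will land below these ceilings, since real memory systems never reach their theoretical peak. A single populated channel halves the bound.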

image

image

image