ProjectPhysX / FluidX3D

The fastest and most memory efficient lattice Boltzmann CFD software, running on all GPUs and CPUs via OpenCL. Free for non-commercial use.
https://youtube.com/@ProjectPhysX

Report your benchmark results here! #8

Open ProjectPhysX opened 2 years ago

ProjectPhysX commented 2 years ago

You are welcome to report your benchmark results for the FP32/FP16S/FP16C accuracy levels here. Numbers for AMD GPUs (GCN/RDNA/RDNA2 architectures) are especially welcome. Thank you!

ibonito1 commented 2 years ago

I'd love to add to the benchmarks list. I've got two questions:

  1. I want to benchmark a dual Epyc system (so specifically the CPUs actually). How would I do that (under Windows, but Linux would also be fine), if I have a GPU installed? It always automatically detects the GPU when running the benchmark “releases”.
  2. How to post the benchmarks? Just copy the console output in here?

Cheers!

ProjectPhysX commented 2 years ago

Hi ibonito1,

OpenCL support on EPYC CPUs is a bit difficult as these are not officially supported by AMD. Being x86-64, they should work with the Intel OpenCL CPU Runtime though, or alternatively with POCL. Fingers crossed! To run on a specific device, in the console run ./FluidX3D.exe 2 (on Linux) or FluidX3D.exe 2 (on Windows), to select device with ID 2 for example. You can just copy the console output here.

Regards, Moritz

C-Dub2022 commented 2 years ago

AMD Radeon RX 580: image

ProjectPhysX commented 2 years ago

C-Dub2022 thank you very much for the RX 580 benchmark! If you can post the FP16S and FP16C benchmarks as well, I'll add them to the readme!

C-Dub2022 commented 2 years ago

Hopefully this is helpful. Let me know if there is anything else I can do.

image image

MarcoAurelioFerrari commented 2 years ago

RTX 3060 12GB - v1.1

FP32-FP16C [screenshot]

FP32-FP16S [screenshot]

FP32-FP32 [screenshot]

ProjectPhysX commented 2 years ago

MarcoAurelioFerrari thank you!

dongwang22 commented 2 years ago

Could you please tell me how to open the visual interface of the flow domain described in the readme file? It says that pressing 2 turns on the velocity field, but that does not work in the benchmark case. How can I generate pictures like the ones you present on Twitter? [benchmark screenshot]

ProjectPhysX commented 2 years ago

Hi dongwang22,

thanks for the benchmark! For the visual interface, uncomment #define WINDOWS_GRAPHICS and comment #define BENCHMARK in src/defines.hpp, and uncomment for example the Taylor-Green setup in src/setup.cpp. Then compile and you should see the graphical interface where you can toggle rendering modes with keys 1/2/3/4. To generate videos, see the other setups: basically make a C++ loop and repeatedly do some LBM time steps and render images with the corresponding methods of the LBM class.

Regards, Moritz

fkay1 commented 2 years ago

AMD 5700 XT

| Device ID 0 | gfx1010:xnack- |
| Device ID | 0 |
| Device Name | gfx1010:xnack- |
| Device Vendor | Advanced Micro Devices, Inc. |
| Device Driver | 3444.0 (PAL,LC) |
| OpenCL Version | OpenCL C 2.0 |
| Compute Units | 20 at 1905 MHz (2560 cores, 9.754 TFLOPs/s) |
| Memory, Cache | 8176 MB, 16 KB global / 64 KB local |
| Buffer Limits | 6949 MB global, 7116390 KB constant |
| Info: OpenCL C code successfully compiled. |
| Grid Resolution | 256 x 256 x 256 = 16777216 |
| LBM Type | D3Q19 SRT (FP32/FP32) |
| Memory Usage | CPU 272 MB, GPU 1488 MB |
| Max Alloc Size | 1216 MB |
| Time Steps | 10 |
| Kin. Viscosity | 1.00000000 |
| Relaxation Time | 3.50000000 |
| Reynolds Number | Re < 148 |
| MLUPs | Bandwidth | Steps/s | Current Step | Time Remaining |
| 1366 | 209 GB/s | 81 | 9996 60% | 0s |
| Info: Peak MLUPs/s = 1368 |

| Device ID 0 | gfx1010:xnack- |
| Device ID | 0 |
| Device Name | gfx1010:xnack- |
| Device Vendor | Advanced Micro Devices, Inc. |
| Device Driver | 3444.0 (PAL,LC) |
| OpenCL Version | OpenCL C 2.0 |
| Compute Units | 20 at 1905 MHz (2560 cores, 9.754 TFLOPs/s) |
| Memory, Cache | 8176 MB, 16 KB global / 64 KB local |
| Buffer Limits | 6949 MB global, 7116390 KB constant |
| Info: OpenCL C code successfully compiled. |
| Grid Resolution | 256 x 256 x 256 = 16777216 |
| LBM Type | D3Q19 SRT (FP32/FP16S) |
| Memory Usage | CPU 272 MB, GPU 880 MB |
| Max Alloc Size | 608 MB |
| Time Steps | 10 |
| Kin. Viscosity | 1.00000000 |
| Relaxation Time | 3.50000000 |
| Reynolds Number | Re < 148 |
| MLUPs | Bandwidth | Steps/s | Current Step | Time Remaining |
| 3253 | 250 GB/s | 194 | 9988 80% | 0s |
| Info: Peak MLUPs/s = 3253 |

| Device ID 0 | gfx1010:xnack- |
| Device ID | 0 |
| Device Name | gfx1010:xnack- |
| Device Vendor | Advanced Micro Devices, Inc. |
| Device Driver | 3444.0 (PAL,LC) |
| OpenCL Version | OpenCL C 2.0 |
| Compute Units | 20 at 1905 MHz (2560 cores, 9.754 TFLOPs/s) |
| Memory, Cache | 8176 MB, 16 KB global / 64 KB local |
| Buffer Limits | 6949 MB global, 7116390 KB constant |
| Info: OpenCL C code successfully compiled. |
| Grid Resolution | 256 x 256 x 256 = 16777216 |
| LBM Type | D3Q19 SRT (FP32/FP16C) |
| Memory Usage | CPU 272 MB, GPU 880 MB |
| Max Alloc Size | 608 MB |
| Time Steps | 10 |
| Kin. Viscosity | 1.00000000 |
| Relaxation Time | 3.50000000 |
| Reynolds Number | Re < 148 |
| MLUPs | Bandwidth | Steps/s | Current Step | Time Remaining |
| 3044 | 234 GB/s | 181 | 9992 20% | 0s |
| Info: Peak MLUPs/s = 3049 |
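
The MLUPs and Bandwidth columns in the outputs above appear to be linked by a fixed number of bytes moved per lattice update. A minimal sketch of that relation, assuming (inferred from the posted numbers, not taken from the FluidX3D source) that each D3Q19 update loads and stores all 19 density distribution functions (DDFs) and reads one flag byte. The function names are hypothetical.

```python
# Assumption: one lattice update loads and stores all 19 DDFs of D3Q19
# and reads 1 flag byte. Inferred from the posted tables, not from the source code.
Q = 19  # velocity directions in D3Q19


def bytes_per_update(ddf_bytes: int) -> int:
    """Bytes moved per lattice update for a given DDF storage size in bytes."""
    return 2 * Q * ddf_bytes + 1  # load + store of every DDF, plus 1 flag byte


def bandwidth_gb_s(mlups: float, ddf_bytes: int) -> float:
    """Memory bandwidth in GB/s implied by a MLUPs/s reading."""
    return mlups * 1e6 * bytes_per_update(ddf_bytes) / 1e9


# RX 5700 XT readings from the outputs above
for label, mlups, ddf_bytes in [
    ("FP32/FP32", 1366, 4),
    ("FP32/FP16S", 3253, 2),
    ("FP32/FP16C", 3044, 2),
]:
    print(f"{label}: {bandwidth_gb_s(mlups, ddf_bytes):.0f} GB/s")
```

Under this assumption the sketch gives 153 bytes per update for FP32 and 77 bytes for FP16, which matches the 209, 250 and 234 GB/s reported above after rounding.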

funlennysub commented 2 years ago

FP32/FP16C FluidX3D-Benchmark-FP32-FP16C-Windows_GvHd6N7oB6

FP32/FP16S FluidX3D-Benchmark-FP32-FP16S-Windows_90ejyVLfVG

FP32/FP32 FluidX3D-Benchmark-FP32-FP32-Windows_W9hOfLroLA

nicandris commented 2 years ago

RTX 2080 SUPER image image image

gittigittibangbang commented 2 years ago

I tried a 6900XT, but the score is lower than anticipated. The max bandwidth seems to be limited to 300GB/s, although GPUZ says it's connected via PCIe 4.0 16x and should top out at 512GB/s. The GPU clock is at 2540MHz and the memory clock at 2000MHz. GPU and memory controller loads are at 100%.

image image image

With the 3D Taylor-Green model and FP32/FP16S, the MLUPs/s and the bandwidth go through the roof. I'll try some other models, too. FP32/FP32 goes up to 2400 MLUPs/s and 370GB/s, with FP32/FP16C it's 9000 MLUPs/s and 700GB/s. image

ProjectPhysX commented 2 years ago

Hi gittigittibangbang, thanks for the benchmarks! Efficiency is ~60% which is typical for the AMD GPUs. Performance is limited by VRAM bandwidth only, and the RX 6800 would presumably perform exactly the same. The benchmark setup is a 256³ box, that fills 1.5GB (FP32) or 0.9GB (FP16) of VRAM. The large infinity cache (128MB) is only an insignificant fraction of that so does not significantly boost performance. With a smaller 128³ box however, which only fills 186MB (FP32) or 76MB (FP16), almost the entire grid fits in the cache and effective bandwidth is much larger.
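
The cache argument above can be put into rough numbers. A minimal sketch using only the sizes quoted in the comment (128 MB cache; 1.5 GB, 0.9 GB, 186 MB and 76 MB grids). It assumes, as a crude upper bound, that the cache can hold any part of the grid up to its capacity. The function name is hypothetical.

```python
INFINITY_CACHE_MB = 128  # cache size quoted in the comment above


def cached_fraction(grid_mb: float) -> float:
    """Upper bound on the fraction of the grid that fits in the cache."""
    return min(1.0, INFINITY_CACHE_MB / grid_mb)


# Grid sizes as quoted above (1.5 GB and 0.9 GB taken as 1500 MB and 900 MB)
for label, grid_mb in [
    ("256^3 FP32", 1500),
    ("256^3 FP16", 900),
    ("128^3 FP32", 186),
    ("128^3 FP16", 76),
]:
    print(f"{label}: at most {cached_fraction(grid_mb):.0%} of the grid fits in cache")
```

For the 256³ box less than a sixth of the grid can be cached, while for the 128³ box most or all of it can, which is in line with the much higher effective bandwidth seen with the smaller grid.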

HighDoping commented 2 years ago

Vega 8 in R7 4750G

| Device ID 0 | gfx90c |
| Device ID 1 | gfx90c |
| Device ID 2 | gfx90c |
| Device ID | 0 |
| Device Name | gfx90c |
| Device Vendor | Advanced Micro Devices, Inc. |
| Device Driver | 3380.6 (PAL,HSAIL) |
| OpenCL Version | OpenCL C 2.0 |
| Compute Units | 8 at 2100 MHz (512 cores, 2.150 TFLOPs/s) |
| Memory, Cache | 26899 MB, 16 KB global / 32 KB local |
| Buffer Limits | 19382 MB global, 19847731 KB constant |
| Info: OpenCL C code successfully compiled. |
| Grid Resolution | 256 x 256 x 256 = 16777216 |
| LBM Type | D3Q19 SRT (FP32/FP32) |
| Memory Usage | CPU 272 MB, GPU 1488 MB |
| Max Alloc Size | 1216 MB |
| Time Steps | 10 |
| Kin. Viscosity | 1.00000000 |
| Relaxation Time | 3.50000000 |
| Reynolds Number | Re < 148 |
| MLUPs | Bandwidth | Steps/s | Current Step | Time Remaining |
| 246 | 38 GB/s | 15 | 9999 90% | 0s |
| Info: Peak MLUPs/s = 263 |

| Device ID 0 | gfx90c |
| Device ID 1 | gfx90c |
| Device ID 2 | gfx90c |
| Device ID | 0 |
| Device Name | gfx90c |
| Device Vendor | Advanced Micro Devices, Inc. |
| Device Driver | 3380.6 (PAL,HSAIL) |
| OpenCL Version | OpenCL C 2.0 |
| Compute Units | 8 at 2100 MHz (512 cores, 2.150 TFLOPs/s) |
| Memory, Cache | 26899 MB, 16 KB global / 32 KB local |
| Buffer Limits | 19382 MB global, 19847731 KB constant |
| Info: OpenCL C code successfully compiled. |
| Grid Resolution | 256 x 256 x 256 = 16777216 |
| LBM Type | D3Q19 SRT (FP32/FP16S) |
| Memory Usage | CPU 272 MB, GPU 880 MB |
| Max Alloc Size | 608 MB |
| Time Steps | 10 |
| Kin. Viscosity | 1.00000000 |
| Relaxation Time | 3.50000000 |
| Reynolds Number | Re < 148 |
| MLUPs | Bandwidth | Steps/s | Current Step | Time Remaining |
| 505 | 39 GB/s | 30 | 9998 80% | 0s |
| Info: Peak MLUPs/s = 511 |

| Device ID 0 | gfx90c |
| Device ID 1 | gfx90c |
| Device ID 2 | gfx90c |
| Device ID | 0 |
| Device Name | gfx90c |
| Device Vendor | Advanced Micro Devices, Inc. |
| Device Driver | 3380.6 (PAL,HSAIL) |
| OpenCL Version | OpenCL C 2.0 |
| Compute Units | 8 at 2100 MHz (512 cores, 2.150 TFLOPs/s) |
| Memory, Cache | 26899 MB, 16 KB global / 32 KB local |
| Buffer Limits | 19382 MB global, 19847731 KB constant |
| Info: OpenCL C code successfully compiled. |
| Grid Resolution | 256 x 256 x 256 = 16777216 |
| LBM Type | D3Q19 SRT (FP32/FP16C) |
| Memory Usage | CPU 272 MB, GPU 880 MB |
| Max Alloc Size | 608 MB |
| Time Steps | 10 |
| Kin. Viscosity | 1.00000000 |
| Relaxation Time | 3.50000000 |
| Reynolds Number | Re < 148 |
| MLUPs | Bandwidth | Steps/s | Current Step | Time Remaining |
| 466 | 36 GB/s | 28 | 9998 80% | 0s |
| Info: Peak MLUPs/s = 501 |
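
The Memory Usage and Max Alloc Size rows in the outputs above are consistent with a simple per-lattice-point byte count. A minimal sketch, assuming (inferred from the posted numbers rather than from the FluidX3D source) that the GPU stores 19 DDFs, a 4-byte density, a 3 x 4-byte velocity and a 1-byte flag per lattice point, that the CPU stores everything except the DDFs, and that "MB" in the output means 2^20 bytes. The helper names are hypothetical.

```python
CELLS = 256 ** 3     # benchmark grid: 256 x 256 x 256 = 16777216 lattice points
Q = 19               # DDFs per lattice point in D3Q19
MACRO_BYTES = 4 + 3 * 4 + 1  # assumed: density + velocity + flag = 17 bytes
MIB = 2 ** 20


def gpu_mb(ddf_bytes: int) -> int:
    """Assumed GPU footprint: DDFs plus density, velocity and flags."""
    return CELLS * (Q * ddf_bytes + MACRO_BYTES) // MIB


def cpu_mb() -> int:
    """Assumed CPU footprint: density, velocity and flags only."""
    return CELLS * MACRO_BYTES // MIB


def max_alloc_mb(ddf_bytes: int) -> int:
    """Assumed largest single buffer: the DDF array."""
    return CELLS * Q * ddf_bytes // MIB


print("FP32:", cpu_mb(), gpu_mb(4), max_alloc_mb(4))
print("FP16:", cpu_mb(), gpu_mb(2), max_alloc_mb(2))
```

With these assumptions the sketch reproduces the reported 272 MB (CPU), 1488 MB and 880 MB (GPU), and 1216 MB and 608 MB (Max Alloc Size).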

edmond1992 commented 2 years ago

Is it possible to add a ready-to-run benchmark for macOS so we can get more results on Mac? The test is bandwidth-limited, and Apple silicon should be good at this. Not to mention the relatively cheap 64GB+ of VRAM, since it shares the same main memory.

edmond1992 commented 2 years ago

RTX3060 Laptop GPU with 12700H on ASUS ROG M16 Turbo mode (120W GPU TDP) and external laptop fan

PS C:\Software\FluidX3D> .\FluidX3D-Benchmark-FP32-FP32-Windows.exe
[ASCII-art banner, Moritz Lehmann]
| Device ID 0 | NVIDIA GeForce RTX 3060 Laptop GPU |
| Device ID 1 | Intel(R) Iris(R) Xe Graphics |
| Device ID | 0 |
| Device Name | NVIDIA GeForce RTX 3060 Laptop GPU |
| Device Vendor | NVIDIA Corporation |
| Device Driver | 512.78 |
| OpenCL Version | OpenCL C 1.2 |
| Compute Units | 30 at 1425 MHz (3840 cores, 10.944 TFLOPs/s) |
| Memory, Cache | 6143 MB, 840 KB global / 48 KB local |
| Buffer Limits | 1535 MB global, 64 KB constant |
| Info: OpenCL C code successfully compiled. |
| Grid Resolution | 256 x 256 x 256 = 16777216 |
| LBM Type | D3Q19 SRT (FP32/FP32) |
| Memory Usage | CPU 272 MB, GPU 1488 MB |
| Max Alloc Size | 1216 MB |
| Time Steps | 10 |
| Kin. Viscosity | 1.00000000 |
| Relaxation Time | 3.50000000 |
| Reynolds Number | Re < 148 |
| MLUPs | Bandwidth | Steps/s | Current Step | Time Remaining |
| 2014 | 308 GB/s | 120 | 9999 90% | 0s |
| Info: Peak MLUPs/s = 2019 |

PS C:\Software\FluidX3D> .\FluidX3D-Benchmark-FP32-FP16C-Windows.exe
[ASCII-art banner, Moritz Lehmann]
| Device ID 0 | NVIDIA GeForce RTX 3060 Laptop GPU |
| Device ID 1 | Intel(R) Iris(R) Xe Graphics |
| Device ID | 0 |
| Device Name | NVIDIA GeForce RTX 3060 Laptop GPU |
| Device Vendor | NVIDIA Corporation |
| Device Driver | 512.78 |
| OpenCL Version | OpenCL C 1.2 |
| Compute Units | 30 at 1425 MHz (3840 cores, 10.944 TFLOPs/s) |
| Memory, Cache | 6143 MB, 840 KB global / 48 KB local |
| Buffer Limits | 1535 MB global, 64 KB constant |
| Info: OpenCL C code successfully compiled. |
| Grid Resolution | 256 x 256 x 256 = 16777216 |
| LBM Type | D3Q19 SRT (FP32/FP16C) |
| Memory Usage | CPU 272 MB, GPU 880 MB |
| Max Alloc Size | 608 MB |
| Time Steps | 10 |
| Kin. Viscosity | 1.00000000 |
| Relaxation Time | 3.50000000 |
| Reynolds Number | Re < 148 |
| MLUPs | Bandwidth | Steps/s | Current Step | Time Remaining |
| 3523 | 271 GB/s | 210 | 9996 60% | 0s |
| Info: Peak MLUPs/s = 3572 |

PS C:\Software\FluidX3D> .\FluidX3D-Benchmark-FP32-FP16S-Windows.exe
[ASCII-art banner, Moritz Lehmann]
| Device ID 0 | NVIDIA GeForce RTX 3060 Laptop GPU |
| Device ID 1 | Intel(R) Iris(R) Xe Graphics |
| Device ID | 0 |
| Device Name | NVIDIA GeForce RTX 3060 Laptop GPU |
| Device Vendor | NVIDIA Corporation |
| Device Driver | 512.78 |
| OpenCL Version | OpenCL C 1.2 |
| Compute Units | 30 at 1425 MHz (3840 cores, 10.944 TFLOPs/s) |
| Memory, Cache | 6143 MB, 840 KB global / 48 KB local |
| Buffer Limits | 1535 MB global, 64 KB constant |
| Info: OpenCL C code successfully compiled. |
| Grid Resolution | 256 x 256 x 256 = 16777216 |
| LBM Type | D3Q19 SRT (FP32/FP16S) |
| Memory Usage | CPU 272 MB, GPU 880 MB |
| Max Alloc Size | 608 MB |
| Time Steps | 10 |
| Kin. Viscosity | 1.00000000 |
| Relaxation Time | 3.50000000 |
| Reynolds Number | Re < 148 |
| MLUPs | Bandwidth | Steps/s | Current Step | Time Remaining |
| 3991 | 307 GB/s | 238 | 9989 90% | 0s |
| Info: Peak MLUPs/s = 4012 |

PS C:\Software\FluidX3D>

ProjectPhysX commented 2 years ago

@HAL9000COM thanks for the Vega 8 benchmarks! Quick question: Is your RAM 2x16GB DDR4-3200MT/s? And do you have an idea why the GPU shows up 3 times?

ProjectPhysX commented 2 years ago

Is it possible to add a ready-to-run benchmark for macOS so we can get more results on Mac? The test is bandwidth-limited, and Apple silicon should be good at this. Not to mention the relatively cheap 64GB+ of VRAM, since it shares the same main memory.

@edmond1992 unfortunately I don't have a Mac, so I can't compile and add the executables for macOS. But the code should work as-is; just compile it with the third line in make.sh and you'll get the FP32 benchmark. Uncomment FP16S/FP16C in src/defines.hpp and recompile to get the other 2 benchmarks.

edmond1992 commented 2 years ago

Cross compile?

On 23 Oct 2022, at 16:06, Moritz Lehmann @.***> wrote:



Is it possible to add ready-to-run benchmark for MacOS so we can get more result on Mac? Especially the test is bandwidth limited and Apple silicon should be good at this. Not to mention relatively cheap 64GB+ VRAM as they share the same main memory.

@edmond1992 unfortunately I don't have a Mac, so I can't compile and add the executables for macOS. But the code should work as-is; just compile it with the third line in make.sh and you'll get the FP32 benchmark. Uncomment FP16S/FP16C in src/defines.hpp and recompile to get the other 2 benchmarks.

HighDoping commented 2 years ago

@HAL9000COM thanks for the Vega 8 benchmarks! Quick question: Is your RAM 2x16GB DDR4-3200MT/s? And do you have an idea why the GPU shows up 3 times?

2x32GB DDR4-3200, overclocked to 3533. No idea why the GPU shows up multiple times. After a few reboots, it now shows up as two devices.

skoz90 commented 2 years ago

image image image

Nvidia Quadro RTX 5000

SLGY commented 2 years ago

GTX 1050 on an old gaming laptop. It's amazing I figured out how to even run this and get a benchmark. Now I'm going to try and figure out how to run the simulation on an STL (or similar) file. I know how to use Blender quite well, but this is my first time with Visual Studio or command line stuff. I'm so out of my depth here 😟

Screenshot (103)

SLGY commented 2 years ago

@ProjectPhysX have now added the FP16 benchmarks

RTX 3080 Ti

Updated FP32 (was concurrently baking a fluid in Blender when I ran the last one): FP32

FP16S: FP16S

FP16C: FP16C

ProjectPhysX commented 2 years ago

Hi @SirWixy, thank you so much for the benchmarks! Can you post the FP16S and FP16C results too?

gittigittibangbang commented 2 years ago

Quadro RTX 4000 below. I also tried two Xeon Gold 5218 (2x16 cores), with the FP32/FP32 benchmark they top out at 126MLUPs/s, 20GB/s and 8 steps/s. I did not have the patience to run it to the end. The speedup with GPUs is really dramatic, damn.

image image image

ProjectPhysX commented 2 years ago

@gittigittibangbang thanks for the benchmarks! For the CPU you can just stop it with Ctrl+C after it has leveled at constant performance, and take the last MLUPs/s reading. Can you post the program header with the Xeon Gold for the specs, and performance values for FP16S and FP16C too for the Xeon? Thanks!

gittigittibangbang commented 2 years ago

| Device ID 0 | Quadro RTX 4000 |
| Device ID 1 | Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz |
| Device ID | 1 |
| Device Name | Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz |
| Device Vendor | Intel(R) Corporation |
| Device Driver | 6.4.0.37 |
| OpenCL Version | OpenCL C 2.0 |
| Compute Units | 32 at 2300 MHz (16 cores, 1.178 TFLOPs/s) |
| Memory, Cache | 261766 MB, 256 KB global / 32 KB local |
| Buffer Limits | 65441 MB global, 128 KB constant |

FP32/FP32: 132MLUPs/s, 20GB/s bandwidth, 8 steps/s
FP32/FP16C: 270MLUPs/s, 21GB/s bandwidth, 16 steps/s
FP32/FP16S: 135MLUPs/s, 10GB/s bandwidth, 8 steps/s

ProjectPhysX commented 2 years ago

@gittigittibangbang last question: What is the RAM configuration on the Xeon Gold 5218? 8x 32GB DDR4 2667MT/s in quad-channel?

gittigittibangbang commented 2 years ago

Yes, 8x32GB DDR4 at 2667MHz, but apparently only dual channel according to CPUZ. It seems there's something amiss in the UEFI settings, it should be quad channel.

Michallote commented 2 years ago

Hey, thanks so much for the help setting things up. I have run the benchmarks on my GPU; I was very curious to see how it would perform. My GPU is an NVIDIA RTX 2060 KO, a version that uses higher-end chips which didn't pass the test to become RTX 2080s. So the actual chip is a TU104 (same as the RTX 2080 and Quadro RTX 4000), unlike most RTX 2060s, which have a TU106. As everything else is the same, it could be a decent comparison of those graphics processors:

FP32-FP32 image

FP32-FP16S image

FP32-FP16C image

However, these results might be a bit lower than they should be, because the max bandwidth of this GPU is 336.0 GB/s and it ran at only about 250.0 GB/s. Does anybody know if this is normal? I had a couple of apps open. I might rerun this later with the PC completely unloaded. In the meantime we can see that the difference between the RTX 2060 and RTX 2060 Super is huge!

ProjectPhysX commented 2 years ago

Hi @Michallote, many thanks for the 2060 KO benchmarks! The GPU chip itself does not matter too much. Performance purely follows memory bandwidth. You're getting ~75% efficiency which is typical for the Nvidia cards. It's due to the Esoteric-Pull swap algorithm using some misaligned write operations which are not at full bandwidth, for the benefit of cutting memory demand in half. The 2060 Super has 33% higher bandwidth and that reflects in performance.
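
The efficiency figure quoted above is simply measured bandwidth divided by the card's specified bandwidth. A minimal sketch using the numbers from this exchange (250 GB/s measured, 336 GB/s specified for the RTX 2060 KO) and the 448 GB/s figure for the RTX 2060 Super that is mentioned later in the thread. The function name is hypothetical.

```python
def efficiency(measured_gb_s: float, spec_gb_s: float) -> float:
    """Fraction of the specified memory bandwidth that the benchmark reaches."""
    return measured_gb_s / spec_gb_s


# RTX 2060 KO: ~250 GB/s measured, 336 GB/s specified
print(f"efficiency: {efficiency(250, 336):.0%}")

# RTX 2060 Super (448 GB/s) relative to the RTX 2060 KO (336 GB/s)
print(f"bandwidth advantage of the 2060 Super: {448 / 336 - 1:.0%}")
```

This gives roughly 74% efficiency and a 33% bandwidth advantage for the 2060 Super, in line with the figures in the reply.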

Blightbuster commented 2 years ago

GTX 1080 Ti

image image image

rodionstepanov commented 2 years ago

RTX 3090 Ti Doc1.pdf As we see, the bandwidth does not exceed 873GBps. However, the specification says it should be 1018GBps at max. Taking your estimate that a single lattice point requires 1241 FLOPs (FP32/FP16C) per time step, we obtain only 13.3 TFLOPs/s instead of 40. Am I right?

ProjectPhysX commented 2 years ago

Hi @rodionstepanov, thanks for the 3090 Ti benchmarks!! FluidX3D is bandwidth bound, so it uses all* the available memory bandwidth, but only a small fraction of the available TFLOPs/s. If you compare 2 GPUs with the same bandwidth, for example 3060 Ti and 2060 Super (both 448GB/s), they will perform the same in FluidX3D, despite the 3060 Ti having >double the TFLOPs/s of the 2060 Super.

*You see only ~873GB/s instead of 1008GB/s because the Esoteric-Pull streaming algorithm I use requires some misaligned write operations that cannot be at full bandwidth. You're at 87% overall efficiency which is very good already.
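
The bandwidth-bound argument can be checked with a back-of-the-envelope roofline estimate: performance is capped by whichever limit is lower, memory or compute. A minimal sketch using the figures from this exchange (1008 GB/s, 40 TFLOPs/s, 1241 FLOPs per lattice update for FP32/FP16C). The 77 bytes moved per update is an assumption (19 two-byte DDFs loaded and stored plus one flag byte), and the function name is hypothetical.

```python
def limits_mlups(bandwidth_gb_s, tflops, bytes_per_update, flops_per_update):
    """Return (memory-bound, compute-bound) performance limits in MLUPs/s."""
    memory_limit = bandwidth_gb_s * 1e9 / bytes_per_update / 1e6
    compute_limit = tflops * 1e12 / flops_per_update / 1e6
    return memory_limit, compute_limit


# RTX 3090 Ti, FP32/FP16C; 77 bytes per update is an assumed value
memory_limit, compute_limit = limits_mlups(1008, 40, 77, 1241)
print(f"memory limit:  {memory_limit:.0f} MLUPs/s")
print(f"compute limit: {compute_limit:.0f} MLUPs/s")
print("bound by:", "memory" if memory_limit < compute_limit else "compute")
```

Under these assumptions the memory limit is well below the compute limit, so even at 100% bandwidth efficiency only a fraction of the available TFLOPs/s would be used.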

The alternative to Esoteric-Pull would be the One-Step-Pull streaming algorithm, that avoids all misaligned write operations and can actually reach 100% efficiency on modern GPUs. However, its drawbacks are that it a) requires double the VRAM capacity for the same grid resolution and b) needs to load flags of neighboring grid points during streaming, so overall performance is actually lower than with Esoteric-Pull despite better efficiency. See this paper for details.

rodionstepanov commented 2 years ago

@ProjectPhysX that is clear. I definitely prefer higher resolutions, so Esoteric-Pull is my choice. Since the GPU is underloaded, it could be reasonable to use a more sophisticated algorithm that requires more FLOPs per lattice point per step without needing more bandwidth. Increased stability, for example, would be nice.

ProjectPhysX commented 2 years ago

@rodionstepanov memory that underperforms relative to FLOPs is a big problem across a lot of HPC software. Chip development has progressed much faster than memory development for over a decade now; the FLOPs/Byte ratio is ever increasing. Using the "spare" FLOPs to improve model accuracy without performance loss is a common strategy. I'm already leveraging that with FP16 memory compression, for 2-8x increase in FLOPs/Byte by cutting memory access in half and using spare FLOPs for number conversion to the more accurate FP16C format. Still it's all bandwidth-bound. Another possibility with LBM is a more sophisticated collision operator. So far though, the simple SRT/BGK collision has proven best for both accuracy and stability. I'll look into cumulant and central moment operators in the future.
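
The idea of trading spare FLOPs for less memory traffic can be illustrated with a generic 16-bit storage round trip. A minimal sketch, assuming plain IEEE-754 half precision from Python's standard `struct` module as a stand-in; it does not reproduce FluidX3D's FP16S or FP16C formats, and the helper names are hypothetical.

```python
import struct


def to_half_bytes(x: float) -> bytes:
    """Compress a value to 2 bytes (IEEE-754 half precision) for storage."""
    return struct.pack("<e", x)


def from_half_bytes(b: bytes) -> float:
    """Decompress 2 stored bytes back to a float for arithmetic."""
    return struct.unpack("<e", b)[0]


value = 0.1
stored = to_half_bytes(value)       # what would sit in memory
restored = from_half_bytes(stored)  # what the arithmetic would see

print("bytes as FP32:", len(struct.pack("<f", value)))
print("bytes as FP16:", len(stored))
print("round-trip error:", abs(restored - value))
```

Storage per value drops from 4 to 2 bytes at the cost of a conversion on every load and store and a small rounding error, which is the trade-off the comment describes.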

atesteve commented 2 years ago

This is a GTX 1650 on a laptop (under Linux): fp32 fp16c fp16s

trparry commented 2 years ago

image image image

ProjectPhysX commented 2 years ago

@trparry thank you so much! That 80GB A100 absolutely shreds! Can you please post the FP16S and FP16C benchmarks as well?

trparry commented 2 years ago

@ProjectPhysX yep! Just added them to original post.

IllesHUN commented 2 years ago

RTX 3050 laptop GPU

Also, I can't find the place where the .stl files for the setups have to be. I've done everything, but it always just says ... .stl does not exist

D3Q9_FP32 D3Q9_FP16S D3Q9_FP16C

Maere05 commented 2 years ago

Hi,

  1. Go to C:\FluidX3D-master\bin\
  2. Create a folder called "stl" and put your .stl files in there (binary only).
  3. In setup.cpp, change the lbm.voxelize_stl argument to "stl/myFilename.stl".

Cheers

IllesHUN commented 2 years ago

Thanks, I was doing that already, but I found out what the problem was. I am using the provided setup for the F1 car, and it had two extra dots in front of the directory that I had to remove (it took 3 straight hours to notice).

And I have another question. After I got the F1 car model voxelized and displayed, I started the simulation and noticed that the visualization on key 4 (isosurface, I think) didn't show anything visible to me, unlike when I tested the delta wing, which is built into the code rather than loaded from an .stl file. I also noticed that the simulation time is moving much slower. I don't know whether something is wrong or whether I shouldn't even expect the same results.

Thanks for all the help in advance. I have a middle-school level of C# knowledge, so it's pretty hard to understand what's happening, but at least I'm not clueless. Also, I started discovering CFD basically 2 days ago, so sorry if I'm asking stupid questions.

kendrickxy commented 2 years ago

The RTX 3080 Ti performed a little better than expected on FP16S:

FP16S: SRT:

MBench_FS16S_D3Q19_SRT

TRT:

MBench_FS16S_D3Q19_TRT

FP16C: SRT:

MBench_FS16C_D3Q19_SRT

TRT:

MBench_FS16C_D3Q19_TRT

Overall the same score with 256 grid resolution

MarcoAurelioFerrari commented 2 years ago

Just to confirm what was expected: RTX2060 TU106 image image image

MarcoAurelioFerrari commented 2 years ago

GTX 1660 image image image

lsvvt commented 2 years ago

RTX 3070 image image image

NarodGaming commented 2 years ago

Apple M1 Pro (10 Core CPU / 16 Core GPU / 16GB RAM)

Not bad for ~200GB/s memory bandwidth, though definitely low on the FP16C.

Screenshot 2022-11-08 at 20 47 04 Screenshot 2022-11-08 at 20 48 01 Screenshot 2022-11-08 at 20 54 03

ConfusedWizard commented 2 years ago

RTX 4090 FP32_FP32 FP32_FP16S FP32_FP16C