ProjectPhysX / FluidX3D

The fastest and most memory efficient lattice Boltzmann CFD software, running on all GPUs via OpenCL. Free for non-commercial use.
https://youtube.com/@ProjectPhysX

10-15% Speedup by enqueuing more at a time #231

Open Meerkov opened 3 weeks ago

Meerkov commented 3 weeks ago

https://github.com/ProjectPhysX/FluidX3D/blob/584f10a382b47cdec5972bcae27bbc83a8c70b23/src/lbm.cpp#L851

Tested on 2D Taylor Green Vortex

By default, I get around 2400-2500 steps per second. I'll use 2490 steps/s as my baseline.

I added the following simple modification.

    for (uint d = 0u; d < get_D(); d++) {
        for (uint step = 0u; step < 4u; step++) { // enqueue 4 steps before the next blocking synchronization
            lbm_domain[d]->increment_time_step();
            lbm_domain[d]->enqueue_stream_collide(); // run LBM stream_collide kernel after domain communication
        }
    }

This enqueues 4 steps at a time, before doing a blocking synchronization step.

On my PC, the display now reports 692 steps/s. Each counted step actually runs 4 LBM steps, so the real rate is 692 x 4 = 2768 steps/s (the counter only expects 1 step per iteration, so the displayed number is 4x too low).

2768/2490 is about 1.11, so roughly an 11% speedup.

You can enqueue more at a time, say 100 steps per iteration.

Now the output shows 29 steps/s, which implies a slightly faster 2900 steps/s (a 16% speedup). The downside is that rendering probably drops to just under 30 FPS (29 iterations per second at 100 steps each) instead of 60 FPS.

An ideal solution would probably adjust the number of enqueued steps dynamically, increasing it only as long as the rendering rate stays above 60 FPS.
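One way the dynamic adjustment could look, as a rough sketch only. `next_batch_size` is a hypothetical helper, not something in FluidX3D; it assumes the caller measures the wall time of the previous batch, and the upper limit of 1000 steps is an arbitrary choice.

```cpp
#include <algorithm>

// Hypothetical helper (not part of FluidX3D): choose how many LBM steps to enqueue
// before the next blocking synchronization, so that one batch takes about one frame.
// seconds_per_batch is the measured wall time of the previous batch (timed by the caller).
unsigned int next_batch_size(const unsigned int batch, const double seconds_per_batch, const double target_fps = 60.0) {
    if (batch == 0u) return 1u; // always enqueue at least one step
    if (seconds_per_batch <= 0.0) return batch; // no valid measurement yet, keep the current size
    const double frame_budget = 1.0 / target_fps; // seconds available per rendered frame
    const double seconds_per_step = seconds_per_batch / (double)batch;
    const double ideal = frame_budget / seconds_per_step; // number of steps that fit into one frame
    // clamp to [1, 1000]; the upper bound is an arbitrary safety limit (assumption)
    return (unsigned int)std::max(1.0, std::min(ideal, 1000.0));
}
```

At the 2900 steps/s measured above (a batch of 100 taking 100/2900 seconds), this would pick 48 steps per batch for a 60 FPS target.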

Edit: Make sure to remove the lbm_domain[d]->increment_time_step(); call that runs after synchronization, so the time step count stays correct.
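To illustrate why the extra increment has to go, here is a mock that only counts calls. `MockDomain` is a stand-in and not the real FluidX3D domain class; `finish_queue()` is an assumed name for the blocking synchronization, and `do_batched_time_steps` is a hypothetical wrapper.

```cpp
// Mock stand-in for an LBM domain, only to illustrate the step counting.
// This is NOT the FluidX3D class; finish_queue() is an assumed name for the blocking sync.
struct MockDomain {
    unsigned long long t = 0ull;       // time step counter
    unsigned long long kernels = 0ull; // number of stream_collide kernels enqueued
    void increment_time_step() { t++; }
    void enqueue_stream_collide() { kernels++; }
    void finish_queue() {} // the real call would block until the GPU queue is empty
};

// One loop iteration: enqueue `batch` steps, then synchronize once.
void do_batched_time_steps(MockDomain& domain, const unsigned int batch) {
    for (unsigned int step = 0u; step < batch; step++) {
        domain.increment_time_step();
        domain.enqueue_stream_collide();
    }
    domain.finish_queue();
    // no further increment_time_step() here, otherwise t would advance by batch+1 per iteration
}
```

After 10 iterations with a batch of 100, the counter and the number of enqueued kernels both equal 1000. With a leftover increment after the synchronization, the counter would read 1010 instead.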

Meerkov commented 3 weeks ago

Benchmark: Before (enqueue 1 at a time) - 1000 x 1000 runs

|-----------------.-----------------------------------------------------------|
| Grid Resolution |                                      32 x 32 x 32 = 32768 |
| Grid Domains    |                                             1 x 1 x 1 = 1 |
| LBM Type        |                                     D3Q19 SRT (FP32/FP32) |
| Memory Usage    |                                     CPU 0 MB, GPU 1x 2 MB |
| Max Alloc Size  |                                                      2 MB |
| Time Steps      |                                                      1000 |
| Kin. Viscosity  |                                                1.00000000 |
| Relaxation Time |                                                3.50000000 |
| Reynolds Number |                                            Re < 18.475208 |
|---------.-------'-----.-----------.-------------------.---------------------|
| MLUPs   | Bandwidth   | Steps/s   | Current Step      | Time Remaining      |
|     934 |    143 GB/s |     28514 |       999687  69% |                  0s |
|---------'-------------'-----------'-------------------'---------------------|
| Info: Peak MLUPs/s = 905                                                    |

After (enqueue 100 at a time) - 1000 x 10 runs

|-----------------.-----------------------------------------------------------|
| Grid Resolution |                                      32 x 32 x 32 = 32768 |
| Grid Domains    |                                             1 x 1 x 1 = 1 |
| LBM Type        |                                     D3Q19 SRT (FP32/FP32) |
| Memory Usage    |                                     CPU 0 MB, GPU 1x 2 MB |
| Max Alloc Size  |                                                      2 MB |
| Time Steps      |                                                        10 |
| Kin. Viscosity  |                                                1.00000000 |
| Relaxation Time |                                                3.50000000 |
| Reynolds Number |                                            Re < 18.475208 |
|---------.-------'-----.-----------.-------------------.---------------------|
| MLUPs   | Bandwidth   | Steps/s   | Current Step      | Time Remaining      |
|      24 |      4 GB/s |       718 |       996723 7230% |                  0s ||
|---------'-------------'-----------'-------------------'---------------------|
| Info: Peak MLUPs/s = 24                                                     |

Notice that the current step is the same (1 million total steps), but the reported MLUPs, bandwidth, and steps/s are all 100x lower than the real values. Scaled by 100, the output would read:


|---------.-------'-----.-----------.-------------------.---------------------|
| MLUPs   | Bandwidth   | Steps/s   | Current Step      | Time Remaining      |
|    2400 |    400 GB/s |     71800 |       996723 7230% |                  0s ||
|---------'-------------'-----------'-------------------'---------------------|

This implies that, in this toy example which takes nearly no GPU time, enqueueing more steps at a time gives roughly a 2.5x speedup (71800 vs. 28514 steps/s, so about 150% faster).

Meerkov commented 3 weeks ago

If I fix the stats and then test on a more strenuous benchmark (e.g. the default 256x256x256), the benefit goes away. That makes sense, because the benefit should mainly affect smaller simulations that synchronize too often relative to the amount of work per step:

|                                     \ /               FluidX3D Version 2.17 | // With custom enqueuing modification
|                                      '     Copyright (c) Dr. Moritz Lehmann |
|-----------------------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID    0 | NVIDIA GeForce RTX 3060                                    |
|----------------'------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID      | 0                                                          |
| Device Name    | NVIDIA GeForce RTX 3060                                    |
| Device Vendor  | NVIDIA Corporation                                         |
| Device Driver  | 552.22 (Windows)                                           |
| OpenCL Version | OpenCL C 1.2                                               |
| Compute Units  | 28 at 1867 MHz (3584 cores, 13.383 TFLOPs/s)               |
| Memory, Cache  | 12287 MB, 784 KB global / 48 KB local                      |
| Buffer Limits  | 3071 MB global, 64 KB constant                             |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled.                                  |
| Info: Allocating memory. This may take a few seconds.                       |
|-----------------.-----------------------------------------------------------|
| Grid Resolution |                                256 x 256 x 256 = 16777216 |
| Grid Domains    |                                             1 x 1 x 1 = 1 |
| LBM Type        |                                    D3Q19 SRT (FP32/FP16S) |
| Memory Usage    |                                 CPU 272 MB, GPU 1x 880 MB |
| Max Alloc Size  |                                                    608 MB |
| Time Steps      |                                                        10 |
| Kin. Viscosity  |                                                1.00000000 |
| Relaxation Time |                                                3.50000000 |
| Reynolds Number |                                                  Re < 148 |
|---------.-------'-----.-----------.-------------------.---------------------|
| MLUPs   | Bandwidth   | Steps/s   | Current Step      | Time Remaining      |
|    4012 |    309 GB/s |       239 |      1000000 10000% |                 13s |
|---------'-------------'-----------'-------------------'---------------------|
| Info: Peak MLUPs/s = 4013                                                   |

This MLUPs figure matches other benchmarks for the 3060, so the benefit likely matters only for smaller simulations that suffer from unnecessary CPU overhead.
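A rough way to put numbers on this, assuming a very simple cost model in which every step costs a fixed kernel time plus a synchronization overhead that is shared across the batch. The model and both function names are my own assumptions for illustration; nothing here was measured inside FluidX3D.

```cpp
// Simple cost model (an assumption, not measured inside FluidX3D):
//   time per step = kernel_time + sync_overhead / batch

// Estimate the per-synchronization overhead in seconds from two measured rates:
// steps/s with a batch size of 1, and steps/s with a batch size of `batch` (must be > 1).
double estimate_sync_overhead(const double rate_single, const double rate_batched, const unsigned int batch) {
    const double t_single = 1.0 / rate_single;   // kernel_time + sync_overhead
    const double t_batched = 1.0 / rate_batched; // kernel_time + sync_overhead / batch
    return (t_single - t_batched) / (1.0 - 1.0 / (double)batch);
}

// Fraction of the per-step time that is synchronization overhead at a given step rate.
// This is the most that batching could save under the model above.
double overhead_fraction(const double sync_overhead, const double steps_per_second) {
    return sync_overhead * steps_per_second;
}
```

Plugging in the 32x32x32 numbers (28514 steps/s unbatched, 71800 steps/s corrected for a batch of 100) gives an overhead of roughly 21 microseconds per synchronization. At 239 steps/s on the 256x256x256 grid, one step takes about 4.2 ms, so under this model the overhead is only around half a percent of the step time, which would be consistent with the benefit disappearing. The two runs use different precision settings, so treat this as an order-of-magnitude estimate only.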