ProjectPhysX / FluidX3D

The fastest and most memory efficient lattice Boltzmann CFD software, running on all GPUs via OpenCL. Free for non-commercial use.

Memory usage #71

Closed · randomwangran closed 1 year ago

randomwangran commented 1 year ago

I am currently testing the code by checking its memory usage.

With a setup like this:

| Memory Usage    |                             CPU 26112 MB, GPU 8x 20438 MB | 

It looks like it should be using 8×20 GB of memory across 8 GPUs, but according to nvidia-smi it seems to run only on GPU ID 0:

| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     99561      C   ./bin/FluidX3D                  40332MiB |
|    1   N/A  N/A     99561      C   ./bin/FluidX3D                    414MiB |
|    2   N/A  N/A     99561      C   ./bin/FluidX3D                    414MiB |
|    3   N/A  N/A     99561      C   ./bin/FluidX3D                    414MiB |

The code is running:

|---------.-------'-----.-----------.-------------------.---------------------|
| MLUPs   | Bandwidth   | Steps/s   | Current Step      | Time Remaining      |
|   13526 |   2069 GB/s |         8 |           40 100% |                  0s |   

Not sure what's going on here.

ProjectPhysX commented 1 year ago

You have selected 8 domains in the LBM constructor: LBM lbm(Nx, Ny, Nz, 2u, 2u, 2u, ...); (2x2x2=8), but only 4 GPUs are available. FluidX3D automatically tries to find 8 identical GPUs and assign one domain to each of them. If this is not possible (either not enough GPUs, or the GPUs are different models), it assigns all domains to the single fastest GPU and prints the warning "Not enough devices of the same type available. Using single fastest device for all domains.". This is what happens in your case, with 8 domains but only 4 GPUs available.

I suggest you reduce the number of domains to 4. One domain per GPU gives the best performance with the least communication overhead.
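As a minimal sketch, the only change needed would be the domain split in the LBM constructor of your setup (Nx, Ny, Nz and the remaining "..." arguments stand in for whatever your current setup already uses):

// 2x2x1 = 4 domains, one per GPU
LBM lbm(Nx, Ny, Nz, 2u, 2u, 1u, ...);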

But you can also manually assign multiple domains to fewer GPUs: bin/FluidX3D 0 0 1 1 2 2 3 3, or with the compile+run script ./make.sh 0 0 1 1 2 2 3 3. This will assign 2 domains to each of your 4 GPUs.
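To make the mapping explicit, here is a hedged sketch of the 8-domains-on-4-GPUs variant (one device ID is passed per domain; note that these are the OpenCL device IDs FluidX3D lists at startup, which may not match the nvidia-smi numbering):

// keep 2x2x2 = 8 domains in the setup
LBM lbm(Nx, Ny, Nz, 2u, 2u, 2u, ...);

// then launch with one device ID per domain, so each of the 4 GPUs hosts 2 domains
./make.sh 0 0 1 1 2 2 3 3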