ProjectPhysX / FluidX3D


Struct of Arrays vs Arrays of Structs: Possible Oversight? #193

Closed: Meerkov closed this issue 1 month ago

Meerkov commented 1 month ago

https://github.com/ProjectPhysX/FluidX3D/blob/18738d9ef4fa1d58a06c9f34d337962657d53d13/src/kernel.cpp#L911

Hi, I've been very interested in this project. I noticed that the index here seems to organize memory like this:

  1. Node[0] Velocity[0]
  2. Node[1] Velocity[0]
  3. Node[2] Velocity[0] ...
  4. Node[0] Velocity[1]
  5. Node[1] Velocity[1]
  6. ...

In my mind, this means that a single GPU thread needs to touch data that is very far apart. Each thread only needs one Velocity[0], so this guarantees a cache miss. However, if it were organized like this:

  1. Node[0] Velocity[0]
  2. Node[0] Velocity[1]
  3. Node[0] Velocity[2] ...

Then this would guarantee that half of the data needed for each pull is colocated.

That's how I read it in the paper, but the code doesn't agree.
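
Something like the following is what I expected (just a sketch, reusing the repo's identifiers; index_f_aos is a hypothetical name, not a function in the repo):

    ulong index_f_aos(const uxx n, const uint i) { // hypothetical AoS indexing: all Q DDFs of cell n stored contiguously
        return (ulong)n*(ulong)def_velocity_set+(ulong)i;
    }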

See:

ulong index_f(const uxx n, const uint i) { // 64-bit indexing (maximum 2^32 lattice points (1624^3 lattice resolution, 225GB))
    return (ulong)i*def_N+(ulong)n; // SoA (229% faster on GPU)
}

https://github.com/ProjectPhysX/FluidX3D/blob/18738d9ef4fa1d58a06c9f34d337962657d53d13/src/lbm.cpp#L294C40-L295C45

    "\n #define def_N "+to_string(get_N())+"ul"

https://github.com/ProjectPhysX/FluidX3D/blob/18738d9ef4fa1d58a06c9f34d337962657d53d13/src/lbm.hpp#L117C85-L118C103

    ulong get_N() const { return (ulong)Nx*(ulong)Ny*(ulong)Nz; } // get (local) number of lattice points

There is a cryptic comment that says "SoA (229% faster on GPU)", but is that accurate for this workload? The goal is colocation of data, and SoA guarantees the memory is in Q different locations (e.g. 27 in D3Q27), but shouldn't it only need to touch 14 locations in AoS?
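
To put rough numbers on the locality concern, here is a minimal standalone sketch (plain C++, example sizes only, not the actual kernel code) of where one cell's DDFs end up under the repo's SoA indexing:

    #include <cstdint>
    #include <cstdio>

    int main() {
        const uint64_t def_N = 256ull*256ull*256ull; // example lattice size (256^3 cells)
        const uint64_t n     = 42ull;                // one arbitrary cell
        // SoA: DDF i of cell n sits at element i*def_N + n, so the Q values of a single
        // cell are def_N elements (64 MiB for 32-bit floats at 256^3) apart.
        for(uint32_t i=0u; i<3u; i++) {
            printf("f%u of cell %llu -> element %llu\n", i, (unsigned long long)n, (unsigned long long)((uint64_t)i*def_N+n));
        }
        return 0; // with the AoS layout sketched above, the same values would sit at n*Q+0, n*Q+1, n*Q+2, ...
    }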

Meerkov commented 1 month ago

I tested on my machine with 64x64 and D2Q9. I found maybe a 0.5% speedup when using SoA.

On 256x256 D2Q9 I see maybe 0.1% speedup. I think you're correct that it's faster, but I'm not sure what test was run to get the 229% speedup.

ProjectPhysX commented 1 month ago

Hi @Meerkov,

you can easily test SoA vs. AoS with the standard 256³ BENCHMARK setup, by modifying the index_f() function here:

ulong index_f(const uxx n, const uint i) { // 64-bit indexing
    //return (ulong)i*def_N+(ulong)n; // SoA
    return (ulong)n*(ulong)def_velocity_set+(ulong)i; // AoS
}
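
For reference (assuming the stock configuration), the 256³ benchmark setup is enabled in src/defines.hpp by uncommenting one define:

    // in src/defines.hpp: uncomment to run the built-in peak-performance benchmark setup
    #define BENCHMARK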

You will see that SoA is indeed >2x faster on GPUs (and also faster on CPUs).


But why is SoA faster, when the data is not colocated?

The OpenCL code is vectorized. Blocks of 64 threads are executed at once, and each thread computes one cell. With SoA, for each density distribution function (DDF) f0, f1, ... individually, consecutive threads T0, T1, ... access consecutive memory locations, resulting in coalesced memory access, which is the only way to achieve peak memory bandwidth:

Structure of Arrays (SoA) memory layout (memory access is coalesced):

        .------------.------------.--- cell 0 ---
        |  .---------|--.---------|--.--- cell 1 ---
        |  |  .------|--|--.------|--|--.--- cell 2 ---
        |  |  |      |  |  |      |  |  |
   DDFs f0 f0 f0 ... f1 f1 f1 ... f2 f2 f2 ...
        |  |  |      |  |  |      |  |  |
Threads T0 T1 T2 ... T0 T1 T2 ... T0 T1 T2 ...

With AoS, although the data of each grid cell is colocated, consecutive GPU threads no longer access consecutive memory addresses, and memory coalescing fails:

Array of Structures (AoS) memory layout (no memory coalescence):

        .-- cell 0 --.   .-- cell 1 --.   .-- cell 2 --. 
        |            |   |            |   |            | 
   DDFs f0 f1 f2 ... f18 f0 f1 f2 ... f18 f0 f1 f2 ... f18 ...
        |  |  |      |   |  |  |      |   |  |  |      |
Threads T0 T0 T0 ... T0  T1 T1 T1 ... T1  T2 T2 T2 ... T2  ...
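
The difference is easy to see by printing the element offsets that a few consecutive threads compute for one fixed DDF load (a standalone sketch with example sizes; in the actual kernel these offsets come from index_f()):

    #include <cstdint>
    #include <cstdio>

    int main() {
        const uint64_t N = 256ull*256ull*256ull; // number of lattice cells (example)
        const uint32_t Q = 19u;                  // DDFs per cell, e.g. D3Q19 (example)
        const uint32_t i = 5u;                   // one fixed DDF index, loaded by all threads in the same instruction
        for(uint32_t t=0u; t<8u; t++) {          // 8 consecutive threads, each handling cell n = t
            const uint64_t soa = (uint64_t)i*N+(uint64_t)t;           // SoA: consecutive threads -> consecutive elements (coalesced)
            const uint64_t aos = (uint64_t)t*(uint64_t)Q+(uint64_t)i; // AoS: consecutive threads -> elements Q apart (strided)
            printf("T%u: SoA element %llu, AoS element %llu\n", t, (unsigned long long)soa, (unsigned long long)aos);
        }
        return 0;
    }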

Kind regards, Moritz