Closed Meerkov closed 1 month ago
I tested on my machine with 64x64 and D2Q9. I found maybe a 0.5% speedup when using SoA.
On 256x256 D2Q9 I see maybe 0.1% speedup. I think you're correct that it's faster, but I'm not sure what test was run to get the 229% speedup.
Hi @Meerkov,
you can easily test SoA vs. AoS with the standard 256³ BENCHMARK setup, by modifying the index_f() function here:
ulong index_f(const uxx n, const uint i) { // 64-bit indexing
	//return (ulong)i*def_N+(ulong)n; // SoA
	return (ulong)n*(ulong)def_velocity_set+(ulong)i; // AoS
}
You will see that SoA is indeed >2x faster on GPUs (and also faster on CPUs).
The OpenCL code is vectorized: blocks of 64 threads are executed at once, and each thread computes one cell. With SoA, for each Density Distribution Function (DDF) f0, f1, ... individually, consecutive threads T0, T1, ... access consecutive memory locations, resulting in coalesced memory access - the only way to achieve peak memory bandwidth:
Structure of Arrays (SoA) memory layout (memory access is coalesced):
          .------------.------------.--- cell 0 ---
          |  .---------|--.---------|--.--- cell 1 ---
          |  |  .------|--|--.------|--|--.--- cell 2 ---
          |  |  |      |  |  |      |  |  |
DDFs      f0 f0 f0 ... f1 f1 f1 ... f2 f2 f2 ...
          |  |  |      |  |  |      |  |  |
Threads   T0 T1 T2 ... T0 T1 T2 ... T0 T1 T2 ...
With AoS, although each cell's DDFs are co-located in memory, consecutive GPU threads no longer access consecutive memory addresses, and memory coalescing fails:
Array of Structures (AoS) memory layout (no memory coalescence):
          .--- cell 0 ---.  .--- cell 1 ---.  .--- cell 2 ---.
          |              |  |              |  |              |
DDFs      f0 f1 f2 ... f18  f0 f1 f2 ... f18  f0 f1 f2 ... f18  ...
          |  |  |      |    |  |  |      |    |  |  |      |
Threads   T0 T0 T0 ... T0   T1 T1 T1 ... T1   T2 T2 T2 ... T2   ...
Kind regards, Moritz
https://github.com/ProjectPhysX/FluidX3D/blob/18738d9ef4fa1d58a06c9f34d337962657d53d13/src/kernel.cpp#L911
Hi, I've been very interested in this project. I noticed that the index here seems to organize memory like this:
In my mind, this would mean that a single GPU thread needs to touch data that is very far apart. Each thread only needs one Velocity[0], so this guarantees a cache miss. However, if it were organized like this:
Then this would guarantee that half the data needed for each Pull is co-located.
That's how I read it in the paper, but the code doesn't agree.
See:
https://github.com/ProjectPhysX/FluidX3D/blob/18738d9ef4fa1d58a06c9f34d337962657d53d13/src/lbm.cpp#L294C40-L295C45
https://github.com/ProjectPhysX/FluidX3D/blob/18738d9ef4fa1d58a06c9f34d337962657d53d13/src/lbm.hpp#L117C85-L118C103
There is a cryptic comment which says "SoA 229% faster on GPU", but is that accurate for this workload? The goal is co-location of data, and SoA guarantees the memory is spread across Q different locations (e.g. 27 in D3Q27), whereas shouldn't AoS only need to touch 14 locations?