Xiangyu-Hu / SPHinXsys

SPHinXsys provides C++ APIs for engineering simulation and optimization. It aims at complex systems driven by fluid, structure, multi-body dynamics and beyond. The multi-physics library is based on a unique and unified computational framework by which strong coupling has been achieved for all involved physics.
https://www.sphinxsys.org/
Apache License 2.0
259 stars 199 forks source link

[SYCL] Improved SYCL particle sorting #480

Closed nR3D closed 5 months ago

nR3D commented 5 months ago

Changes

Performance

This implementation is now comparable to Intel's oneDPL execution for size_t/uint64_t keys, making it considerable faster than previous implementation (depending on the sorted data size)

benchmark

Other changes

DrChiZhang commented 5 months ago

Thx. BTW, would you please have a insight on the counting sort technique? https://on-demand.gputechconf.com/gtc/2014/presentations/S4117-fast-fixed-radius-nearest-neighbor-gpu.pdf

nR3D commented 5 months ago

@ChiZhangatTUM Thx. BTW, would you please have a insight on the counting sort technique? https://on-demand.gputechconf.com/gtc/2014/presentations/S4117-fast-fixed-radius-nearest-neighbor-gpu.pdf

It's an interesting approach thank you, I think it could be worth trying. It makes sense to have a bin counting while particles are inserted, however a single prefix sum over the whole set of cells would probably require a new scan algorithm that coordinates workitems across different workgroups (as of now, the SYCL prefix sum is used, which is a single group operation, but a larger number of bins would require more processors, so the standard scan methods would be too slow). There are already different implementations of system-wide prefix sum that can be considered when developing a SYCL version, alternatively I think that oneDPL already offers such method (but that would add a oneDPL dependency to SPHinXsys).

DrChiZhang commented 5 months ago

@ChiZhangatTUM Thx. BTW, would you please have a insight on the counting sort technique? https://on-demand.gputechconf.com/gtc/2014/presentations/S4117-fast-fixed-radius-nearest-neighbor-gpu.pdf

It's an interesting approach thank you, I think it could be worth trying. It makes sense to have a bin counting while particles are inserted, however a single prefix sum over the whole set of cells would probably require a new scan algorithm that coordinates workitems across different workgroups (as of now, the SYCL prefix sum is used, which is a single group operation, but a larger number of bins would require more processors, so the standard scan methods would be too slow). There are already different implementations of system-wide prefix sum that can be considered when developing a SYCL version, alternatively I think that oneDPL already offers such method (but that would add a oneDPL dependency to SPHinXsys).

It was reported that counting sort provides much more speedup, while, you can go back to this after your full implementation of sycl. If the speedup is impressive, adding more dependency is acceptable.