nR3D commented 5 months ago

Changes

Improved performance of SYCL particle sorting by executing global buckets scan in a separate kernel
Reduce memory allocation by reusing data buffers whenever sorting size is unchanged from the last use. For this purpose, sorting methods relative to SYCL have been moved inside a new class DeviceRadixSort

Performance

This implementation is now comparable to Intel's oneDPL execution for size_t/uint64_t keys, making it considerable faster than previous implementation (depending on the sorted data size)

benchmark

Other changes

Allocation size of NeighborhoodDevice for the 3d case has been increased, since the previous value was not sufficient for larger simulations

DrChiZhang commented 5 months ago

Thx. BTW, would you please have a insight on the counting sort technique? https://on-demand.gputechconf.com/gtc/2014/presentations/S4117-fast-fixed-radius-nearest-neighbor-gpu.pdf

nR3D commented 5 months ago

@ChiZhangatTUM Thx. BTW, would you please have a insight on the counting sort technique? https://on-demand.gputechconf.com/gtc/2014/presentations/S4117-fast-fixed-radius-nearest-neighbor-gpu.pdf

It's an interesting approach thank you, I think it could be worth trying. It makes sense to have a bin counting while particles are inserted, however a single prefix sum over the whole set of cells would probably require a new scan algorithm that coordinates workitems across different workgroups (as of now, the SYCL prefix sum is used, which is a single group operation, but a larger number of bins would require more processors, so the standard scan methods would be too slow). There are already different implementations of system-wide prefix sum that can be considered when developing a SYCL version, alternatively I think that oneDPL already offers such method (but that would add a oneDPL dependency to SPHinXsys).

DrChiZhang commented 5 months ago

@ChiZhangatTUM Thx. BTW, would you please have a insight on the counting sort technique? https://on-demand.gputechconf.com/gtc/2014/presentations/S4117-fast-fixed-radius-nearest-neighbor-gpu.pdf

It's an interesting approach thank you, I think it could be worth trying. It makes sense to have a bin counting while particles are inserted, however a single prefix sum over the whole set of cells would probably require a new scan algorithm that coordinates workitems across different workgroups (as of now, the SYCL prefix sum is used, which is a single group operation, but a larger number of bins would require more processors, so the standard scan methods would be too slow). There are already different implementations of system-wide prefix sum that can be considered when developing a SYCL version, alternatively I think that oneDPL already offers such method (but that would add a oneDPL dependency to SPHinXsys).

It was reported that counting sort provides much more speedup, while, you can go back to this after your full implementation of sycl. If the speedup is impressive, adding more dependency is acceptable.

Xiangyu-Hu / SPHinXsys

[SYCL] Improved SYCL particle sorting #480

Changes

Performance

Other changes