Closed nR3D closed 5 months ago
Thx. BTW, would you please have a insight on the counting sort technique? https://on-demand.gputechconf.com/gtc/2014/presentations/S4117-fast-fixed-radius-nearest-neighbor-gpu.pdf
@ChiZhangatTUM Thx. BTW, would you please have a insight on the counting sort technique? https://on-demand.gputechconf.com/gtc/2014/presentations/S4117-fast-fixed-radius-nearest-neighbor-gpu.pdf
It's an interesting approach thank you, I think it could be worth trying. It makes sense to have a bin counting while particles are inserted, however a single prefix sum over the whole set of cells would probably require a new scan algorithm that coordinates workitems across different workgroups (as of now, the SYCL prefix sum is used, which is a single group operation, but a larger number of bins would require more processors, so the standard scan methods would be too slow). There are already different implementations of system-wide prefix sum that can be considered when developing a SYCL version, alternatively I think that oneDPL already offers such method (but that would add a oneDPL dependency to SPHinXsys).
@ChiZhangatTUM Thx. BTW, would you please have a insight on the counting sort technique? https://on-demand.gputechconf.com/gtc/2014/presentations/S4117-fast-fixed-radius-nearest-neighbor-gpu.pdf
It's an interesting approach thank you, I think it could be worth trying. It makes sense to have a bin counting while particles are inserted, however a single prefix sum over the whole set of cells would probably require a new scan algorithm that coordinates workitems across different workgroups (as of now, the SYCL prefix sum is used, which is a single group operation, but a larger number of bins would require more processors, so the standard scan methods would be too slow). There are already different implementations of system-wide prefix sum that can be considered when developing a SYCL version, alternatively I think that oneDPL already offers such method (but that would add a oneDPL dependency to SPHinXsys).
It was reported that counting sort provides much more speedup, while, you can go back to this after your full implementation of sycl. If the speedup is impressive, adding more dependency is acceptable.
Changes
DeviceRadixSort
Performance
This implementation is now comparable to Intel's oneDPL execution for
size_t
/uint64_t
keys, making it considerable faster than previous implementation (depending on the sorted data size)Other changes
NeighborhoodDevice
for the 3d case has been increased, since the previous value was not sufficient for larger simulations