Overview
In OpenMC, we sort particle queues by material and energy, which greatly improves the speed of the cross section lookup kernels.
By default, particle queues are transferred back to the host, sorted on the host CPU using a thread-parallel quicksort, and then transferred back to the device. For NVIDIA and Intel architectures, we can greatly speed up this process by sorting the particle queues in situ on the device. For NVIDIA this is done by linking to the Thrust library, and for Intel by linking to the oneAPI DPC++ Library (oneDPL).
This PR introduces on-device sorting for AMD GPUs as well, by linking to the HIP port of the Thrust library (rocThrust). The OpenMC source implementation is identical to the NVIDIA one (both make a single call to `thrust::sort`), but the CMake subprojects that compile the sorting operation are set up slightly differently.
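To illustrate what "sorting by material and energy" means, here is a minimal host-side sketch. The `QueueItem` struct, the `MaterialEnergyLess` comparator, and the `sort_queue` function are hypothetical names invented for this example and are not taken from the OpenMC source. It uses `std::sort` so that it runs without a GPU; the on-device version would pass an equivalent ordering to `thrust::sort` instead.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical, simplified stand-in for one entry of a particle queue.
// The real OpenMC type may carry different fields.
struct QueueItem {
  int64_t particle_index; // index of the particle in the particle buffer
  int material;           // ID of the material the particle is currently in
  double energy;          // particle energy
};

// Orders items by material first, then by energy within each material,
// so that neighboring queue entries touch similar cross section data.
struct MaterialEnergyLess {
  bool operator()(const QueueItem& a, const QueueItem& b) const {
    if (a.material != b.material) {
      return a.material < b.material;
    }
    return a.energy < b.energy;
  }
};

// Host-side sort used here for illustration only. With Thrust (CUDA or HIP),
// the analogous call would be thrust::sort(first, last, comp) on device data,
// with the comparator marked __host__ __device__.
void sort_queue(std::vector<QueueItem>& queue) {
  std::sort(queue.begin(), queue.end(), MaterialEnergyLess{});
}
```

Calling `sort_queue` on a queue groups all entries of the same material together and orders each group by increasing energy.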
In my testing on a single GCD of a JLSE MI250 GPU, running the XXL benchmark with 2M particles in flight, on-device sorting raised throughput by about 42%:
| Method | Inactive particles/sec |
| --- | --- |
| Host Sorting | 175,654 |
| On-Device Sorting (This PR) | 250,005 |
I also added a new compiler preset for the AMD MI250 since we now have some of those nodes at ANL.
How to Enable
The new HIP sorting for AMD can be enabled by adding `-Dhip_thrust_sort=on` to the cmake line.
Future work
Now that we can sort on-device for all three main GPU manufacturers, we may want to add some logic to CMake (or alter our `CMakePresets.json` file) that automatically selects the correct sorting library to link against, rather than requiring the user to enable this option manually.