This PR removes an atomic read operation from the `find_cell_inner()` function. While the atomic read is valid, I found that some compilers apply overly strict memory ordering to the `#pragma omp atomic read` operation. Specifically, Intel was implementing it as an atomic compare-and-swap, which is far more expensive than a native atomic read. Removing the atomic sped up the cross surface kernel by about 2x on Intel and did not change performance on AMD or NVIDIA. Newer versions of the Intel compiler are supposed to have a smarter implementation that eliminates the performance problem. However, given the logic I've added to the comments of this section, I think it is better not to use an atomic at all, so that a future performance regression in some compiler's atomics cannot affect us.
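For readers unfamiliar with the construct, here is a minimal sketch of the kind of change involved. It is not the actual code from `find_cell_inner()`: the variable `shared_index` and the helpers `read_with_atomic()`, `read_plain()`, and `reads_agree()` are hypothetical names made up for illustration, and the sketch does not reproduce the reasoning (kept in the code comments) for why the plain read is safe in this kernel.

```cpp
// Hypothetical stand-in for a value that other threads may update.
// This is NOT the variable actually read in find_cell_inner().
static int shared_index = 7;

// Before: the load is wrapped in an OpenMP atomic read. Some compilers
// may lower this to something heavier than a plain load.
// (Without OpenMP enabled, the pragma is simply ignored.)
int read_with_atomic(int* p) {
  int v;
  #pragma omp atomic read
  v = *p;
  return v;
}

// After: an ordinary load with no atomic construct.
int read_plain(int* p) {
  return *p;
}

// Illustrative helper: in a single-threaded context both forms
// return the same value.
bool reads_agree(int* p) {
  return read_with_atomic(p) == read_plain(p);
}
```

In a single-threaded call such as `reads_agree(&shared_index)`, both versions return the same value. The only difference is what the compiler emits for the load.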