OP-DSL / OP2-Common

OP2: open-source framework for the execution of unstructured grid applications on clusters of GPUs or multi-core CPUs
https://op-dsl.github.io

Race condition in "op_cuda_reduction.h" #185

Closed. m-8k closed this issue 3 years ago

m-8k commented 3 years ago

NVIDIA's Racecheck tool reports a race condition in op_cuda_reduction.h, and I think the report is correct: each iteration of the loop reads the element at index tid + d, so the thread that writes that element must already have finished its write.

https://github.com/OP-DSL/OP2-Common/blob/614866e3792dd930b534e5e35815d7d7464aeca9/op2/c/include/op_cuda_reduction.h#L119-L139

That element is written by the thread with ID tid + d, so thread tid has to wait for thread tid + d to complete its write before reading it. As far as I can see, this can be fixed by calling __syncwarp() in each iteration of the loop. Let me know if you would like me to submit a pull request.
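To illustrate the idea, here is a minimal standalone sketch of a shared-memory block reduction whose final, single-warp stage calls __syncwarp() in every iteration. It is not the OP2 code: the kernel name block_sum, the float sum, and the assumption that blockDim.x is a power of two of at least 32 are all mine, chosen only to keep the example short. It also assumes CUDA 9 or later, where __syncwarp() is available.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel (not the OP2 implementation): one partial sum per block.
// Assumes blockDim.x is a power of two and >= 32.
__global__ void block_sum(const float *in, float *out, int n) {
  extern __shared__ float temp[];
  int tid = threadIdx.x;
  int gid = blockIdx.x * blockDim.x + tid;
  temp[tid] = (gid < n) ? in[gid] : 0.0f;

  // Stages that cross warp boundaries need a block-wide barrier.
  for (int d = blockDim.x / 2; d >= warpSize; d >>= 1) {
    __syncthreads();
    if (tid < d) temp[tid] += temp[tid + d];
  }

  // Final stage inside the first warp.
  if (tid < warpSize) {
    for (int d = warpSize / 2; d > 0; d >>= 1) {
      // Thread tid reads temp[tid + d], which thread tid + d wrote in the
      // previous step. Without this barrier the read can race with that write.
      // It sits outside the `tid < d` branch so every thread of the warp
      // reaches it.
      __syncwarp();
      if (tid < d) temp[tid] += temp[tid + d];
    }
    if (tid == 0) out[blockIdx.x] = temp[0];
  }
}

int main() {
  const int n = 1024, threads = 256, blocks = n / threads;
  float *in, *out;
  cudaMallocManaged(&in, n * sizeof(float));
  cudaMallocManaged(&out, blocks * sizeof(float));
  for (int i = 0; i < n; ++i) in[i] = 1.0f;

  // Third launch parameter: dynamic shared memory, one float per thread.
  block_sum<<<blocks, threads, threads * sizeof(float)>>>(in, out, n);
  cudaDeviceSynchronize();

  float total = 0.0f;
  for (int b = 0; b < blocks; ++b) total += out[b];
  printf("sum = %f (n = %d)\n", total, n);

  cudaFree(in);
  cudaFree(out);
  return 0;
}
```

Placing the barrier at the top of each iteration means the first warp-level step is also ordered after the last multi-warp step, whose writes to temp[0..31] come from the same warp.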

reguly commented 3 years ago

Yes, that looks right. This code predates cooperative groups and has not been updated since. Please do submit the pull request. Thanks for spotting this!