NVIDIA's Racecheck tool reports a race condition in op_cuda_reduction.h, and I think the report is correct. Each iteration of the reduction loop in op_cuda_reduction.h reads the element at index tid + d, so it requires that the thread that computes that element has already finished writing it.
That element is written by thread tid + d, so thread tid has to wait until thread tid + d has finished its previous iteration. As far as I can see, this can be fixed by calling __syncwarp() in each iteration of the loop. Let me know if you would like me to submit a pull request.
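To illustrate the idea, here is a minimal standalone sketch of a single-warp shared-memory reduction with the proposed synchronization. It is not the OP2 code: the kernel `warp_sum`, the host helper `run_warp_sum`, and the fixed size of 32 elements are hypothetical names and choices made only for this example, and it assumes CUDA 9 or later, where `__syncwarp()` is available.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel (not the OP2 source): sums 32 ints within one warp.
__global__ void warp_sum(const int *in, int *out) {
  __shared__ int temp[32];
  int tid = threadIdx.x;

  temp[tid] = in[tid];
  __syncwarp();  // initial stores must be visible before the first read

  for (int d = warpSize / 2; d > 0; d >>= 1) {
    if (tid < d) {
      // Reads the value that thread tid + d wrote in the previous iteration.
      temp[tid] += temp[tid + d];
    }
    // Called by every thread of the warp, outside the divergent branch,
    // so this iteration's writes are visible to the next iteration's reads.
    __syncwarp();
  }

  if (tid == 0) *out = temp[0];
}

// Hypothetical host helper: runs the kernel on 0..31 and returns the result.
int run_warp_sum() {
  int h_in[32], h_out = -1;
  for (int i = 0; i < 32; ++i) h_in[i] = i;

  int *d_in = nullptr, *d_out = nullptr;
  cudaMalloc(&d_in, sizeof(h_in));
  cudaMalloc(&d_out, sizeof(int));
  cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);

  warp_sum<<<1, 32>>>(d_in, d_out);

  // cudaMemcpy on the default stream waits for the kernel to finish.
  cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);
  cudaFree(d_in);
  cudaFree(d_out);
  return h_out;
}

int main() {
  int expected = 0;
  for (int i = 0; i < 32; ++i) expected += i;  // host-side reference sum
  int got = run_warp_sum();
  printf("device sum = %d, host sum = %d\n", got, expected);
  return got == expected ? 0 : 1;
}
```

In this sketch one `__syncwarp()` per iteration is enough because, within an iteration, the active threads read from indices d and above and write only to indices below d, so the only dependency is between one iteration's writes and the next iteration's reads.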
Yes, that looks right. This code predates cooperative groups and has not been updated since... Please do submit the pull request. Thanks for spotting this!
Code in question (op_cuda_reduction.h, lines 119-139): https://github.com/OP-DSL/OP2-Common/blob/614866e3792dd930b534e5e35815d7d7464aeca9/op2/c/include/op_cuda_reduction.h#L119-L139