Race condition in "op_cuda_reduction.h"

NVIDIA's "Racecheck" tool reports a race condition in op_cuda_reduction.h, and I think the report is correct. Each iteration of the reduction loop reads the element at index tid + d, so it requires that the thread which computes that element has already finished writing it.

https://github.com/OP-DSL/OP2-Common/blob/614866e3792dd930b534e5e35815d7d7464aeca9/op2/c/include/op_cuda_reduction.h#L119-L139

The thread that computes this element has the ID tid + d, so the executing thread tid must wait until thread tid + d has finished its write before reading. As far as I can see, this problem can be fixed by calling __syncwarp() in each iteration of the loop. Let me know if you would like me to submit a pull request.
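
To illustrate the pattern, here is a minimal standalone sketch. It is not the OP2 code: the kernel name `block_sum`, the `int` element type, and the assumption that `blockDim.x` is a power of two and a multiple of 32 are all hypothetical choices made for the example. The point is only where `__syncwarp()` sits: at the start of every single-warp iteration, outside the `tid < d` guard, so that all threads of the warp reach it.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel (not the OP2 reduction): sums blockDim.x ints per block.
// Assumes blockDim.x is a power of two and a multiple of 32.
__global__ void block_sum(const int *in, int *out) {
  extern __shared__ int temp[];
  int tid = threadIdx.x;
  temp[tid] = in[blockIdx.x * blockDim.x + tid];

  // Steps that span several warps need a block-wide barrier.
  int d = blockDim.x >> 1;
  for (; d >= warpSize; d >>= 1) {
    __syncthreads();
    if (tid < d) temp[tid] += temp[tid + d];
  }

  // Steps inside a single warp: thread tid reads temp[tid + d], which
  // thread tid + d wrote in the previous step, so the warp must be
  // synchronized before every read.
  if (tid < warpSize) {
    for (; d > 0; d >>= 1) {
      __syncwarp();  // called by all 32 threads, outside the guard below
      if (tid < d) temp[tid] += temp[tid + d];
    }
    if (tid == 0) out[blockIdx.x] = temp[0];
  }
}

int main() {
  const int threads = 128, blocks = 4, n = threads * blocks;
  int h_in[n], h_out[blocks];
  for (int i = 0; i < n; ++i) h_in[i] = 1;

  int *d_in = nullptr, *d_out = nullptr;
  cudaMalloc(&d_in, n * sizeof(int));
  cudaMalloc(&d_out, blocks * sizeof(int));
  cudaMemcpy(d_in, h_in, n * sizeof(int), cudaMemcpyHostToDevice);

  // Third launch parameter: dynamic shared memory, one int per thread.
  block_sum<<<blocks, threads, threads * sizeof(int)>>>(d_in, d_out);
  cudaMemcpy(h_out, d_out, blocks * sizeof(int), cudaMemcpyDeviceToHost);

  // Every input is 1, so each block's sum should equal its thread count.
  bool ok = true;
  for (int b = 0; b < blocks; ++b) ok = ok && (h_out[b] == threads);
  printf(ok ? "ok\n" : "mismatch\n");

  cudaFree(d_in);
  cudaFree(d_out);
  return ok ? 0 : 1;
}
```

Because the writes in each step go to indices below d and the reads come from indices d and above, one `__syncwarp()` per iteration is enough in this guarded form. A binary built from this sketch can be checked with `compute-sanitizer --tool racecheck`. Removing the `__syncwarp()` call should leave the kind of hazard described above.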
