Closed: andygill closed this issue 10 years ago
CUDA only has compare-and-swap style atomic instructions, so we can't build a lock that lets a thread perform multiple reads/writes atomically. This means that tuple components must be updated individually (since we use a struct-of-arrays representation in memory), which is why you are seeing them fall out of sync.
Not sure if there is a way around that, but open to suggestions.
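To illustrate the point about individual updates, here is a minimal host-side sketch in C++ with `std::atomic`. It is not the code the CUDA backend generates. `PairArray`, `atomic_update` and `combine_pair` are hypothetical names, and the sketch assumes each tuple component lives in its own array and gets its own compare-and-swap loop.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical struct-of-arrays layout for an array of (uint32, uint32) pairs:
// the two components of element i live in two separate arrays.
struct PairArray {
    std::vector<std::atomic<std::uint32_t>> firsts;
    std::vector<std::atomic<std::uint32_t>> seconds;

    explicit PairArray(std::size_t n) : firsts(n), seconds(n) {
        for (std::size_t i = 0; i < n; ++i) {
            firsts[i].store(0);
            seconds[i].store(0);
        }
    }
};

// Compare-and-swap loop: atomically replace one word with f(old value).
template <typename F>
void atomic_update(std::atomic<std::uint32_t>& cell, F f) {
    std::uint32_t old = cell.load();
    // On failure, compare_exchange_weak reloads `old` with the current value.
    while (!cell.compare_exchange_weak(old, f(old))) {
    }
}

// Updating a pair takes two independent CAS loops. Each one is atomic,
// but the pair as a whole is not.
template <typename F, typename G>
void combine_pair(PairArray& arr, std::size_t i, F f, G g) {
    atomic_update(arr.firsts[i], f);
    // Another thread can run here and see (or overwrite) a new first
    // component paired with an old second component.
    atomic_update(arr.seconds[i], g);
}
```

With a single thread this behaves as expected. With many threads writing to the same index, nothing stops another thread from interleaving between the two loops, which matches the out-of-sync tuples reported here.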
Thanks for the response. Perhaps we should restrict the types of arguments to permute, to at least stop this problem hitting others. I'll rewrite my code to use a fold, which should not have the same issue.
Just FYI, fold had the same issue. So I quantized the double into 16 bits, packed the pair into a 32-bit Word, and it worked (because there is a single lock).
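For anyone hitting the same problem, the packing workaround can be sketched roughly as follows, again as a host-side C++ analogue rather than the original program. Everything here is an assumption for illustration: the double is taken to lie in [0, 1], the second component is taken to be a 16-bit value, and the combination function is a plain maximum on the packed word (which is associative and commutative). The names `quantize16`, `pack_pair`, `combine` and `atomic_combine` are hypothetical.

```cpp
#include <algorithm>
#include <atomic>
#include <cmath>
#include <cstdint>

// Quantize a double to 16 bits, assuming it lies in [0, 1] (clamped otherwise).
inline std::uint16_t quantize16(double x) {
    double clamped = std::min(1.0, std::max(0.0, x));
    return static_cast<std::uint16_t>(std::lround(clamped * 65535.0));
}

// Pack two 16-bit components into one 32-bit word, first component in the high half.
inline std::uint32_t pack_pair(std::uint16_t a, std::uint16_t b) {
    return (static_cast<std::uint32_t>(a) << 16) | b;
}

inline std::uint16_t first_of(std::uint32_t w) {
    return static_cast<std::uint16_t>(w >> 16);
}

inline std::uint16_t second_of(std::uint32_t w) {
    return static_cast<std::uint16_t>(w & 0xFFFFu);
}

// Assumed combination function: maximum of the packed words, i.e. the pair
// with the larger first component wins, ties broken by the second component.
inline std::uint32_t combine(std::uint32_t x, std::uint32_t y) {
    return std::max(x, y);
}

// One CAS loop covers both components, so they cannot fall out of sync.
inline void atomic_combine(std::atomic<std::uint32_t>& cell, std::uint32_t value) {
    std::uint32_t old = cell.load();
    // On failure, compare_exchange_weak reloads `old` with the current value.
    while (!cell.compare_exchange_weak(old, combine(old, value))) {
    }
}
```

The trade-off is precision: the double is reduced to 16 bits so that the whole pair fits in a word the hardware can swap in one instruction.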
That sounds a bit odd for fold, do you still have your test program?
Permute on 64-bit types will work as well, at least on hardware of compute capability 1.2 and above.
This program gives different results for the interpreter and CUDA. The combination function is associative and commutative.
I suspect the atomic lock(s) on the tuple update are not being handled properly, and the tuples are getting out of sync.