Closed adayton1 closed 3 months ago
@MrBurmark, would you be willing to look at the changes to hip atomics? The previous implementation of atomicCAS relied on volatile, so I changed the implementation to use an atomic load. To avoid a circular dependency, I've changed in the implementation of atomicLoad to use the intrinsic if available, otherwise I fall back to atomicOr(address, 0). Does that make sense, or would it be better to fall back to atomicCAS(address, 0, 0) or something else? I think atomicOr will be better than atomicAdd, but I'm not sure if atomicCAS can avoid the write in some cases. I also modified atomicExchange to use reinterpret casting instead of atomicCAS in a loop, then made atomicStore use atomicExchange if the intrinsic is not available.
If these changes make sense, then I will do basically the same thing in CUDA and then we can fully get rid of volatile.
Its best to avoid using atomicCAS(0,0)
where possible as its slower than doing something like atomicOr(0)
or atomicAdd(0)
. Between atomicOr(0)
or atomicAdd(0)
I don't know which is faster, maybe #1624 could provide some insight?
I'll be out most of this afternoon and all day tomorrow, so if the tests come back passing, feel free to merge when you think it is ready.
This is looking pretty good. You double checked which types were available in which hardware for cuda and hip? It looks like the cuda and hip backend are more similar now but they are still a bit different on which types are supported for which hardware.
This is looking pretty good. You double checked which types were available in which hardware for cuda and hip? It looks like the cuda and hip backend are more similar now but they are still a bit different on which types are supported for which hardware.
Yeah, I double checked the types. The main differences are that Hip doesn't provide an atomicInc or atomicDec, and CUDA supports an additional type for atomicMin and atomicMax.
I keep hitting unrelated errors in the CI:
[info: cloning spack develop branch from github]
[exe: git clone --single-branch --depth=1 -b develop-2024-05-26 https://github.com/spack/spack.git spack]
Cloning into 'spack'...
error: RPC failed; curl 18 transfer closed with outstanding read data remaining
error: 3939 bytes of body are still expected
fatal: the remote end hung up unexpectedly
fatal: early EOF
fatal: index-pack failed
[spack python: /bin/sh: /dev/shm/lassen53-1939417/spack/bin/spack: No such file or directory]
[Checking for concretizer options...]
[disabling config scope (except defaults) in: /dev/shm/lassen53-1939417/spack/lib/spack/spack/config.py]
Traceback (most recent call last):
File "./scripts/uberenv/uberenv.py", line 1389, in
Summary
__CUDA_ARCH__
is less than 350 since RAJA requires that as the minimum supported architecture anyway