Open jglaser opened 5 years ago
This had a PR (#155). It looks like it was closed during the reorganizations.
We should resurrect the PR, rebase it on main, and take a look; this sounds important.
@jglaser I know this issue is several years old, but do you happen to have some code that reproduces this or a recollection of how this situation occurred for you?
I'll summarize your report to make sure I understand it: WarpScanShfl::InclusiveScan, when used with a LOGICAL_WARP_SIZE less than 32, can execute the shfl.sync.up.b32 instruction with an invalid member_mask if some threads exit early and the compiler predicates off the mask initialization. Basically, this: https://www.godbolt.org/z/T6jrx75M6
I'm trying to find a repro where the compiler implements the early exit with an execution mask (predication) instead of an actual branch so I can test this, but it's not clear how this would happen.
We would need to get a reproducer on this to keep this open. @jglaser could you provide a recent reproducer?
I am afraid that I am not actively working on the CUB code inside HOOMD-blue that would allow me to test the above behavior again. Feel free to close; if I remember correctly it may have been a false positive with cuda-memcheck. If not, we can always reopen later.
When using WarpScanShfl from warp_scan_shfl.cuh inside a while() loop and in conjunction with a sub-warp LOGICAL_WARP_THREADS argument, i.e. LOGICAL_WARP_THREADS = 2^n with n < 5, I get lots of errors like these with cuda-memcheck --tool synccheck.
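A minimal sketch of the kind of usage I mean (the kernel name, block size, and exit condition here are made up for illustration, not the actual HOOMD-blue code):

```cpp
#include <cub/cub.cuh>

// Hypothetical reproducer pattern: a sub-warp cub::WarpScan issued from
// inside a while() loop that lanes of the same logical warp may leave at
// different iterations.
__global__ void subwarp_scan_in_loop(const int* in, int* out, int n_iter)
{
    constexpr int LOGICAL_WARP_THREADS = 4;     // sub-warp size, n < 5
    constexpr int BLOCK_THREADS        = 128;   // assumed launch configuration

    using WarpScan = cub::WarpScan<int, LOGICAL_WARP_THREADS>;
    __shared__ typename WarpScan::TempStorage
        temp_storage[BLOCK_THREADS / LOGICAL_WARP_THREADS];

    const int tid     = blockIdx.x * blockDim.x + threadIdx.x;
    const int warp_id = threadIdx.x / LOGICAL_WARP_THREADS;

    int iter = 0;
    while (iter < n_iter)
    {
        // Data-dependent early exit: some lanes of a logical warp may leave
        // the loop before the others.
        if (in[tid] + iter > 42)
            break;

        int scanned;
        WarpScan(temp_storage[warp_id]).InclusiveSum(in[tid] + iter, scanned);
        out[tid] = scanned;
        ++iter;
    }
}
```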
I believe the root cause is the following. WarpScanShfl sets its member_mask for the shfl_sync to reflect the sub-warp membership. However, if some threads exit early, the compiler may predicate off this initialization statement, leaving member_mask in an invalid state. Later, when the PTX instruction shfl.sync.idx.b32 is hit, it is executed without a predicate (such as @p) and thus with a wrong mask. cuda-memcheck then finds that the executing warp lane performs an implicit syncwarp without its mask bits set, and issues an error, as documented here: https://docs.nvidia.com/cuda/cuda-memcheck/index.html#synccheck-demo-illegal-syncwarp
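For concreteness, a simplified paraphrase of such a sub-warp member_mask computation (this is not the actual warp_scan_shfl.cuh code):

```cpp
// Paraphrased illustration of a sub-warp member_mask (not CUB's actual code).
// For LOGICAL_WARP_THREADS = 4, lanes 0-3 get 0x0000000F, lanes 4-7 get
// 0x000000F0, and so on.
__device__ unsigned subwarp_member_mask(unsigned logical_warp_threads)
{
    const unsigned lane     = threadIdx.x & 31u;
    const unsigned sub_warp = lane / logical_warp_threads;
    const unsigned base     = (1u << logical_warp_threads) - 1u;  // e.g. 0xF
    // Per the report above: if some threads exit early, the compiler may
    // predicate off this kind of initialization, so the mask later consumed
    // by an unpredicated shfl.sync can be invalid.
    return base << (sub_warp * logical_warp_threads);
}
```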
The safe solution would be to always use the full mask (0xffffffffu) to synchronize the entire warp. I realize this may not fully take advantage of Volta's independent thread scheduling. However, if that were the goal, I think the CUB API would have to expose the member_mask somehow to allow the user to set it, so that it is possible to issue e.g. a ballot_sync outside CUB first and then pass the member mask to CUB, as discussed here: https://devblogs.nvidia.com/using-cuda-warp-level-primitives/

I will submit a pull request shortly for this solution.
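As a rough sketch of the idea (using the __shfl_up_sync intrinsic as a stand-in for the PTX shuffle that WarpScanShfl emits; illustrative only, not the actual pull request):

```cpp
// Sketch of the proposed "safe" behavior: always synchronize the entire warp
// with the full mask instead of the sub-warp member_mask.
__device__ int shfl_up_full_mask(int value, unsigned member_mask,
                                 int logical_warp_threads)
{
    (void) member_mask;  // sub-warp mask no longer used to synchronize

    // Current behavior described above: __shfl_up_sync(member_mask, ...).
    // Proposed behavior: use the full warp mask, so a predicated-off
    // member_mask initialization cannot lead to a shuffle with an invalid mask.
    return __shfl_up_sync(0xffffffffu, value, 1, logical_warp_threads);
}
```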