__shfl() operation lanes don't wrap around

__shfl() segfaults when being fed a negative lane id. While HIP on it's own right doesn't define what the behavior of __shfl() should be in this case, the CUDA docs do:

If srcLane is outside the range [0:width-1], the value returned corresponds to the value of var held by the srcLane modulo width (i.e. within the same subsection).

HIP follows this behavior as it relays to __builtin_amdgcn_ds_bpermute described in the Vega ISA docs in the case of width == 64 under opcode 63:

// VGPR[index][thread_id] is the VGPR RAM // VDST, ADDR and DATA0 are from the microcode DS encoding tmp[0..63] = 0 for i in 0..63 do // ADDR needs to be divided by 4. // High-order bits are ignored. src_lane = floor((VGPR[ADDR][i] + OFFSET) / 4) mod 64

In practice, the code in question which triggered the segfault issued a __shfl() with negative source lane id but then discards the result (doesn't commit it to memory, but saves the branching for checking if said operation is valid in the first place). Because CUDA and HIP handle this is device code, I'd lean toward the code being valid (ugly nontheless).

HIP-CPU should take a stand and define some kind of behavior that doesn't segfault.

_(Note that the issue is slightly graver, because HIP in general doesn't define any behavior for these operations. According to the RDNA ISA docs p. 211, the wrapping behavior for the permute operations are still based on 64, even though the underlying arch uses 32 wide wavefronts. The underdefined nature of __shfl() in HIP, while following CUDA behavior when using GCN/CDNA derivates but deviating for RDNA derivates is going to bite hard in the future.)_

Whoops, segfaults are bad, this'll be fixed. Seems that you've also found a gap in HIP unit tests in general, as negative indices do not appear to be covered anywhere - nice!

I think that if we go with the way in which this is worded in CUDA we're left no better, as the remainder always carries the sign of the dividend, and thus srcLane module width will yield a negative quantity. That being said, a more workable (de facto, as opposed to de jure) definition appears to exist in https://www.informit.com/articles/article.aspx?p=2103809&seqNum=6, "[...]__shfl() returns the value of var held by the thread whose ID is given by srcLane. If srcLane is outside the range 0..width-1, the thread’s own value of var is returned". This also feels possibly more useful than treating a negative quantity as a really large unsigned quantity and essentially broadcasting the value in the highest active lane, which could be another interpretation of the CUDA Programming guide. I suspect that this is UB and it is not stated as such either accidentally or intentionally, but that's a different kettle of fish; if we're OK with the behaviour from the Handbook, I'd go with that one.

ROCm / HIP-CPU

__shfl() operation lanes don't wrap around #14