Closed MathiasMagnus closed 3 years ago
Whoops, segfaults are bad, this'll be fixed. Seems that you've also found a gap in HIP unit tests in general, as negative indices do not appear to be covered anywhere - nice!
I think that if we go with the way in which this is worded in CUDA we're left no better, as the remainder always carries the sign of the dividend, and thus srcLane module width will yield a negative quantity. That being said, a more workable (de facto, as opposed to de jure) definition appears to exist in https://www.informit.com/articles/article.aspx?p=2103809&seqNum=6, "[...]__shfl()
returns the value of var
held by the thread whose ID is given by srcLane
. If srcLane
is outside the range 0..width-1
, the thread’s own value of var
is returned". This also feels possibly more useful than treating a negative quantity as a really large unsigned quantity and essentially broadcasting the value in the highest active lane, which could be another interpretation of the CUDA Programming guide. I suspect that this is UB and it is not stated as such either accidentally or intentionally, but that's a different kettle of fish; if we're OK with the behaviour from the Handbook, I'd go with that one.
__shfl()
segfaults when being fed a negative lane id. While HIP on it's own right doesn't define what the behavior of__shfl()
should be in this case, the CUDA docs do:HIP follows this behavior as it relays to
__builtin_amdgcn_ds_bpermute
described in the Vega ISA docs in the case of width == 64 under opcode63
:In practice, the code in question which triggered the segfault issued a
__shfl()
with negative source lane id but then discards the result (doesn't commit it to memory, but saves the branching for checking if said operation is valid in the first place). Because CUDA and HIP handle this is device code, I'd lean toward the code being valid (ugly nontheless).HIP-CPU should take a stand and define some kind of behavior that doesn't segfault.
_(Note that the issue is slightly graver, because HIP in general doesn't define any behavior for these operations. According to the RDNA ISA docs p. 211, the wrapping behavior for the permute operations are still based on 64, even though the underlying arch uses 32 wide wavefronts. The underdefined nature of
__shfl()
in HIP, while following CUDA behavior when using GCN/CDNA derivates but deviating for RDNA derivates is going to bite hard in the future.)_