Open publixsubfan opened 6 months ago
Check dynamic scratch allocations against the per-wave scratch limit in ROCR-Runtime, which is set to just under 8 MiB per wave. This raises the per-thread private segment limit from the original 16 KiB to almost 128 KiB.
A reproducer to observe this issue is here: https://gist.github.com/publixsubfan/f5fcfe9f3d826c45a80e23acf5d88de2
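To make the 16 KiB versus "almost 128 KiB" figures concrete, here is a minimal sketch of the arithmetic. It assumes 64 threads per wave (wave64), which is my assumption and not something stated above. `fitsPerWaveLimit` and `maxPerThreadScratch` are hypothetical helpers for illustration, not functions from ROCR-Runtime or this change.

```cpp
#include <cstdint>

// Per-wave scratch limit, value copied from ROCR-Runtime's MAX_WAVE_SCRATCH.
constexpr std::uint64_t kMaxWaveScratch = 8387584;

// Assumption: 64 threads per wave (wave64).
constexpr std::uint64_t kAssumedWaveSize = 64;

// Largest per-thread scratch size (bytes) that still fits the per-wave limit.
// Returns 0 for an invalid wave size of 0.
constexpr std::uint64_t maxPerThreadScratch(std::uint64_t waveSize) {
  return waveSize == 0 ? 0 : kMaxWaveScratch / waveSize;
}

// Hypothetical check: does a dynamic per-thread scratch request fit within
// the per-wave limit? Division is used instead of multiplication so the
// comparison cannot overflow.
constexpr bool fitsPerWaveLimit(std::uint64_t perThreadBytes,
                                std::uint64_t waveSize) {
  return waveSize != 0 && perThreadBytes <= kMaxWaveScratch / waveSize;
}

// 8387584 / 64 = 131056 bytes, i.e. 16 bytes short of 128 KiB.
static_assert(maxPerThreadScratch(kAssumedWaveSize) == 131056,
              "per-thread limit is just under 128 KiB");
static_assert(fitsPerWaveLimit(16 * 1024, kAssumedWaveSize),
              "the old 16 KiB limit still fits");
static_assert(!fitsPerWaveLimit(128 * 1024, kAssumedWaveSize),
              "a full 128 KiB per thread exceeds the per-wave limit");
```

Under the wave64 assumption, a request of exactly 128 KiB per thread would be rejected, while anything up to 131056 bytes would pass.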
Details
The ROCR-Runtime repo defines the following limit for per-wave scratch memory: https://github.com/ROCm/ROCR-Runtime/blob/master/src/core/runtime/amd_gpu_agent.cpp#L84
#define MAX_WAVE_SCRATCH 8387584 // See COMPUTE_TMPRING_SIZE.WAVESIZE
Likewise, LLVM specifies the following per-wave scratch limits: https://github.com/ROCm/llvm-project/blob/d0f9aa6415cde2f7b9bc6dbf385b5c77b700edec/llvm/lib/Target/AMDGPU/GCNSubtarget.h#L308-L320
(2^13-1) * 1 KiB for GFX10 and below; this matches the MAX_WAVE_SCRATCH value.
(2^15-1) * 256 B for GFX11; this is a little larger than MAX_WAVE_SCRATCH.
(2^18-1) * 256 B for GFX12 and above, which is a little under 64 MiB/wave.
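As a sanity check on the comparisons above, the three LLVM expressions can be evaluated and compared against the ROCR-Runtime value. This is only a sketch: the `GfxGen` enum and `llvmMaxWaveScratch` function are hypothetical names invented for illustration and do not mirror the actual GCNSubtarget.h code.

```cpp
#include <cstdint>

// Value copied from ROCR-Runtime's MAX_WAVE_SCRATCH.
constexpr std::uint64_t kMaxWaveScratch = 8387584;

// Hypothetical grouping of GPU generations, matching the three cases above.
enum class GfxGen { Gfx10OrOlder, Gfx11, Gfx12OrNewer };

// Per-wave scratch limit in bytes for each group, using the expressions
// quoted from LLVM above.
constexpr std::uint64_t llvmMaxWaveScratch(GfxGen gen) {
  switch (gen) {
    case GfxGen::Gfx10OrOlder:
      return ((1ull << 13) - 1) * 1024;  // (2^13-1) * 1 KiB
    case GfxGen::Gfx11:
      return ((1ull << 15) - 1) * 256;   // (2^15-1) * 256 B
    case GfxGen::Gfx12OrNewer:
      return ((1ull << 18) - 1) * 256;   // (2^18-1) * 256 B
  }
  return 0;  // unreachable for valid enum values
}

constexpr std::uint64_t k64MiB = 64ull * 1024 * 1024;

// GFX10 and below: identical to MAX_WAVE_SCRATCH.
static_assert(llvmMaxWaveScratch(GfxGen::Gfx10OrOlder) == kMaxWaveScratch,
              "matches MAX_WAVE_SCRATCH");
// GFX11: 768 bytes larger than MAX_WAVE_SCRATCH.
static_assert(llvmMaxWaveScratch(GfxGen::Gfx11) == kMaxWaveScratch + 768,
              "slightly larger than MAX_WAVE_SCRATCH");
// GFX12 and above: 256 bytes short of 64 MiB.
static_assert(llvmMaxWaveScratch(GfxGen::Gfx12OrNewer) == k64MiB - 256,
              "slightly under 64 MiB");
```

The GFX11 and GFX12 limits come out above MAX_WAVE_SCRATCH, so on those targets the ROCR-Runtime value is the tighter of the two bounds.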