Increase private segment limit for dynamic scratch kernels

Check dynamic scratch allocations against the per-wave scratch limit defined in ROCR-Runtime, which is just under 8 MiB per wave (8387584 bytes). Assuming 64 threads per wave, this raises the per-thread private segment limit from 16 KiB to almost 128 KiB.

A reproducer to observe this issue is here: https://gist.github.com/publixsubfan/f5fcfe9f3d826c45a80e23acf5d88de2

Details

The ROCR-Runtime repo defines the following limit for per-wave scratch memory: https://github.com/ROCm/ROCR-Runtime/blob/master/src/core/runtime/amd_gpu_agent.cpp#L84

#define MAX_WAVE_SCRATCH 8387584  // See COMPUTE_TMPRING_SIZE.WAVESIZE

Likewise, LLVM specifies the following per-wave scratch limits: https://github.com/ROCm/llvm-project/blob/d0f9aa6415cde2f7b9bc6dbf385b5c77b700edec/llvm/lib/Target/AMDGPU/GCNSubtarget.h#L308-L320

(2^13-1) * 1 KiB for GFX10 and below, which matches the MAX_WAVE_SCRATCH value (8387584 bytes)
(2^15-1) * 256 B for GFX11, which is slightly larger than MAX_WAVE_SCRATCH (8388352 bytes)
(2^18-1) * 256 B for GFX12 and above, which is a little under 64 MiB/wave (67108608 bytes)
