ROCm / clr

Increase private segment limit for dynamic scratch kernels #80

Open publixsubfan opened 1 month ago

publixsubfan commented 1 month ago

Check dynamic scratch allocations against the per-wave scratch limit in ROCR-Runtime, which is set to just under 8 MiB per wave (8387584 bytes). This raises the per-thread private segment limit from the original 16 KiB to almost 128 KiB (8387584 / 64 = 131056 bytes, assuming 64 threads per wave).

A reproducer to observe this issue is here: https://gist.github.com/publixsubfan/f5fcfe9f3d826c45a80e23acf5d88de2

Details

The ROCR-Runtime repo defines the following limit for per-wave scratch memory: https://github.com/ROCm/ROCR-Runtime/blob/master/src/core/runtime/amd_gpu_agent.cpp#L84

#define MAX_WAVE_SCRATCH 8387584  // See COMPUTE_TMPRING_SIZE.WAVESIZE
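
As a rough illustration of how the per-wave limit translates into a per-thread limit, here is a minimal sketch. The constant is the value quoted above; the helper names (`MaxPerThreadScratch`, `FitsInWaveScratch`) are hypothetical and are not taken from clr or ROCR-Runtime, and the wave size of 64 is an assumption.

```cpp
#include <cstdint>

// Value quoted from ROCR-Runtime (MAX_WAVE_SCRATCH): 8 MiB minus 1 KiB.
constexpr std::uint64_t kMaxWaveScratch = 8387584;

// Hypothetical helper: largest per-thread private segment size (in bytes)
// that still fits under the per-wave limit for a given wave size.
constexpr std::uint64_t MaxPerThreadScratch(std::uint64_t wave_size) {
  return kMaxWaveScratch / wave_size;
}

// Hypothetical helper: true if a per-thread scratch request, multiplied
// across every thread in the wave, stays within the per-wave limit.
constexpr bool FitsInWaveScratch(std::uint64_t per_thread_bytes,
                                 std::uint64_t wave_size) {
  return per_thread_bytes * wave_size <= kMaxWaveScratch;
}

// Compile-time checks of the arithmetic, assuming 64 threads per wave.
static_assert(kMaxWaveScratch == 8 * 1024 * 1024 - 1024,
              "limit is 1 KiB short of 8 MiB");
static_assert(MaxPerThreadScratch(64) == 131056,
              "16 bytes short of 128 KiB per thread");
static_assert(FitsInWaveScratch(16 * 1024, 64),
              "the original 16 KiB limit fits comfortably");
static_assert(!FitsInWaveScratch(128 * 1024, 64),
              "a full 128 KiB per thread exceeds the per-wave limit");
```

Under these assumptions, a request of exactly 128 KiB per thread would be rejected, which is why the new limit is "almost" 128 KiB rather than 128 KiB.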

Likewise, LLVM specifies its scratch limits here: https://github.com/ROCm/llvm-project/blob/d0f9aa6415cde2f7b9bc6dbf385b5c77b700edec/llvm/lib/Target/AMDGPU/GCNSubtarget.h#L308-L320