ROCm / HIP

HIP: C++ Heterogeneous-Compute Interface for Portability
https://rocmdocs.amd.com/projects/HIP/
MIT License
3.69k stars 527 forks source link

kernel with reduced register usage according to __launch_bounds__ triggers HSA error #2062

Closed jglaser closed 5 months ago

jglaser commented 4 years ago

I am porting a CUDA kernel with __launch_bounds__ specifier to HIP. The idea is to template the kernel over a range of launch bounds, thus compiling several configurations of the kernel with different numbers of registers, and find the optimal one by auto-tuning.

However, I am unable to make it work on AMD hardware. HIP changes the semantics of the second argument of the launch bounds attribute from blocks per multiprocessor to active warps per execution unit according to the documentation. https://github.com/ROCm-Developer-Tools/HIP/blob/master/docs/markdown/hip_kernel_language.md

However, looking at the implementation it looks like it's actually compatible with the CUDA one.

from include/hip/hcc_detail/hip_runtime.h:

#define launch_bounds_impl0(requiredMaxThreadsPerBlock)                                            \
    __attribute__((amdgpu_flat_work_group_size(1, requiredMaxThreadsPerBlock)))
#define launch_bounds_impl1(requiredMaxThreadsPerBlock, minBlocksPerMultiprocessor)                \
    __attribute__((amdgpu_flat_work_group_size(1, requiredMaxThreadsPerBlock),                     \
                   amdgpu_waves_per_eu(minBlocksPerMultiprocessor)))
#define select_impl_(_1, _2, impl_, ...) impl_
#define __launch_bounds__(...)                                                                     \
    select_impl_(__VA_ARGS__, launch_bounds_impl1, launch_bounds_impl0)(__VA_ARGS__)

So far so good, so I am simply leaving my CUDA code unchanged and it compiles. My kernel, without launch bounds, supports a maximum block size of 256 according to hipFuncGetAttributes(), maxThreadsPerBlock. With __launch_bounds__(1024) that changes to 1024 as expected. However, I am unable to run the kernel at the maximum block size then, I get:

### HCC STATUS_CHECK Error: Unknown Error Code (0x29) at file:mcwamp_hsa.cpp line:1226

Everything works as expected if I leave out the __launch_bounds__ attribute and clamp the block size down to 256 (or whatever hipFuncGetAttributes() returns).

Any ideas? I can point you to the source code (HOOMD-blue branch) if you need to reproduce, but I am afraid I can't come with a single kernel that is complex enough to reproduce.

jglaser commented 4 years ago

I was able to build the essential parts of the ROCm framework locally, including hcc. One offending kernel with __launch_bounds__ now fails to compile , which is perhaps somewhat more helpful than not running at all (which happened for another kernel with __launch_bounds__)

from the make output with hcc

LLVM ERROR: could not scavenge SGPR to spill in entry function
PLEASE submit a bug report to https://bugs.llvm.org/ and include the crash backtrace.
Stack dump:
0.  Program arguments: /home/users/coe0176/miniconda3/envs/hip-master-hcc/bin/llc -mtriple amdgcn-amd-amdhsa -mcpu=gfx906 -mattr=-code-object-v3 -O3 --frame-pointer=none -mattr=+sram-ecc -amdgpu-function-calls=0 -filetype=obj -o /home/users/coe0176/tmp/tmp.uBeLludxuR/kernel-gfx906.hsaco.isabin /home/users/coe0176/tmp/tmp.uBeLludxuR/kernel-gfx906.hsaco.opt.bc 
1.  Running pass 'CallGraph Pass Manager' on module '/home/users/coe0176/tmp/tmp.uBeLludxuR/kernel-gfx906.hsaco.opt.bc'.
2.  Running pass 'Prologue/Epilogue Insertion & Frame Finalization' on function '@_ZN4hpmc3gpu6kernel26clusters_insert_depletantsINS_21ShapeConvexPolyhedronELj1024EEEvPK15HIP_vector_typeIdLj4EES7_PKjb4vec3IdE4quatIdES9_S9_7Index2DS4_IjLj3EES4_IdLj3EE7Index3DjjS9_SE_jj6BoxDimPKNT_10param_typeEjjjSE_PjPS4_IjLj2EEjjjS9_'
 #0 0x000000000195302a llvm::sys::PrintStackTrace(llvm::raw_ostream&) (/home/users/coe0176/miniconda3/envs/hip-master-hcc/bin/llc+0x195302a)
 #1 0x0000000001950e64 llvm::sys::RunSignalHandlers() (/home/users/coe0176/miniconda3/envs/hip-master-hcc/bin/llc+0x1950e64)
 #2 0x0000000001950f98 SignalHandler(int) (/home/users/coe0176/miniconda3/envs/hip-master-hcc/bin/llc+0x1950f98)
 #3 0x00002aaaaacde5d0 __restore_rt (/lib64/libpthread.so.0+0xf5d0)
 #4 0x00002aaaab8592c7 raise (/lib64/libc.so.6+0x362c7)
 #5 0x00002aaaab85a9b8 abort (/lib64/libc.so.6+0x379b8)
 #6 0x00000000018d995f llvm::report_fatal_error(llvm::Twine const&, bool) (/home/users/coe0176/miniconda3/envs/hip-master-hcc/bin/llc+0x18d995f)
 #7 0x00000000018d9a68 (/home/users/coe0176/miniconda3/envs/hip-master-hcc/bin/llc+0x18d9a68)
 #8 0x000000000083010a (/home/users/coe0176/miniconda3/envs/hip-master-hcc/bin/llc+0x83010a)
 #9 0x0000000000831de3 llvm::SIRegisterInfo::eliminateFrameIndex(llvm::MachineInstrBundleIterator<llvm::MachineInstr, false>, int, unsigned int, llvm::RegScavenger*) const (/home/users/coe0176/miniconda3/envs/hip-master-hcc/bin/llc+0x831de3)
#10 0x0000000001060734 (anonymous namespace)::PEI::replaceFrameIndices(llvm::MachineBasicBlock*, llvm::MachineFunction&, int&) (.isra.248) (/home/users/coe0176/miniconda3/envs/hip-master-hcc/bin/llc+0x1060734)
#11 0x0000000001067f88 (anonymous namespace)::PEI::runOnMachineFunction(llvm::MachineFunction&) (/home/users/coe0176/miniconda3/envs/hip-master-hcc/bin/llc+0x1067f88)
#12 0x0000000000f7811d llvm::MachineFunctionPass::runOnFunction(llvm::Function&) (.part.45) (/home/users/coe0176/miniconda3/envs/hip-master-hcc/bin/llc+0xf7811d)
#13 0x00000000012dfc97 llvm::FPPassManager::runOnFunction(llvm::Function&) (/home/users/coe0176/miniconda3/envs/hip-master-hcc/bin/llc+0x12dfc97)
#14 0x0000000000c22b07 (anonymous namespace)::CGPassManager::runOnModule(llvm::Module&) (/home/users/coe0176/miniconda3/envs/hip-master-hcc/bin/llc+0xc22b07)
#15 0x00000000012e06b8 llvm::legacy::PassManagerImpl::run(llvm::Module&) (/home/users/coe0176/miniconda3/envs/hip-master-hcc/bin/llc+0x12e06b8)
#16 0x0000000000689241 compileModule(char**, llvm::LLVMContext&) (/home/users/coe0176/miniconda3/envs/hip-master-hcc/bin/llc+0x689241)
#17 0x000000000061ff22 main (/home/users/coe0176/miniconda3/envs/hip-master-hcc/bin/llc+0x61ff22)
#18 0x00002aaaab845495 __libc_start_main (/lib64/libc.so.6+0x22495)
#19 0x0000000000683403 _start (/home/users/coe0176/miniconda3/envs/hip-master-hcc/bin/llc+0x683403)
/home/users/coe0176/miniconda3/envs/hip-master-hcc/bin/clamp-device: line 262: 49930 Aborted                 $LLC -mtriple amdgcn-amd-amdhsa -mcpu=$AMDGPU_TARGET $CODE_OBJECT_FORMAT $HCC_OPT $KMOPTLLC -amdgpu-function-calls=$AMDGPU_FUNC_CALLS -filetype=obj -o $2.isabin $2.opt.bc
Generating AMD GCN kernel failed in llc for target: gfx906
$ hcc --version
HCC clang version 11.0.0 (/home/users/coe0176/hcc/llvm-project/clang b4512f1f1817d745b0dddc05ec09728eb2593183) (based on HCC 3.1.20162-df05d4f-b4512f1 )
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /home/users/coe0176/miniconda3/envs/hip-master-hcc/bin
ppanchad-amd commented 6 months ago

@jglaser Sorry for the lack of response. Please try latest ROCm 6.0.2 (HIP 6.0.32831) to see if your issue still exists? If resolved, please close the ticket. Thanks.