Crashes on device enumeration (clGetPlatformIDs)

inducer commented 6 years ago

The crash happens inside clGetPlatformIDs, with the following backtrace:

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f690e54f60a in ?? () from /opt/rocm/libhsakmt/lib/libhsakmt.so.1
#1  0x00007f690e553897 in hsaKmtAllocMemory () from /opt/rocm/libhsakmt/lib/libhsakmt.so.1
#2  0x00007f690e556fac in ?? () from /opt/rocm/libhsakmt/lib/libhsakmt.so.1
#3  0x00007f690e54d219 in hsaKmtCreateEvent () from /opt/rocm/libhsakmt/lib/libhsakmt.so.1
#4  0x00007f690e7a9ed3 in ?? () from /opt/rocm/hsa/lib/libhsa-runtime64.so.1
#5  0x00007f690e7b3b78 in ?? () from /opt/rocm/hsa/lib/libhsa-runtime64.so.1
#6  0x00007f690e7b58da in ?? () from /opt/rocm/hsa/lib/libhsa-runtime64.so.1
#7  0x00007f690e797e45 in ?? () from /opt/rocm/hsa/lib/libhsa-runtime64.so.1
#8  0x00007f690e797ea4 in ?? () from /opt/rocm/hsa/lib/libhsa-runtime64.so.1
#9  0x00007f690e7b2d4e in ?? () from /opt/rocm/hsa/lib/libhsa-runtime64.so.1
#10 0x00007f690e798dca in ?? () from /opt/rocm/hsa/lib/libhsa-runtime64.so.1
#11 0x00007f690ef7a68d in ?? () from /opt/rocm/opencl/lib/x86_64/libamdocl64.so
#12 0x00007f690ef5e633 in ?? () from /opt/rocm/opencl/lib/x86_64/libamdocl64.so
#13 0x00007f690ef5cd77 in ?? () from /opt/rocm/opencl/lib/x86_64/libamdocl64.so
#14 0x00007f690ef399d2 in clIcdGetPlatformIDsKHR () from /opt/rocm/opencl/lib/x86_64/libamdocl64.so
#15 0x00007f691a82cf5b in khrIcdVendorAdd () from /opt/rocm/opencl/lib/x86_64/libOpenCL.so.1
#16 0x00007f691a82eed7 in khrIcdOsVendorsEnumerate () from /opt/rocm/opencl/lib/x86_64/libOpenCL.so.1
#17 0x00007f696fa55ea7 in __pthread_once_slow (once_control=0x7f691aa30b08, init_routine=0x7f691a82ed30 <khrIcdOsVendorsEnumerate>) at pthread_once.c:116
#18 0x00007f691a82d4f1 in clGetPlatformIDs () from /opt/rocm/opencl/lib/x86_64/libOpenCL.so.1
#19 0x00007f691aa53c16 in _call_func<int (*)(unsigned int, _cl_platform_id**, unsigned int*), 0, 1, 2, int&, std::nullptr_t&, unsigned int*> (args=..., func=<optimized out>) at src/c_wrapper/function.h:38
#20 call_tuple<int (*&)(unsigned int, _cl_platform_id**, unsigned int*), std::tuple<int&, std::nullptr_t&, unsigned int*> > (args=<optimized out>, func=<synthetic pointer>: <optimized out>)
    at src/c_wrapper/function.h:49
#21 ArgPack<CLArg, int, decltype(nullptr), ArgBuffer<unsigned int, (ArgType)0> >::call<__CLArgGetter, int (*)(unsigned int, _cl_platform_id**, unsigned int*)>(int (*)(unsigned int, _cl_platform_id**, unsigned int*))
    (func=<optimized out>, this=<synthetic pointer>) at src/c_wrapper/function.h:110
#22 CLArgPack<int, decltype(nullptr), ArgBuffer<unsigned int, (ArgType)0> >::clcall<int (*)(unsigned int, _cl_platform_id**, unsigned int*)>(int (*)(unsigned int, _cl_platform_id**, unsigned int*), char const*) (
    name=0x7f691aa97bd2 "clGetPlatformIDs", func=<optimized out>, this=<synthetic pointer>) at src/c_wrapper/error.h:211
#23 call_guarded<int, std::nullptr_t, ArgBuffer<unsigned int, (ArgType)0>, unsigned int, _cl_platform_id**, unsigned int*> (name=0x7f691aa97bd2 "clGetPlatformIDs", func=<optimized out>) at src/c_wrapper/error.h:243
#24 <lambda()>::operator() (__closure=<synthetic pointer>) at src/c_wrapper/platform.cpp:63
#25 c_handle_error<get_platforms(clbase***, uint32_t*)::<lambda()> > (func=...) at src/c_wrapper/error.h:296
#26 get_platforms (_platforms=0x55cc6bf2b940, num_platforms=0x55cc6bf29fd0) at src/c_wrapper/platform.cpp:69
#27 0x00007f691aa4304d in _cffi_f_get_platforms (self=<optimized out>, args=<optimized out>) at build/temp.linux-x86_64-2.7/pyopencl._cffi.cpp:6684
#28 0x000055cc682e069a in PyEval_EvalFrameEx ()
#29 0x000055cc682ddc7a in PyEval_EvalCodeEx ()
#30 0x000055cc682e5db4 in PyEval_EvalFrameEx ()
(snip)

-- System Information:
Debian Release: buster/sid
  APT prefers testing
  APT policy: (990, 'testing'), (500, 'oldstable-updates'), (500, 'unstable'), (500, 'oldstable'), (1, 'experimental')
Architecture: amd64 (x86_64)

Kernel: Linux 4.13.0-1-amd64 (SMP w/32 CPU cores)
Locale: LANG=en_US.UTF-8, LC_CTYPE=en_US.UTF-8 (charmap=UTF-8), LANGUAGE=en_US.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/dash
Init: systemd (via /run/systemd/system)
LSM: AppArmor: enabled

Versions of packages rocm-opencl depends on:
ii  hsa-rocr-dev  1.1.8-15-ge851b7a

rocm-opencl recommends no packages.

rocm-opencl suggests no packages.

-- no debconf information

fxkamd commented 6 years ago

If you're logged in remotely, make sure your user account is in the "video" group. Otherwise you don't have access to the required graphics driver device nodes. This is currently handled poorly in the Thunk (libhsakmt).
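A quick way to check the group membership described above is sketched here. It is a minimal illustration, not part of ROCm: the helper name `has_group` is made up, and the check looks at the groups of the current process, so a group added after login only shows up once you log in again.

```python
import grp
import os


def has_group(group_name, process_gids, lookup=grp.getgrnam):
    """Return True if the group exists and its gid is among process_gids.

    `lookup` defaults to the system group database; it is a parameter only
    so the function can be exercised without a real "video" group.
    """
    try:
        gid = lookup(group_name).gr_gid
    except KeyError:
        # The group does not exist on this system at all.
        return False
    return gid in process_gids


if __name__ == "__main__":
    # Supplementary groups plus the effective primary group of this process.
    gids = set(os.getgroups()) | {os.getegid()}
    print("in video group:", has_group("video", gids))
```

If this prints `False` in a remote session, adding the account to the group and logging in again is the first thing to try.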

inducer commented 6 years ago

Thanks for the heads-up. FWIW, I run a small cluster of scientific computing research machines, and this issue crashes unrelated MPI code from users who don't use OpenCL directly (because OpenMPI uses hwloc, and hwloc appears to enumerate OpenCL devices). I would like to offer AMD compute as a capability, but this makes it kind of hard.
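One way to find out whether a given machine is affected, without taking down the calling process, is to run the enumeration in a child process and look at how it exits. The following is only a sketch under assumptions: `run_isolated` and `PROBE` are illustrative names, and the probe assumes the ICD loader is installed as `libOpenCL.so.1`.

```python
import signal
import subprocess
import sys

# Child program: load the ICD loader and ask only for the platform count.
# Passing num_entries=0 and platforms=NULL is allowed when num_platforms is set.
PROBE = """
import ctypes
cl = ctypes.CDLL("libOpenCL.so.1")  # assumed library name
n = ctypes.c_uint(0)
err = cl.clGetPlatformIDs(0, None, ctypes.byref(n))
print(err, n.value)
"""


def run_isolated(code):
    """Run `code` in a child interpreter.

    Returns (signal_name, stdout), where signal_name is None unless the
    child was killed by a signal such as SIGSEGV.
    """
    proc = subprocess.run(
        [sys.executable, "-c", code], capture_output=True, text=True
    )
    if proc.returncode < 0:
        # A negative return code means "terminated by signal -returncode".
        return signal.Signals(-proc.returncode).name, proc.stdout
    return None, proc.stdout
```

Calling `run_isolated(PROBE)` from a job prologue would let a site detect the crash up front. It does not stop hwloc from enumerating OpenCL devices inside the MPI processes themselves.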

shibe2 commented 5 years ago

I'm not sure if it's the same bug, but I get a SIGSEGV in clGetPlatformIDs.

ROCm 2.3.0
Linux 5.0.9 with upstream amdgpu.
GCC 8.3.0

@fxkamd I straced the crashing program. It successfully opens /dev/kfd and /dev/dri/renderD128, and no syscall fails with EACCES or EPERM. I conclude that this is unrelated to groups and permissions.
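The same check can be automated over a saved strace log. This is a minimal sketch: the function name is made up, and the sample lines in the comment only illustrate strace's usual output format; they are not taken from the actual trace.

```python
import re

# Matches a failed syscall whose errno is EACCES or EPERM, for example:
#   openat(AT_FDCWD, "/dev/kfd", O_RDWR) = -1 EACCES (Permission denied)
# A successful call such as
#   openat(AT_FDCWD, "/dev/kfd", O_RDWR) = 3
# does not match.
_DENIED = re.compile(r"=\s*-1\s+(EACCES|EPERM)\b")


def permission_failures(lines):
    """Return the strace lines that report a permission-related failure."""
    return [line.rstrip("\n") for line in lines if _DENIED.search(line)]
```

A log written with `strace -f -o trace.log <program>` can be passed in as `permission_failures(open("trace.log"))`. An empty result matches the observation above that permissions are not the problem.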

Stack trace:

0x0000000000000000
__gthread_create
std::thread::_M_start_thread
std::thread::thread<(anonymous namespace)::ThreadPoolExecutor::ThreadPoolExecutor(unsigned int)::<lambda()> >
(anonymous namespace)::ThreadPoolExecutor::ThreadPoolExecutor
(anonymous namespace)::Executor::getDefaultExecutor
llvm::parallel::detail::TaskGroup::spawn(std::function<void ()>)
llvm::parallel::detail::parallel_for_each<__gnu_cxx::__normal_iterator<lld::elf::InputSectionBase**, std::vector<lld::elf::InputSectionBase*, std::allocator<lld::elf::InputSectionBase*> > >, lld::elf::splitSections<llvm::object::ELFType<(llvm::support::endianness)1, true> >()::{lambda(lld::elf::InputSectionBase*)#1}>(__gnu_cxx::__normal_iterator<lld::elf::InputSectionBase**, std::vector<lld::elf::InputSectionBase*, std::allocator<lld::elf::InputSectionBase*> > >, lld::elf::splitSections<llvm::object::ELFType<(llvm::support::endianness)1, true> >()::{lambda(lld::elf::InputSectionBase*)#1}, lld::elf::splitSections<llvm::object::ELFType<(llvm::support::endianness)1, true> >()::{lambda(lld::elf::InputSectionBase*)#1})
llvm::parallel::for_each<__gnu_cxx::__normal_iterator<lld::elf::InputSectionBase**, std::vector<lld::elf::InputSectionBase*, std::allocator<lld::elf::InputSectionBase*> > >, lld::elf::splitSections<llvm::object::ELFType<(llvm::support::endianness)1, true> >()::{lambda(lld::elf::InputSectionBase*)#1}>(llvm::parallel::parallel_execution_policy, __gnu_cxx::__normal_iterator<lld::elf::InputSectionBase**, std::vector<lld::elf::InputSectionBase*, std::allocator<lld::elf::InputSectionBase*> > >, llvm::parallel::parallel_execution_policy, lld::elf::splitSections<llvm::object::ELFType<(llvm::support::endianness)1, true> >()::{lambda(lld::elf::InputSectionBase*)#1})
lld::parallelForEach<std::vector<lld::elf::InputSectionBase*, std::allocator<lld::elf::InputSectionBase*> >&, lld::elf::splitSections<llvm::object::ELFType<(llvm::support::endianness)1, true> >()::{lambda(lld::elf::InputSectionBase*)#1}>(std::vector<lld::elf::InputSectionBase*, std::allocator<lld::elf::InputSectionBase*> >&, lld::elf::splitSections<llvm::object::ELFType<(llvm::support::endianness)1, true> >()::{lambda(lld::elf::InputSectionBase*)#1})
lld::elf::splitSections<llvm::object::ELFType<(llvm::support::endianness)1, true> >
lld::elf::LinkerDriver::link<llvm::object::ELFType<(llvm::support::endianness)1, true> >
lld::elf::LinkerDriver::main
lld::elf::link
amd::opencl_driver::AMDGPUCompiler::CompileAndLinkExecutable
amd::opencl_driver::AMDGPUCompiler::CompileAndLinkExecutable
amd::CacheCompilation::compileAndLinkExecutable
device::Program::linkImplLC
device::Program::build
amd::Program::build
amd::Device::BlitProgram::create
roc::Device::create
roc::Device::init
amd::Device::init
amd::Runtime::init
clIcdGetPlatformIDsKHR
??
clGetPlatformIDs

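It looks like the linker calls through a null function pointer: the top frame is at address 0x0, reached from __gthread_create while lld starts its thread pool.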

ROCm 2.0.0 works on the same system, so bisecting may help, although bisecting LLVM is going to be painfully slow.
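The bisection itself is just a binary search between a known-good and a known-bad build. A rough sketch, assuming the breakage persists once introduced; `first_bad` and the `is_bad` callback (which would build and test one candidate) are hypothetical:

```python
def first_bad(candidates, is_bad):
    """Return the first candidate for which is_bad(candidate) is true.

    Assumes candidates[0] is known good, candidates[-1] is known bad,
    and every candidate after the first bad one is also bad.
    """
    lo, hi = 0, len(candidates) - 1  # lo: known good, hi: known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(candidates[mid]):
            hi = mid
        else:
            lo = mid
    return candidates[hi]
```

Each step costs one build and test of the candidate, which is where the time goes, so the number of steps grows only with the logarithm of the number of candidates.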

ROCm / ROCm-OpenCL-Runtime, issue #56