ROCm / ROCm-OpenCL-Runtime

ROCm OpenCL Runtime

GPU fault detected in enqueue_kernel #149

Open kazuki opened 1 year ago

kazuki commented 1 year ago

Environment

  1. ThinkPad X13 Ryzen 7 6850U, Gentoo Linux, Linux 5.18.16/5.19.0, ROCm 5.0.2
  2. ThinkPad X13 Ryzen 7 6850U, Gentoo Linux, Linux 5.19.0 + docker rocm-terminal, ROCm 5.2
  3. Threadripper 3970X + Radeon RX 560, Gentoo Linux, Linux 5.18.14/5.19.0, ROCm 5.0.2

Code

__kernel void sub() {
}

__kernel void test() {
  enqueue_kernel(get_default_queue(), CLK_ENQUEUE_FLAGS_NO_WAIT, ndrange_1D(1), ^{
    sub();
  });  
}

import pyopencl as cl

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
with open("test.cl", 'r', encoding='utf8') as f:
    code = f.read()
prog = cl.Program(ctx, code).build(options="-cl-std=CL2.0 -save-temps")
prog.test(queue, [1], None)

Launching the test kernel with an NDRange enqueue always raises a SEGV in the userspace application, and the kernel log reports GPU faults:

[ 8644.417555] Command Queue T[206540]: segfault at 18 ip 00007fe88abccae4 sp 00007fe87d7c1a90 error 4 in libamdocl64.so[7fe88ab10000+13f000]
[ 8644.417566] Code: 5c 41 5d 41 5e c3 48 8d 0d b1 a1 08 00 ba 53 01 00 00 48 8d 35 a5 9e 08 00 48 8d 3d d6 a1 08 00 e8 51 3b f4 ff 90 53 48 89 fb <48> 8b 7f 18 41 89 d1 48 85 ff 74 40 4c 8b 43 20 31 c9 31 c0 eb 11
[ 8644.417570] amdgpu 0000:21:00.0: amdgpu: GPU fault detected: 146 0x0000480c for process python pid 206474 thread python pid 206474
[ 8644.417576] amdgpu 0000:21:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
[ 8644.417578] amdgpu 0000:21:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x1004800C
[ 8644.417579] amdgpu 0000:21:00.0: amdgpu: VM fault (0x0c, vmid 8, pasid 32773) at page 0, read from 'TC0' (0x54433000) (72)
[ 8644.417586] amdgpu 0000:21:00.0: amdgpu: GPU fault detected: 146 0x0000480c for process python pid 206474 thread python pid 206474
[ 8644.417588] amdgpu 0000:21:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
[ 8644.417589] amdgpu 0000:21:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x11048014
[ 8644.417590] amdgpu 0000:21:00.0: amdgpu: VM fault (0x14, vmid 8, pasid 32773) at page 0, write from 'TC0' (0x54433000) (72)
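For anyone reading the log, here is a minimal sketch of how the two VM_CONTEXT1_PROTECTION_FAULT_STATUS values appear to map onto the decoded "VM fault" lines. The bit layout is an assumption inferred from this log (it reproduces the protections, vmid, read/write and client id values printed by the driver), not something taken from the hardware documentation. The helper names decode_fault_status and decode_client_name are hypothetical.

```python
# Assumed bit layout of VM_CONTEXT1_PROTECTION_FAULT_STATUS, inferred from
# the log above; it is NOT taken from official register documentation.
PROTECTIONS_MASK = 0x000000FF  # fault/protection bits (0x0c, 0x14 in the log)
CLIENT_ID_MASK = 0x001FF000    # memory client id (72 in the log)
CLIENT_ID_SHIFT = 12
CLIENT_RW_MASK = 0x01000000    # 0 = read, 1 = write
VMID_MASK = 0x1E000000         # VM id (8 in the log)
VMID_SHIFT = 25


def decode_fault_status(status):
    """Split a raw status value into the fields shown on the 'VM fault' line."""
    return {
        "protections": status & PROTECTIONS_MASK,
        "client_id": (status & CLIENT_ID_MASK) >> CLIENT_ID_SHIFT,
        "access": "write" if status & CLIENT_RW_MASK else "read",
        "vmid": (status & VMID_MASK) >> VMID_SHIFT,
    }


def decode_client_name(mc_client):
    """Interpret the 32-bit client value as big-endian ASCII, e.g. 0x54433000."""
    return mc_client.to_bytes(4, "big").rstrip(b"\x00").decode("ascii")


if __name__ == "__main__":
    for status in (0x1004800C, 0x11048014):
        print(hex(status), decode_fault_status(status))
    print(decode_client_name(0x54433000))
```

Under that assumed layout, both faults come from the same memory client (TC0, id 72) in vmid 8 and both hit page 0: first a read, then a write.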

(The above code works with the NVIDIA CUDA OpenCL runtime.)