ROCm / HIP

HIP: C++ Heterogeneous-Compute Interface for Portability
https://rocmdocs.amd.com/projects/HIP/
MIT License

System freezes after error: 'hipErrorOutOfMemory'(2) at square.cpp:76 #2132

Closed devurandom closed 6 months ago

devurandom commented 4 years ago

System information

❯ inxi -GSC -xx
System:    Host: ernie Kernel: 5.7.9 x86_64 bits: 64 compiler: gcc v: 10.1.0 Desktop: N/A wm: kwin_x11 dm: SDDM 
           Distro: Gentoo Base System release 2.7 
CPU:       Topology: Quad Core model: AMD Ryzen 5 2400G with Radeon Vega Graphics bits: 64 type: MT MCP arch: Zen 
           L2 cache: 2048 KiB 
           flags: avx avx2 lm nx pae sse sse2 sse3 sse4_1 sse4_2 sse4a ssse3 svm bogomips: 57490 
           Speed: 1706 MHz min/max: 1600/3600 MHz Core speeds (MHz): 1: 1706 2: 2587 3: 3209 4: 1675 5: 1708 6: 3318 7: 2136 
           8: 1592 
Graphics:  Device-1: Advanced Micro Devices [AMD/ATI] Baffin [Radeon RX 550 640SP / RX 560/560X] vendor: ASUSTeK 
           driver: amdgpu v: kernel bus ID: 01:00.0 chip ID: 1002:67ff 
           Device-2: AMD Raven Ridge [Radeon Vega Series / Radeon Vega Mobile Series] vendor: ASUSTeK driver: amdgpu v: kernel 
           bus ID: 0a:00.0 chip ID: 1002:15dd 
           Display: server: X.Org 1.20.8 driver: amdgpu compositor: kwin_x11 resolution: 2560x1080~60Hz 
           OpenGL: renderer: AMD RAVEN (DRM 3.37.0 5.7.9 LLVM 10.0.0) v: 4.6 Mesa 20.1.3 direct render: Yes

Versions:

dev-libs/rocclr-3.5.0-r1
dev-libs/rocm-comgr-3.5.0
dev-libs/rocm-device-libs-3.5.1
dev-libs/rocm-opencl-runtime-3.5.0
dev-libs/rocr-runtime-3.5.0
dev-libs/roct-thunk-interface-3.6.0
dev-util/rocm-cmake-3.5.0
dev-util/rocminfo-3.5.0
sys-devel/llvm-roc-3.6.0

Problem

  1. I log in to the node via SSH (because of the graphical system freeze, see below).
  2. I build the square example:
    ❯ make HIP_PATH=/usr HIPCC_VERBOSE=1
    /usr/bin/hipify-perl square.cu > square.cpp
    /usr/bin/hipcc  square.cpp -o square.out
    LoadLib(libhsa-ext-image64.so.1) failed: libhsa-ext-image64.so.1: cannot open shared object file: No such file or directory
    rocminfo: /tmp/portage/dev-libs/rocr-runtime-3.5.0/work/ROCR-Runtime-rocm-3.5.0/src/core/runtime/amd_memory_region.cpp:72: static void amd::MemoryRegion::FreeKfdMemory(void*, size_t): Assertion `status == HSAKMT_STATUS_SUCCESS' failed.
    Warning: The specified HIP target: gfx902 is unknown. Correct compilation is not guaranteed.
    hipcc-cmd: /usr/lib/llvm/roc/bin/clang++ -D__HIP_ROCclr__ -std=c++11 -isystem /usr/lib/llvm/roc/lib/clang/11.0.0/include/.. -D__HIP_ROCclr__ -D__HIP_ARCH_GFX902__=1  --cuda-gpu-arch=gfx902 -D__HIP_ARCH_GFX803__=1  --cuda-gpu-arch=gfx803 -O3 -mllvm -amdgpu-early-inline-all=true -mllvm -amdgpu-function-calls=false  --hip-device-lib-path=/usr/lib -fhip-new-launch-api  -L/usr/lib64 -O3 -lgcc_s -lgcc -lpthread -lm  -x hip square.cpp -o square.out -Wl,--enable-new-dtags -Wl,--rpath=/usr/lib64:/usr/lib -lhip_hcc 
  3. I execute the example:
    ❯ ./square.out 
    LoadLib(libhsa-ext-image64.so.1) failed: libhsa-ext-image64.so.1: cannot open shared object file: No such file or directory
    LoadLib(libhsa-amd-aqlprofile64.so) failed: libhsa-amd-aqlprofile64.so: cannot open shared object file: No such file or directory
    LoadLib(libhsa-amd-aqlprofile64.so) failed: libhsa-amd-aqlprofile64.so: cannot open shared object file: No such file or directory
    info: running on device AMD Ryzen 5 2400G with Radeon Vega Graphics
    info: allocate host mem (  7.63 MB)
    info: allocate device mem (  7.63 MB)
    error: 'hipErrorOutOfMemory'(2) at square.cpp:76

Afterwards my graphical system freezes and I need to REISUB (reboot via the Magic SysRq key sequence).

This is reproducible every time I run ./square.out.

Regression

I never got HIP to work on this system. Still working on it. :)

Logs

Excerpts from the system journal of my last boot:

Jul 22 07:54:38 ernie kernel: [drm] PCIE GART of 256M enabled (table at 0x000000F400000000).
Jul 22 07:54:39 ernie kernel: [drm] UVD and UVD ENC initialized successfully.
Jul 22 07:54:39 ernie kernel: [drm] VCE initialized successfully.
Jul 22 07:54:39 ernie kernel: amdgpu 0000:01:00.0: [drm] Cannot find any crtc or sizes
Jul 22 07:54:39 ernie kernel: Alloc host visible vram on small bar is not allowed
Jul 22 07:54:39 ernie systemd[1]: Started Process Core Dump (PID 1756750/UID 0).
Jul 22 07:54:39 ernie kernel: Evicting PASID 0x8026 queues
Jul 22 07:54:39 ernie kernel: Evicting PASID 0x8026 queues
Jul 22 07:54:39 ernie systemd-coredump[1756754]: Process 1756741 (rocminfo) of user 1000 dumped core.

                                                 Stack trace of thread 1756741:
                                                 #0  0x00007f71b44f2f91 raise (libc.so.6 + 0x38f91)
                                                 #1  0x00007f71b44dc537 abort (libc.so.6 + 0x22537)
                                                 #2  0x00007f71b44dc40f __assert_fail_base.cold (libc.so.6 + 0x2240f)
                                                 #3  0x00007f71b44eb3e2 __assert_fail (libc.so.6 + 0x313e2)
                                                 #4  0x00007f71b496c7d9 _ZN3amd12MemoryRegion13FreeKfdMemoryEPvm (libhsa-runtime64.so.1 + 0x4b7d9)
                                                 #5  0x00007f71b496d60d _ZNK3amd12MemoryRegion4FreeEPvm (libhsa-runtime64.so.1 + 0x4c60d)
                                                 #6  0x00007f71b49aff19 _ZN4core7Runtime10FreeMemoryEPv (libhsa-runtime64.so.1 + 0x8ef19)
                                                 #7  0x00007f71b49af568 _ZZN4core7Runtime13RegisterAgentEPNS_5AgentEENKUlPvE0_clES3_ (libhsa-runtime64.so.1 + 0x8e568)
                                                 #8  0x00007f71b49b7546 _ZSt13__invoke_implIvRZN4core7Runtime13RegisterAgentEPNS0_5AgentEEUlPvE0_JS4_EET_St14__invoke_otherOT0_DpOT1_ (libhsa-runtime64.so.1 + 0x96546)
                                                 #9  0x00007f71b49b725c _ZSt10__invoke_rIvRZN4core7Runtime13RegisterAgentEPNS0_5AgentEEUlPvE0_JS4_EENSt9enable_ifIXsrSt6__and_IJSt7is_voidIT_ESt14__is_invocableIT0_JDpT1_EEEE5valueESA_E4typeEOSD_DpOSE_ (libhsa-runtime64.so.1 + 0x9625c)
                                                 #10 0x00007f71b49b6d65 _ZNSt17_Function_handlerIFvPvEZN4core7Runtime13RegisterAgentEPNS2_5AgentEEUlS0_E0_E9_M_invokeERKSt9_Any_dataOS0_ (libhsa-runtime64.so.1 + 0x95d65)
                                                 #11 0x00007f71b4940087 _ZNKSt8functionIFvPvEEclES0_ (libhsa-runtime64.so.1 + 0x1f087)
                                                 #12 0x00007f71b4955454 _ZNK3amd8GpuAgent13ReleaseShaderEPvm (libhsa-runtime64.so.1 + 0x34454)
                                                 #13 0x00007f71b49547cb _ZN3amd8GpuAgentD2Ev (libhsa-runtime64.so.1 + 0x337cb)
                                                 #14 0x00007f71b4954960 _ZN3amd8GpuAgentD0Ev (libhsa-runtime64.so.1 + 0x33960)
                                                 #15 0x00007f71b49bd764 _ZNK12DeleteObjectclIN4core5AgentEEEvPKT_ (libhsa-runtime64.so.1 + 0x9c764)
                                                 #16 0x00007f71b49ba6a4 _ZSt8for_eachIN9__gnu_cxx17__normal_iteratorIPPN4core5AgentESt6vectorIS4_SaIS4_EEEE12DeleteObjectET0_T_SC_SB_ (libhsa-runtime64.so.1 + 0x996a4)
                                                 #17 0x00007f71b49b4c83 _ZN4core7Runtime6UnloadEv (libhsa-runtime64.so.1 + 0x93c83)
                                                 #18 0x00007f71b49af3a3 _ZN4core7Runtime7ReleaseEv (libhsa-runtime64.so.1 + 0x8e3a3)
                                                 #19 0x00007f71b4987452 _ZN3HSA13hsa_shut_downEv (libhsa-runtime64.so.1 + 0x66452)
                                                 #20 0x00007f71b49d1f92 hsa_shut_down (libhsa-runtime64.so.1 + 0xb0f92)
                                                 #21 0x00005620d9aaa931 main (rocminfo + 0x8931)
                                                 #22 0x00007f71b44ddcaa __libc_start_main (libc.so.6 + 0x23caa)
                                                 #23 0x00005620d9aa40ba _start (rocminfo + 0x20ba)

                                                 Stack trace of thread 1756749:
                                                 #0  0x00007f71b45af957 ioctl (libc.so.6 + 0xf5957)
                                                 #1  0x00007f71b447f800 kmtIoctl (libhsakmt.so.1 + 0xb800)
                                                 #2  0x00007f71b447991d hsaKmtWaitOnMultipleEvents (libhsakmt.so.1 + 0x591d)
                                                 #3  0x00007f71b49cba66 _ZN4core6Signal7WaitAnyEjPK12hsa_signal_sPK22hsa_signal_condition_tPKlm16hsa_wait_state_tPl (libhsa-runtime64.so.1 + 0xaaa66)
                                                 #4  0x00007f71b49972fa _ZN3AMD23hsa_amd_signal_wait_anyEjP12hsa_signal_sP22hsa_signal_condition_tPlm16hsa_wait_state_tS4_ (libhsa-runtime64.so.1 + 0x762fa)
                                                 #5  0x00007f71b49b3286 _ZN4core7Runtime15AsyncEventsLoopEPv (libhsa-runtime64.so.1 + 0x92286)
                                                 #6  0x00007f71b4936597 _ZN2os16ThreadTrampolineEPv (libhsa-runtime64.so.1 + 0x15597)
                                                 #7  0x00007f71b4688fea start_thread (libpthread.so.0 + 0x7fea)
                                                 #8  0x00007f71b45b8edf __clone (libc.so.6 + 0xfeedf)
Jul 22 08:00:23 ernie kernel: [drm] PCIE GART of 256M enabled (table at 0x000000F400000000).
Jul 22 08:00:23 ernie kernel: [drm] UVD and UVD ENC initialized successfully.
Jul 22 08:00:23 ernie kernel: [drm] VCE initialized successfully.
Jul 22 08:00:23 ernie kernel: amdgpu 0000:01:00.0: [drm] Cannot find any crtc or sizes
Jul 22 08:00:23 ernie kernel: Alloc host visible vram on small bar is not allowed
Jul 22 08:00:23 ernie systemd[1]: Started Process Core Dump (PID 1764207/UID 0).
Jul 22 08:00:24 ernie kernel: Evicting PASID 0x8026 queues
Jul 22 08:00:24 ernie kernel: Evicting PASID 0x8026 queues
Jul 22 08:00:24 ernie systemd-coredump[1764209]: Process 1764173 (rocminfo) of user 1000 dumped core.

                                                 Stack trace of thread 1764173:
                                                 #0  0x00007f6d248aef91 raise (libc.so.6 + 0x38f91)
                                                 #1  0x00007f6d24898537 abort (libc.so.6 + 0x22537)
                                                 #2  0x00007f6d2489840f __assert_fail_base.cold (libc.so.6 + 0x2240f)
                                                 #3  0x00007f6d248a73e2 __assert_fail (libc.so.6 + 0x313e2)
                                                 #4  0x00007f6d24d287d9 _ZN3amd12MemoryRegion13FreeKfdMemoryEPvm (libhsa-runtime64.so.1 + 0x4b7d9)
                                                 #5  0x00007f6d24d2960d _ZNK3amd12MemoryRegion4FreeEPvm (libhsa-runtime64.so.1 + 0x4c60d)
                                                 #6  0x00007f6d24d6bf19 _ZN4core7Runtime10FreeMemoryEPv (libhsa-runtime64.so.1 + 0x8ef19)
                                                 #7  0x00007f6d24d6b568 _ZZN4core7Runtime13RegisterAgentEPNS_5AgentEENKUlPvE0_clES3_ (libhsa-runtime64.so.1 + 0x8e568)
                                                 #8  0x00007f6d24d73546 _ZSt13__invoke_implIvRZN4core7Runtime13RegisterAgentEPNS0_5AgentEEUlPvE0_JS4_EET_St14__invoke_otherOT0_DpOT1_ (libhsa-runtime64.so.1 + 0x96546)
                                                 #9  0x00007f6d24d7325c _ZSt10__invoke_rIvRZN4core7Runtime13RegisterAgentEPNS0_5AgentEEUlPvE0_JS4_EENSt9enable_ifIXsrSt6__and_IJSt7is_voidIT_ESt14__is_invocableIT0_JDpT1_EEEE5valueESA_E4typeEOSD_DpOSE_ (libhsa-runtime64.so.1 + 0x9625c)
                                                 #10 0x00007f6d24d72d65 _ZNSt17_Function_handlerIFvPvEZN4core7Runtime13RegisterAgentEPNS2_5AgentEEUlS0_E0_E9_M_invokeERKSt9_Any_dataOS0_ (libhsa-runtime64.so.1 + 0x95d65)
                                                 #11 0x00007f6d24cfc087 _ZNKSt8functionIFvPvEEclES0_ (libhsa-runtime64.so.1 + 0x1f087)
                                                 #12 0x00007f6d24d11454 _ZNK3amd8GpuAgent13ReleaseShaderEPvm (libhsa-runtime64.so.1 + 0x34454)
                                                 #13 0x00007f6d24d107cb _ZN3amd8GpuAgentD2Ev (libhsa-runtime64.so.1 + 0x337cb)
                                                 #14 0x00007f6d24d10960 _ZN3amd8GpuAgentD0Ev (libhsa-runtime64.so.1 + 0x33960)
                                                 #15 0x00007f6d24d79764 _ZNK12DeleteObjectclIN4core5AgentEEEvPKT_ (libhsa-runtime64.so.1 + 0x9c764)
                                                 #16 0x00007f6d24d766a4 _ZSt8for_eachIN9__gnu_cxx17__normal_iteratorIPPN4core5AgentESt6vectorIS4_SaIS4_EEEE12DeleteObjectET0_T_SC_SB_ (libhsa-runtime64.so.1 + 0x996a4)
                                                 #17 0x00007f6d24d70c83 _ZN4core7Runtime6UnloadEv (libhsa-runtime64.so.1 + 0x93c83)
                                                 #18 0x00007f6d24d6b3a3 _ZN4core7Runtime7ReleaseEv (libhsa-runtime64.so.1 + 0x8e3a3)
                                                 #19 0x00007f6d24d43452 _ZN3HSA13hsa_shut_downEv (libhsa-runtime64.so.1 + 0x66452)
                                                 #20 0x00007f6d24d8df92 hsa_shut_down (libhsa-runtime64.so.1 + 0xb0f92)
                                                 #21 0x00005595cf192931 main (rocminfo + 0x8931)
                                                 #22 0x00007f6d24899caa __libc_start_main (libc.so.6 + 0x23caa)
                                                 #23 0x00005595cf18c0ba _start (rocminfo + 0x20ba)

                                                 Stack trace of thread 1764206:
                                                 #0  0x00007f6d2496b957 ioctl (libc.so.6 + 0xf5957)
                                                 #1  0x00007f6d2483b800 kmtIoctl (libhsakmt.so.1 + 0xb800)
                                                 #2  0x00007f6d2483591d hsaKmtWaitOnMultipleEvents (libhsakmt.so.1 + 0x591d)
                                                 #3  0x00007f6d24d87a66 _ZN4core6Signal7WaitAnyEjPK12hsa_signal_sPK22hsa_signal_condition_tPKlm16hsa_wait_state_tPl (libhsa-runtime64.so.1 + 0xaaa66)
                                                 #4  0x00007f6d24d532fa _ZN3AMD23hsa_amd_signal_wait_anyEjP12hsa_signal_sP22hsa_signal_condition_tPlm16hsa_wait_state_tS4_ (libhsa-runtime64.so.1 + 0x762fa)
                                                 #5  0x00007f6d24d6f286 _ZN4core7Runtime15AsyncEventsLoopEPv (libhsa-runtime64.so.1 + 0x92286)
                                                 #6  0x00007f6d24cf2597 _ZN2os16ThreadTrampolineEPv (libhsa-runtime64.so.1 + 0x15597)
                                                 #7  0x00007f6d24a44fea start_thread (libpthread.so.0 + 0x7fea)
                                                 #8  0x00007f6d24974edf __clone (libc.so.6 + 0xfeedf)
Jul 22 08:00:24 ernie systemd[1]: systemd-coredump@21-1764207-0.service: Succeeded.
Jul 22 08:01:00 ernie kernel: [drm] PCIE GART of 256M enabled (table at 0x000000F400000000).
Jul 22 08:01:00 ernie kernel: [drm] UVD and UVD ENC initialized successfully.
Jul 22 08:01:00 ernie kernel: [drm] VCE initialized successfully.
Jul 22 08:01:00 ernie kernel: amdgpu 0000:01:00.0: [drm] Cannot find any crtc or sizes
Jul 22 08:01:00 ernie kernel: Alloc host visible vram on small bar is not allowed
Jul 22 08:01:00 ernie kernel: Evicting PASID 0x8026 queues
Jul 22 08:01:00 ernie kernel: Evicting PASID 0x8026 queues

Afterwards the system ran for a while without any interaction from me. When I came back, I could no longer access my X11 session (the system did not react to keyboard input such as NumLock, and switching to a VT was not possible), so I had to REISUB:

Jul 22 08:52:38 ernie kernel: [drm] PCIE GART of 256M enabled (table at 0x000000F400000000).
Jul 22 08:52:38 ernie kernel: [drm] UVD and UVD ENC initialized successfully.
Jul 22 08:52:38 ernie kernel: [drm] VCE initialized successfully.
Jul 22 08:52:38 ernie kernel: amdgpu 0000:01:00.0: [drm] Cannot find any crtc or sizes
Jul 22 08:52:49 ernie kernel: [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] *ERROR* [CRTC:56:crtc-0] flip_done timed out
Jul 22 08:52:59 ernie kernel: [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] *ERROR* [PLANE:49:plane-3] flip_done timed out
Jul 22 08:53:40 ernie kernel: sysrq: Keyboard mode set to system default
Jul 22 08:53:41 ernie kernel: sysrq: Terminate All Tasks
Jul 22 08:53:41 ernie kernel: [drm] PCIE GART of 256M enabled (table at 0x000000F400000000).
Jul 22 08:53:41 ernie kernel: bpfilter: Loaded bpfilter_umh pid 1811427
Jul 22 08:53:41 ernie kernel: [drm] UVD and UVD ENC initialized successfully.
Jul 22 08:53:41 ernie kernel: [drm] VCE initialized successfully.
Jul 22 08:53:41 ernie kernel: amdgpu 0000:01:00.0: [drm] Cannot find any crtc or sizes
Jul 22 08:53:42 ernie kernel: sysrq: Kill All Tasks
Jul 22 08:53:42 ernie kernel: ------------[ cut here ]------------
Jul 22 08:53:42 ernie kernel: WARNING: CPU: 6 PID: 1430 at drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.c:6787 amdgpu_dm_atomic_commit_tail+0x20bd/0x2230 [amdgpu]
Jul 22 08:53:42 ernie kernel: Modules linked in: squashfs loop snd_seq_dummy snd_hrtimer snd_seq fuse nft_masq nf_conntrack_netbios_ns nf_conntrack_broadcast xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT xt_tcpudp nf_nat_tftp nft_objref nf_conntrack_tftp nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject >
Jul 22 08:53:42 ernie kernel:  snd_hwdep gspca_vc032x uvcvideo gspca_main kvm asus_wmi ecdh_generic cmac videobuf2_vmalloc md4 videobuf2_memops amd_iommu_v2 battery gpu_sched videobuf2_v4l2 ecc crc16 videobuf2_common irqbypass sparse_keymap ttm snd_pcm pcspkr rfkill videodev wmi_bmof sp5100_tco k10temp i2c_piix4 drm_kms_helper joydev mc snd_timer mousedev input>
Jul 22 08:53:42 ernie kernel:  pkcs8_key_parser
Jul 22 08:53:42 ernie kernel: CPU: 6 PID: 1430 Comm: X:sh5 Tainted: G                T 5.7.9 #2
Jul 22 08:53:42 ernie kernel: Hardware name: System manufacturer System Product Name/ROG STRIX B350-F GAMING, BIOS 5406 11/13/2019
Jul 22 08:53:42 ernie kernel: RIP: 0010:amdgpu_dm_atomic_commit_tail+0x20bd/0x2230 [amdgpu]
Jul 22 08:53:42 ernie kernel: Code: ff ff 41 8b 4c 24 60 48 c7 c2 60 26 2b c1 bf 02 00 00 00 48 c7 c6 80 81 32 c1 e8 5e 2f 4a ff 49 8b 4f 08 e9 bd e0 ff ff 0f 0b <0f> 0b e9 b0 ef ff ff 0f 0b e9 c9 ef ff ff 48 8b 85 68 fd ff ff 48
Jul 22 08:53:42 ernie kernel: RSP: 0018:ffffad3302d83870 EFLAGS: 00010002
Jul 22 08:53:42 ernie kernel: RAX: 0000000000000286 RBX: 0000000000000003 RCX: 0000000000000000
Jul 22 08:53:42 ernie kernel: RDX: 0000000000000002 RSI: 0000000000000202 RDI: 0000000000000000
Jul 22 08:53:42 ernie kernel: RBP: ffffad3302d83b60 R08: 0000000000000005 R09: 0000000000000000
Jul 22 08:53:42 ernie kernel: R10: ffffad3302d837d8 R11: ffffad3302d837dc R12: 0000000000000286
Jul 22 08:53:42 ernie kernel: R13: ffff9308d9249800 R14: ffff930651c83800 R15: ffff9308e4953080
Jul 22 08:53:42 ernie kernel: FS:  00007f919effd700(0000) GS:ffff9308f0780000(0000) knlGS:0000000000000000
Jul 22 08:53:42 ernie kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jul 22 08:53:42 ernie kernel: CR2: 0000556e92a96828 CR3: 0000000073a0a000 CR4: 00000000003406e0
Jul 22 08:53:42 ernie kernel: Call Trace:
Jul 22 08:53:42 ernie kernel:  commit_tail+0x94/0x130 [drm_kms_helper]
Jul 22 08:53:42 ernie kernel:  drm_atomic_helper_commit+0x113/0x140 [drm_kms_helper]
Jul 22 08:53:42 ernie kernel:  drm_client_modeset_commit_atomic+0x1c9/0x200 [drm]
Jul 22 08:53:42 ernie kernel:  drm_client_modeset_commit_locked+0x54/0x150 [drm]
Jul 22 08:53:42 ernie kernel:  drm_client_modeset_commit+0x24/0x40 [drm]
Jul 22 08:53:42 ernie kernel:  drm_fb_helper_set_par+0xa5/0xd0 [drm_kms_helper]
Jul 22 08:53:42 ernie kernel:  drm_fb_helper_hotplug_event.part.0+0xa3/0xc0 [drm_kms_helper]
Jul 22 08:53:42 ernie kernel:  amdgpu_driver_lastclose_kms+0xa/0x10 [amdgpu]
Jul 22 08:53:42 ernie kernel:  drm_release+0xd2/0x100 [drm]
Jul 22 08:53:42 ernie kernel:  __fput+0xe5/0x250
Jul 22 08:53:42 ernie kernel:  task_work_run+0x5f/0x80
Jul 22 08:53:42 ernie kernel:  do_exit+0x363/0xb40
Jul 22 08:53:42 ernie kernel:  do_group_exit+0x36/0xa0
Jul 22 08:53:42 ernie kernel:  get_signal+0x148/0x920
Jul 22 08:53:42 ernie kernel:  ? __handle_mm_fault+0xe54/0x18f0
Jul 22 08:53:42 ernie kernel:  do_signal+0x3d/0x720
Jul 22 08:53:42 ernie kernel:  ? preempt_count_add+0x49/0xa0
Jul 22 08:53:42 ernie kernel:  prepare_exit_to_usermode+0xf2/0x170
Jul 22 08:53:42 ernie kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Jul 22 08:53:42 ernie kernel: RIP: 0033:0x7f91b6011ad5
Jul 22 08:53:42 ernie kernel: Code: Bad RIP value.
Jul 22 08:53:42 ernie kernel: RSP: 002b:00007f919effcae0 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
Jul 22 08:53:42 ernie kernel: RAX: fffffffffffffe00 RBX: 0000000000000000 RCX: 00007f91b6011ad5
Jul 22 08:53:42 ernie kernel: RDX: 0000000000000000 RSI: 0000000000000080 RDI: 000055f789922a24
Jul 22 08:53:42 ernie kernel: RBP: 000055f7899229f8 R08: 0000000000000000 R09: 0000000000000000
Jul 22 08:53:42 ernie kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 00007f919effcb10
Jul 22 08:53:42 ernie kernel: R13: 000055f7899229d0 R14: 0000000000000001 R15: 000055f789922a24
Jul 22 08:53:42 ernie kernel: ---[ end trace 04201852eb3a754f ]---
Jul 22 08:53:42 ernie kernel: ------------[ cut here ]------------
Jul 22 08:53:42 ernie kernel: WARNING: CPU: 6 PID: 1430 at drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.c:6389 amdgpu_dm_atomic_commit_tail+0x20c4/0x2230 [amdgpu]
Jul 22 08:53:42 ernie kernel: Modules linked in: squashfs loop snd_seq_dummy snd_hrtimer snd_seq fuse nft_masq nf_conntrack_netbios_ns nf_conntrack_broadcast xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT xt_tcpudp nf_nat_tftp nft_objref nf_conntrack_tftp nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject >
Jul 22 08:53:42 ernie kernel:  snd_hwdep gspca_vc032x uvcvideo gspca_main kvm asus_wmi ecdh_generic cmac videobuf2_vmalloc md4 videobuf2_memops amd_iommu_v2 battery gpu_sched videobuf2_v4l2 ecc crc16 videobuf2_common irqbypass sparse_keymap ttm snd_pcm pcspkr rfkill videodev wmi_bmof sp5100_tco k10temp i2c_piix4 drm_kms_helper joydev mc snd_timer mousedev input>
Jul 22 08:53:42 ernie kernel:  pkcs8_key_parser
Jul 22 08:53:42 ernie kernel: CPU: 6 PID: 1430 Comm: X:sh5 Tainted: G        W       T 5.7.9 #2
Jul 22 08:53:42 ernie kernel: Hardware name: System manufacturer System Product Name/ROG STRIX B350-F GAMING, BIOS 5406 11/13/2019
Jul 22 08:53:42 ernie kernel: RIP: 0010:amdgpu_dm_atomic_commit_tail+0x20c4/0x2230 [amdgpu]
Jul 22 08:53:42 ernie kernel: Code: 48 c7 c2 60 26 2b c1 bf 02 00 00 00 48 c7 c6 80 81 32 c1 e8 5e 2f 4a ff 49 8b 4f 08 e9 bd e0 ff ff 0f 0b 0f 0b e9 b0 ef ff ff <0f> 0b e9 c9 ef ff ff 48 8b 85 68 fd ff ff 48 8d 8d e0 fd ff ff 48
Jul 22 08:53:42 ernie kernel: RSP: 0018:ffffad3302d83870 EFLAGS: 00010082
Jul 22 08:53:42 ernie kernel: RAX: 0000000000000001 RBX: 0000000000000003 RCX: 0000000000000000
Jul 22 08:53:42 ernie kernel: RDX: 0000000000000002 RSI: 0000000000000202 RDI: 0000000000000000
Jul 22 08:53:42 ernie kernel: RBP: ffffad3302d83b60 R08: 0000000000000005 R09: 0000000000000000
Jul 22 08:53:42 ernie kernel: R10: ffffad3302d837d8 R11: ffffad3302d837dc R12: 0000000000000286
Jul 22 08:53:42 ernie kernel: R13: ffff9308d9249800 R14: ffff930651c83800 R15: ffff9308e4953080
Jul 22 08:53:42 ernie kernel: FS:  00007f919effd700(0000) GS:ffff9308f0780000(0000) knlGS:0000000000000000
Jul 22 08:53:42 ernie kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jul 22 08:53:42 ernie kernel: CR2: 00007f91b6011aab CR3: 0000000073a0a000 CR4: 00000000003406e0
Jul 22 08:53:42 ernie kernel: Call Trace:
Jul 22 08:53:42 ernie kernel:  commit_tail+0x94/0x130 [drm_kms_helper]
Jul 22 08:53:42 ernie kernel:  drm_atomic_helper_commit+0x113/0x140 [drm_kms_helper]
Jul 22 08:53:42 ernie kernel:  drm_client_modeset_commit_atomic+0x1c9/0x200 [drm]
Jul 22 08:53:42 ernie kernel:  drm_client_modeset_commit_locked+0x54/0x150 [drm]
Jul 22 08:53:42 ernie kernel:  drm_client_modeset_commit+0x24/0x40 [drm]
Jul 22 08:53:42 ernie kernel:  drm_fb_helper_set_par+0xa5/0xd0 [drm_kms_helper]
Jul 22 08:53:42 ernie kernel:  drm_fb_helper_hotplug_event.part.0+0xa3/0xc0 [drm_kms_helper]
Jul 22 08:53:42 ernie kernel:  amdgpu_driver_lastclose_kms+0xa/0x10 [amdgpu]
Jul 22 08:53:42 ernie kernel:  drm_release+0xd2/0x100 [drm]
Jul 22 08:53:42 ernie kernel:  __fput+0xe5/0x250
Jul 22 08:53:42 ernie kernel:  task_work_run+0x5f/0x80
Jul 22 08:53:42 ernie kernel:  do_exit+0x363/0xb40
Jul 22 08:53:42 ernie kernel:  do_group_exit+0x36/0xa0
Jul 22 08:53:42 ernie kernel:  get_signal+0x148/0x920
Jul 22 08:53:42 ernie kernel:  ? __handle_mm_fault+0xe54/0x18f0
Jul 22 08:53:42 ernie kernel:  do_signal+0x3d/0x720
Jul 22 08:53:42 ernie kernel:  ? preempt_count_add+0x49/0xa0
Jul 22 08:53:42 ernie kernel:  prepare_exit_to_usermode+0xf2/0x170
Jul 22 08:53:42 ernie kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Jul 22 08:53:42 ernie kernel: RIP: 0033:0x7f91b6011ad5
Jul 22 08:53:42 ernie kernel: Code: Bad RIP value.
Jul 22 08:53:42 ernie kernel: RSP: 002b:00007f919effcae0 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
Jul 22 08:53:42 ernie kernel: RAX: fffffffffffffe00 RBX: 0000000000000000 RCX: 00007f91b6011ad5
Jul 22 08:53:42 ernie kernel: RDX: 0000000000000000 RSI: 0000000000000080 RDI: 000055f789922a24
Jul 22 08:53:42 ernie kernel: RBP: 000055f7899229f8 R08: 0000000000000000 R09: 0000000000000000
Jul 22 08:53:42 ernie kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 00007f919effcb10
Jul 22 08:53:42 ernie kernel: R13: 000055f7899229d0 R14: 0000000000000001 R15: 000055f789922a24
Jul 22 08:53:42 ernie kernel: ---[ end trace 04201852eb3a7550 ]---

Other information

I also see exceptions and segfaults in Clover and ROCm's OpenCL implementation when executing clinfo and rocminfo.

I also see the system hang in a very similar manner when trying to use OpenCL from the JVM (running the Neanderthal examples), but since that is much higher-level, I do not have a useful MWE for it. When trying this, I also regularly encountered OpenCL "out of memory" errors.

ghost commented 2 years ago

> When trying this, I also regularly encountered OpenCL "out of memory" errors.

I encountered a similar error on ROCm 4.5.2. The first time, I encountered a system freeze that appeared to be the result of running out of RAM (32 GB)! After that, whenever I try to run, I just get hipErrorOutOfMemory.

Maybe I need to try downgrading?

hipconfig:

 HIP version  : 4.4.21432-f9dccde4

== hipconfig
HIP_PATH     : /opt/rocm-4.5.2/hip
ROCM_PATH    : /opt/rocm-4.5.2
HIP_COMPILER : clang
HIP_PLATFORM : amd
HIP_RUNTIME  : rocclr
CPP_CONFIG   :  -D__HIP_PLATFORM_HCC__= -D__HIP_PLATFORM_AMD__= -I/opt/rocm-4.5.2/hip/include -I/opt/rocm-4.5.2/llvm/bin/../lib/clang/13.0.0 -I/opt/rocm-4.5.2/hsa/include

== hip-clang
HSA_PATH         : /opt/rocm-4.5.2/hsa
HIP_CLANG_PATH   : /opt/rocm-4.5.2/llvm/bin
AMD clang version 13.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-4.5.2 21432 9bbd96fd1936641cd47defd8022edafd063019d5)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/rocm-4.5.2/llvm/bin
AMD LLVM version 13.0.0git
  Optimized build.
  Default target: x86_64-unknown-linux-gnu
  Host CPU: znver2

  Registered Targets:
    amdgcn - AMD GCN GPUs
    r600   - AMD GPUs HD2XXX-HD6XXX
    x86    - 32-bit X86: Pentium-Pro and above
    x86-64 - 64-bit X86: EM64T and AMD64
hip-clang-cxxflags :  -std=c++11 -isystem "/opt/rocm-4.5.2/llvm/lib/clang/13.0.0/include/.." -isystem /opt/rocm-4.5.2/hsa/include -isystem "/opt/rocm-4.5.2/hip/include" -O3
hip-clang-ldflags  : --driver-mode=g++ -L"/opt/rocm-4.5.2/hip/lib" -O3 -lgcc_s -lgcc -lpthread -lm -lrt

=== Environment Variables
PATH=/home/user1/.vscode-server/bin/fe719cd3e5825bf14e14182fddeb88ee8daf044f/bin:/home/user1/.vscode-server/bin/fe719cd3e5825bf14e14182fddeb88ee8daf044f/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games

== Linux Kernel
Hostname     : roxane
Linux roxane 5.10.0-1052-oem #54-Ubuntu SMP Tue Nov 23 09:06:13 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 20.04.3 LTS
Release:        20.04
Codename:       focal

ppanchad-amd commented 6 months ago

@devurandom, sorry for the lack of response. Please try the latest ROCm 6.0.2 (HIP 6.0.32831) to see whether your issue still exists. If it is resolved, please close the ticket. Thanks.

devurandom commented 6 months ago

Sorry, it has been too long and I no longer have access to that system.