ROCm / ROCT-Thunk-Interface

ROCm's Thunk Interface

System crash with KFDMemoryTest.LargestSysBufferTest #76

Closed tpkessler closed 1 month ago

tpkessler commented 2 years ago

Hi! I'm the main contributor to a community-driven effort to build ROCm for Arch Linux (https://github.com/rocm-arch/rocm-arch). Recently, we've started adding the tests provided in the repos to our package build scheme. When running the KFD tests, I experience a system crash at KFDMemoryTest.LargestSysBufferTest. All applications are killed, and after a brief black screen I'm greeted by the login manager.

First, the cmake, build and test output:

```
-- The C compiler identification is GNU 12.2.0
-- The CXX compiler identification is GNU 12.2.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found PkgConfig: /usr/bin/pkg-config (found version "1.8.0")
-- Detected distribution: arch:
-- LIBC:/usr/lib/libc.so.6
-- NUMA:/usr/lib/libnuma.so
-- Checking for module 'libdrm'
-- Found libdrm, version 2.4.114
-- Checking for module 'libdrm_amdgpu'
-- Found libdrm_amdgpu, version 2.4.114
-- LIBGCC:/usr/lib/libgcc_s.so.1
-- Configuring done
-- Generating done
-- Build files have been written to: /home/torsten/Dokumente/rocm-arch/hsakmt-roct/src/build
[ 5%] Building C object CMakeFiles/hsakmt.dir/src/debug.c.o
[ 17%] Building C object CMakeFiles/hsakmt.dir/src/pmc_table.c.o
[ 17%] Building C object CMakeFiles/hsakmt.dir/src/memory.c.o
[ 29%] Building C object CMakeFiles/hsakmt.dir/src/libhsakmt.c.o
[ 29%] Building C object CMakeFiles/hsakmt.dir/src/globals.c.o
[ 35%] Building C object CMakeFiles/hsakmt.dir/src/queues.c.o
[ 41%] Building C object CMakeFiles/hsakmt.dir/src/topology.c.o
[ 47%] Building C object CMakeFiles/hsakmt.dir/src/time.c.o
[ 52%] Building C object CMakeFiles/hsakmt.dir/src/events.c.o
[ 58%] Building C object CMakeFiles/hsakmt.dir/src/fmm.c.o
[ 64%] Building C object CMakeFiles/hsakmt.dir/src/openclose.c.o
[ 70%] Building C object CMakeFiles/hsakmt.dir/src/perfctr.c.o
[ 82%] Building C object CMakeFiles/hsakmt.dir/src/rbtree.c.o
[ 82%] Building C object CMakeFiles/hsakmt.dir/src/spm.c.o
[ 88%] Building C object CMakeFiles/hsakmt.dir/src/version.c.o
[ 94%] Building C object CMakeFiles/hsakmt.dir/src/svm.c.o
[100%] Linking C shared library libhsakmt.so
[100%] Built target hsakmt
==> Beginne check()...
-- Install configuration: "None"
-- Installing: ROCT-Thunk-Interface-rocm-5.3.2/tmp.J2lIsOqB87/opt/rocm/lib/libhsakmt.so.1.0.6
-- Installing: ROCT-Thunk-Interface-rocm-5.3.2/tmp.J2lIsOqB87/opt/rocm/lib/libhsakmt.so.1
-- Installing: ROCT-Thunk-Interface-rocm-5.3.2/tmp.J2lIsOqB87/opt/rocm/lib/libhsakmt.so
-- Installing: ROCT-Thunk-Interface-rocm-5.3.2/tmp.J2lIsOqB87/opt/rocm/include/hsakmt
-- Installing: ROCT-Thunk-Interface-rocm-5.3.2/tmp.J2lIsOqB87/opt/rocm/include/hsakmt/hsakmt.h
-- Installing: ROCT-Thunk-Interface-rocm-5.3.2/tmp.J2lIsOqB87/opt/rocm/include/hsakmt/hsakmttypes.h
-- Up-to-date: ROCT-Thunk-Interface-rocm-5.3.2/tmp.J2lIsOqB87/opt/rocm/./include
-- Installing: ROCT-Thunk-Interface-rocm-5.3.2/tmp.J2lIsOqB87/opt/rocm/./include/hsakmt.h
-- Installing: ROCT-Thunk-Interface-rocm-5.3.2/tmp.J2lIsOqB87/opt/rocm/./include/hsakmttypes.h
-- Installing: ROCT-Thunk-Interface-rocm-5.3.2/tmp.J2lIsOqB87/opt/rocm/lib/cmake/hsakmt/hsakmtTargets.cmake
-- Installing: ROCT-Thunk-Interface-rocm-5.3.2/tmp.J2lIsOqB87/opt/rocm/lib/cmake/hsakmt/hsakmtTargets-none.cmake
-- Installing: ROCT-Thunk-Interface-rocm-5.3.2/tmp.J2lIsOqB87/opt/rocm/lib/cmake/hsakmt/hsakmt-config.cmake
-- Installing: ROCT-Thunk-Interface-rocm-5.3.2/tmp.J2lIsOqB87/opt/rocm/lib/cmake/hsakmt/hsakmt-config-version.cmake
-- Installing: ROCT-Thunk-Interface-rocm-5.3.2/tmp.J2lIsOqB87/opt/rocm/share/pkgconfig/libhsakmt.pc
-- Installing: ROCT-Thunk-Interface-rocm-5.3.2/tmp.J2lIsOqB87/opt/rocm/share/doc/hsakmt/LICENSE.md
-- The C compiler identification is GNU 12.2.0
-- The CXX compiler identification is GNU 12.2.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found PkgConfig: /usr/bin/pkg-config (found version "1.8.0")
-- Checking for module 'libdrm'
-- Found libdrm, version 2.4.114
-- Checking for module 'libdrm_amdgpu'
-- Found libdrm_amdgpu, version 2.4.114
-- Couldn't find Lightning build in compute directory. Searching LLVM_DIR then defaulting to system LLVM install if still not found...
-- Performing Test Terminfo_LINKABLE
-- Performing Test Terminfo_LINKABLE - Success
-- Found Terminfo: /usr/lib/libtinfo.so
-- Found ZLIB: /usr/lib/libz.so (found version "1.2.13")
-- Found LibXml2: /usr/lib/libxml2.so (found version "2.10.3")
-- Found LLVM 15.0.0git
-- Using LLVMConfig.cmake in: /opt/rocm/llvm/lib/cmake/llvm
-- PROJECT_SOURCE_DIR:/home/torsten/Dokumente/rocm-arch/hsakmt-roct/src/ROCT-Thunk-Interface-rocm-5.3.2/tests/kfdtest
-- Configuring done
-- Generating done
-- Build files have been written to: /home/torsten/Dokumente/rocm-arch/hsakmt-roct/src/kfd-build
[ 2%] Building CXX object CMakeFiles/kfdtest.dir/gtest-1.6.0/gtest-all.cpp.o
[ 4%] Building CXX object CMakeFiles/kfdtest.dir/src/AqlQueue.cpp.o
[ 7%] Building CXX object CMakeFiles/kfdtest.dir/src/BasePacket.cpp.o
[ 9%] Building CXX object CMakeFiles/kfdtest.dir/src/BaseQueue.cpp.o
[ 11%] Building CXX object CMakeFiles/kfdtest.dir/src/IndirectBuffer.cpp.o
[ 16%] Building CXX object CMakeFiles/kfdtest.dir/src/GoogleTestExtension.cpp.o
[ 16%] Building CXX object CMakeFiles/kfdtest.dir/src/PM4Queue.cpp.o
[ 21%] Building CXX object CMakeFiles/kfdtest.dir/src/LinuxOSWrapper.cpp.o
[ 21%] Building CXX object CMakeFiles/kfdtest.dir/src/ShaderStore.cpp.o
[ 23%] Building CXX object CMakeFiles/kfdtest.dir/src/Assemble.cpp.o
[ 26%] Building CXX object CMakeFiles/kfdtest.dir/src/Dispatch.cpp.o
[ 30%] Building CXX object CMakeFiles/kfdtest.dir/src/PM4Packet.cpp.o
[ 30%] Building CXX object CMakeFiles/kfdtest.dir/src/RDMAUtil.cpp.o
[ 33%] Building CXX object CMakeFiles/kfdtest.dir/src/SDMAPacket.cpp.o
[ 35%] Building CXX object CMakeFiles/kfdtest.dir/src/KFDBaseComponentTest.cpp.o
[ 40%] Building CXX object CMakeFiles/kfdtest.dir/src/SDMAQueue.cpp.o
[ 40%] Building CXX object CMakeFiles/kfdtest.dir/src/KFDMultiProcessTest.cpp.o
[ 42%] Building CXX object CMakeFiles/kfdtest.dir/src/KFDTestMain.cpp.o
[ 47%] Building CXX object CMakeFiles/kfdtest.dir/src/KFDTopologyTest.cpp.o
[ 52%] Building CXX object CMakeFiles/kfdtest.dir/src/KFDTestUtilQueue.cpp.o
[ 50%] Building CXX object CMakeFiles/kfdtest.dir/src/KFDMemoryTest.cpp.o
[ 52%] Building CXX object CMakeFiles/kfdtest.dir/src/KFDTestUtil.cpp.o
[ 54%] Building CXX object CMakeFiles/kfdtest.dir/src/KFDLocalMemoryTest.cpp.o
[ 57%] Building CXX object CMakeFiles/kfdtest.dir/src/KFDOpenCloseKFDTest.cpp.o
[ 59%] Building CXX object CMakeFiles/kfdtest.dir/src/KFDEventTest.cpp.o
[ 61%] Building CXX object CMakeFiles/kfdtest.dir/src/KFDQMTest.cpp.o
[ 64%] Building CXX object CMakeFiles/kfdtest.dir/src/KFDCWSRTest.cpp.o
[ 66%] Building CXX object CMakeFiles/kfdtest.dir/src/KFDExceptionTest.cpp.o
[ 69%] Building CXX object CMakeFiles/kfdtest.dir/src/KFDGraphicsInterop.cpp.o
[ 71%] Building CXX object CMakeFiles/kfdtest.dir/src/KFDPerfCounters.cpp.o
[ 73%] Building CXX object CMakeFiles/kfdtest.dir/src/KFDGWSTest.cpp.o
[ 76%] Building CXX object CMakeFiles/kfdtest.dir/src/KFDIPCTest.cpp.o
[ 78%] Building CXX object CMakeFiles/kfdtest.dir/src/KFDASMTest.cpp.o
[ 80%] Building CXX object CMakeFiles/kfdtest.dir/src/KFDEvictTest.cpp.o
[ 83%] Building CXX object CMakeFiles/kfdtest.dir/src/KFDHWSTest.cpp.o
[ 85%] Building CXX object CMakeFiles/kfdtest.dir/src/KFDPerformanceTest.cpp.o
[ 88%] Building CXX object CMakeFiles/kfdtest.dir/src/KFDPMTest.cpp.o
[ 90%] Building CXX object CMakeFiles/kfdtest.dir/src/KFDSVMRangeTest.cpp.o
[ 92%] Building CXX object CMakeFiles/kfdtest.dir/src/KFDSVMEvictTest.cpp.o
[ 95%] Building CXX object CMakeFiles/kfdtest.dir/src/KFDRASTest.cpp.o
[ 97%] Building CXX object CMakeFiles/kfdtest.dir/src/RDMATest.cpp.o
[100%] Linking CXX executable kfdtest
[100%] Built target kfdtest
++++ Starting testing node 1 (vega10) ++++
Note: Google Test filter = -KFDEventTest.MeasureInterruptConsumption:KFDPMTest.SuspendWithActiveProcess:KFDPMTest.SuspendWithIdleQueue:KFDPMTest.SuspendWithIdleQueueAfterWork:KFDLocalMemoryTest.Fragmentation:KFDQMTest.BasicCuMaskingLinear:RDMATest.GPUDirect:KFDRASTest.*:KFDLocalMemoryTest.CheckZeroInitializationVram:KFDQMTest.GPUDoorbellWrite:KFDQMTest.mGPUShareBO:KFDQMTest.SdmaEventInterrupt:KFDMemoryTest.CacheInvalidateOnRemoteWrite:KFDEvictTest.BurstyTest:KFDHWSTest.*:KFDSVMRangeTest.ReadOnlyRangeTest:KFDIPCTest.BasicTest:KFDIPCTest.CMABasicTest:KFDIPCTest.CrossMemoryAttachTest:KFDQMTest.AllSdmaQueues
[==========] Running 119 tests from 17 test cases.
[----------] Global test environment set-up.
[----------] 1 test from KFDCloseKFDTest
[ RUN ] KFDCloseKFDTest.CloseAClosedKfd
[ OK ] KFDCloseKFDTest.CloseAClosedKfd (0 ms)
[----------] 1 test from KFDCloseKFDTest (0 ms total)
[----------] 3 tests from KFDOpenCloseKFDTest
[ RUN ] KFDOpenCloseKFDTest.OpenAlreadyOpenedKFD
[ OK ] KFDOpenCloseKFDTest.OpenAlreadyOpenedKFD (1 ms)
[ RUN ] KFDOpenCloseKFDTest.OpenCloseKFD
[ OK ] KFDOpenCloseKFDTest.OpenCloseKFD (0 ms)
[ RUN ] KFDOpenCloseKFDTest.InvalidKFDHandleTest
[ OK ] KFDOpenCloseKFDTest.InvalidKFDHandleTest (0 ms)
[----------] 3 tests from KFDOpenCloseKFDTest (1 ms total)
[----------] 7 tests from KFDTopologyTest
[ RUN ] KFDTopologyTest.BasicTest
[ OK ] KFDTopologyTest.BasicTest (6 ms)
[ RUN ] KFDTopologyTest.GetNodePropertiesInvalidParams
[ OK ] KFDTopologyTest.GetNodePropertiesInvalidParams (5 ms)
[ RUN ] KFDTopologyTest.GetNodePropertiesInvalidNodeNum
[ OK ] KFDTopologyTest.GetNodePropertiesInvalidNodeNum (5 ms)
[ RUN ] KFDTopologyTest.GetNodeMemoryProperties
[ OK ] KFDTopologyTest.GetNodeMemoryProperties (5 ms)
[ RUN ] KFDTopologyTest.GpuvmApertureValidate
[ OK ] KFDTopologyTest.GpuvmApertureValidate (5 ms)
[ RUN ] KFDTopologyTest.GetNodeCacheProperties
[ OK ] KFDTopologyTest.GetNodeCacheProperties (8 ms)
[ RUN ] KFDTopologyTest.GetNodeIoLinkProperties
[ OK ] KFDTopologyTest.GetNodeIoLinkProperties (5 ms)
[----------] 7 tests from KFDTopologyTest (39 ms total)
[----------] 29 tests from KFDMemoryTest
[ RUN ] KFDMemoryTest.MMapLarge
[ OK ] KFDMemoryTest.MMapLarge (11897 ms)
[ RUN ] KFDMemoryTest.MapUnmapToNodes
[ OK ] KFDMemoryTest.MapUnmapToNodes (5 ms)
[ RUN ] KFDMemoryTest.MapMemoryToGPU
[ OK ] KFDMemoryTest.MapMemoryToGPU (5 ms)
[ RUN ] KFDMemoryTest.InvalidMemoryPointerAlloc
[ OK ] KFDMemoryTest.InvalidMemoryPointerAlloc (5 ms)
[ RUN ] KFDMemoryTest.ZeroMemorySizeAlloc
[ OK ] KFDMemoryTest.ZeroMemorySizeAlloc (4 ms)
[ RUN ] KFDMemoryTest.MemoryAlloc
[ OK ] KFDMemoryTest.MemoryAlloc (5 ms)
[ RUN ] KFDMemoryTest.AccessPPRMem
[ OK ] KFDMemoryTest.AccessPPRMem (5 ms)
[ RUN ] KFDMemoryTest.MemoryRegister
[ OK ] KFDMemoryTest.MemoryRegister (26 ms)
[ RUN ] KFDMemoryTest.MemoryRegisterSamePtr
[ OK ] KFDMemoryTest.MemoryRegisterSamePtr (34 ms)
[ RUN ] KFDMemoryTest.FlatScratchAccess
[ OK ] KFDMemoryTest.FlatScratchAccess (33 ms)
[ RUN ] KFDMemoryTest.GetTileConfigTest
[ OK ] KFDMemoryTest.GetTileConfigTest (5 ms)
[ RUN ] KFDMemoryTest.LargestSysBufferTest
```

Kernel error log

```
amdgpu: init_user_pages: Failed to get user pages: -14
Out of memory: Killed process 47768 (kfdtest) total-vm:29838556kB, anon-rss:29386608kB, file-rss:4kB, shmem-rss:8kB, UID:1000 pgtables:57592kB oom_score_adj:0
oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=user.slice,mems_allowed=0,global_oom,task_memcg=/user.slice/user-1000.slice/session-6.scope,task=kfdtest,pid=47768,uid=1000
0 pages hwpoisoned
169859 pages reserved
0 pages HighMem/MovableOnly
8369380 pages RAM
Total swap = 0kB
Free swap = 0kB
0 pages in swap cache
521176 total pagecache pages
Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Node 0 Normal: 840*4kB (UME) 615*8kB (UME) 366*16kB (UME) 652*32kB (UME) 279*64kB (UME) 66*128kB (UME) 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 61304kB
Node 0 DMA32: 2*4kB (UM) 2*8kB (UM) 2*16kB (UM) 0*32kB 2*64kB (UM) 0*128kB 1*256kB (M) 0*512kB 1*1024kB (M) 1*2048kB (M) 29*4096kB (UM) = 122296kB
Node 0 DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 1*1024kB (U) 1*2048kB (M) 2*4096kB (M) = 11264kB
lowmem_reserve[]: 0 0 0 0 0
Node 0 Normal free:61052kB boost:0kB min:61420kB low:91192kB high:120964kB reserved_highatomic:0KB active_anon:28480028kB inactive_anon:436104kB active_file:4700kB inactive_file:4396kB unevictable:32kB writepending:2848kB present:30395392kB managed:29782504kB mlocked:32kB bounce:>
lowmem_reserve[]: 0 0 29077 29077 29077
Node 0 DMA32 free:122296kB boost:0kB min:6128kB low:9096kB high:12064kB reserved_highatomic:0KB active_anon:2865600kB inactive_anon:304kB active_file:0kB inactive_file:124kB unevictable:0kB writepending:0kB present:3066132kB managed:3000220kB mlocked:0kB bounce:0kB free_pcp:248kB>
lowmem_reserve[]: 0 2902 31979 31979 31979
Node 0 DMA free:11264kB boost:0kB min:28kB low:40kB high:52kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15996kB managed:15360kB mlocked:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
Node 0 active_anon:31345624kB inactive_anon:436412kB active_file:4572kB inactive_file:5032kB unevictable:32kB isolated(anon):0kB isolated(file):0kB mapped:2156kB dirty:2848kB writeback:0kB shmem:2075440kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB kernel_>
active_anon:7836406 inactive_anon:109103 isolated_anon:0
active_file:1143 inactive_file:1258 isolated_file:0
unevictable:8 dirty:712 writeback:0
slab_reclaimable:16279 slab_unreclaimable:40760
mapped:539 shmem:518860 pagetables:16151 bounce:0
kernel_misc_reclaimable:0
free:48653 free_pcp:62 free_cma:0
Mem-Info:
R13: 0000000000000003 R14: 00007ffeee021fc8 R15: 0000000000000042
R10: 0000000000000000 R11: 0000000000000246 R12: 00000000c0284b16
RBP: 00007ffeee021f20 R08: 00007ffeee021fc8 R09: 00000000c4000004
RDX: 00007ffeee021f20 RSI: 00000000c0284b16 RDI: 0000000000000003
RAX: ffffffffffffffda RBX: 00000000c4000004 RCX: 00007ffb0baa1c0f
RSP: 002b:00007ffeee021e70 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
Code: Unable to access opcode bytes at RIP 0x7ffb0baa1be5.
RIP: 0033:0x7ffb0baa1c0f
entry_SYSCALL_64_after_hwframe+0x63/0xcd
? exit_to_user_mode_prepare+0x33/0x90
? do_syscall_64+0x71/0x90
? do_syscall_64+0x71/0x90
? do_syscall_64+0x71/0x90
do_syscall_64+0x62/0x90
__se_sys_ioctl+0x6d/0xb0
? do_syscall_64+0x71/0x90
? syscall_exit_to_user_mode+0x28/0xd0
? exit_to_user_mode_prepare+0x33/0x90
? kfd_ioctl_acquire_vm+0xa0/0xa0 [amdgpu]
kfd_ioctl+0x24c/0x3e0 [amdgpu]
kfd_ioctl_alloc_memory_of_gpu+0x1d9/0x2f0 [amdgpu]
amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu+0x56e/0x720 [amdgpu]
init_user_pages+0xbd/0x250 [amdgpu]
amdgpu_ttm_tt_get_user_pages+0xc8/0x160 [amdgpu]
amdgpu_hmm_range_get_pages+0x152/0x2e0 [amdgpu]
hmm_range_fault+0x88/0xc0
walk_page_range+0x83/0x1e0
__walk_page_range+0x60/0x1b0
walk_pgd_range+0x111/0x170
walk_p4d_range+0x1d2/0x7a0
hmm_vma_walk_hole+0x160/0x1f0
handle_mm_fault+0xf7/0x250
__handle_mm_fault+0x709/0x8c0
do_anonymous_page+0x20f/0x5e0
vma_alloc_folio+0x29e/0x3d0
__folio_alloc+0xf/0x30
__alloc_pages+0x255/0x2f0
__alloc_pages_slowpath+0xa93/0xe40
out_of_memory+0x317/0x430
oom_kill_process+0x154/0x270
dump_header+0x50/0x260
dump_stack_lvl+0x45/0x5a
```

rocminfo

```
ROCk module is loaded
=====================
HSA System Attributes
=====================
Runtime Version: 1.1
System Timestamp Freq.: 1000.000000MHz
Sig. Max Wait Duration: 18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model: LARGE
System Endianness: LITTLE

==========
HSA Agents
==========
*******
Agent 1
*******
  Name: AMD Ryzen 9 5900X 12-Core Processor
  Uuid: CPU-XX
  Marketing Name: AMD Ryzen 9 5900X 12-Core Processor
  Vendor Name: CPU
  Feature: None specified
  Profile: FULL_PROFILE
  Float Round Mode: NEAR
  Max Queue Number: 0(0x0)
  Queue Min Size: 0(0x0)
  Queue Max Size: 0(0x0)
  Queue Type: MULTI
  Node: 0
  Device Type: CPU
  Cache Info:
    L1: 32768(0x8000) KB
  Chip ID: 0(0x0)
  ASIC Revision: 0(0x0)
  Cacheline Size: 64(0x40)
  Max Clock Freq. (MHz): 4951
  BDFID: 0
  Internal Node ID: 0
  Compute Unit: 24
  SIMDs per CU: 0
  Shader Engines: 0
  Shader Arrs. per Eng.: 0
  WatchPts on Addr. Ranges:1
  Features: None
  Pool Info:
    Pool 1
      Segment: GLOBAL; FLAGS: FINE GRAINED
      Size: 32798084(0x1f47584) KB
      Allocatable: TRUE
      Alloc Granule: 4KB
      Alloc Alignment: 4KB
      Accessible by all: TRUE
    Pool 2
      Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size: 32798084(0x1f47584) KB
      Allocatable: TRUE
      Alloc Granule: 4KB
      Alloc Alignment: 4KB
      Accessible by all: TRUE
    Pool 3
      Segment: GLOBAL; FLAGS: COARSE GRAINED
      Size: 32798084(0x1f47584) KB
      Allocatable: TRUE
      Alloc Granule: 4KB
      Alloc Alignment: 4KB
      Accessible by all: TRUE
  ISA Info:
*******
Agent 2
*******
  Name: gfx900
  Uuid: GPU-0213f2a912ee21a4
  Marketing Name: AMD Radeon RX Vega
  Vendor Name: AMD
  Feature: KERNEL_DISPATCH
  Profile: BASE_PROFILE
  Float Round Mode: NEAR
  Max Queue Number: 128(0x80)
  Queue Min Size: 64(0x40)
  Queue Max Size: 131072(0x20000)
  Queue Type: MULTI
  Node: 1
  Device Type: GPU
  Cache Info:
    L1: 16(0x10) KB
    L2: 4096(0x1000) KB
  Chip ID: 26751(0x687f)
  ASIC Revision: 1(0x1)
  Cacheline Size: 64(0x40)
  Max Clock Freq. (MHz): 1590
  BDFID: 10496
  Internal Node ID: 1
  Compute Unit: 56
  SIMDs per CU: 4
  Shader Engines: 4
  Shader Arrs. per Eng.: 1
  WatchPts on Addr. Ranges:4
  Features: KERNEL_DISPATCH
  Fast F16 Operation: TRUE
  Wavefront Size: 64(0x40)
  Workgroup Max Size: 1024(0x400)
  Workgroup Max Size per Dimension:
    x 1024(0x400)
    y 1024(0x400)
    z 1024(0x400)
  Max Waves Per CU: 40(0x28)
  Max Work-item Per CU: 2560(0xa00)
  Grid Max Size: 4294967295(0xffffffff)
  Grid Max Size per Dimension:
    x 4294967295(0xffffffff)
    y 4294967295(0xffffffff)
    z 4294967295(0xffffffff)
  Max fbarriers/Workgrp: 32
  Pool Info:
    Pool 1
      Segment: GLOBAL; FLAGS: COARSE GRAINED
      Size: 8372224(0x7fc000) KB
      Allocatable: TRUE
      Alloc Granule: 4KB
      Alloc Alignment: 4KB
      Accessible by all: FALSE
    Pool 2
      Segment: GROUP
      Size: 64(0x40) KB
      Allocatable: FALSE
      Alloc Granule: 0KB
      Alloc Alignment: 0KB
      Accessible by all: FALSE
  ISA Info:
    ISA 1
      Name: amdgcn-amd-amdhsa--gfx900:xnack-
      Machine Models: HSA_MACHINE_MODEL_LARGE
      Profiles: HSA_PROFILE_BASE
      Default Rounding Mode: NEAR
      Default Rounding Mode: NEAR
      Fast f16: TRUE
      Workgroup Max Size: 1024(0x400)
      Workgroup Max Size per Dimension:
        x 1024(0x400)
        y 1024(0x400)
        z 1024(0x400)
      Grid Max Size: 4294967295(0xffffffff)
      Grid Max Size per Dimension:
        x 4294967295(0xffffffff)
        y 4294967295(0xffffffff)
        z 4294967295(0xffffffff)
      FBarrier Max Size: 32
*** Done ***
```

Kernel version

Linux 6.0.5-native_amd-xanmod1-1 #1 SMP PREEMPT_DYNAMIC Sat, 29 Oct 2022 07:39:20 +0000 x86_64 GNU/Linux

fxkamd commented 2 years ago

The test tries to allocate a maximum amount of system memory for GPU access. It looks like it ends up invoking the OOM killer. The log snippet in your report only shows it killing kfdtest. That should not have killed your whole login session though. Something else must have broken due to the memory pressure.
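As a rough cross-check, the figures in the kernel log can be put into proportion with a few lines of arithmetic. This is only a sketch: the values are copied from the OOM report, and the 4 KiB page size is an assumption (the x86_64 default) rather than something the log states.

```python
import errno
import os

PAGE_KB = 4  # assumption: 4 KiB pages (x86_64 default), not stated in the log

# Values copied from the OOM report in the kernel log.
ram_pages = 8369380      # "8369380 pages RAM"
anon_rss_kb = 29386608   # "anon-rss:29386608kB" of the killed kfdtest process
swap_total_kb = 0        # "Total swap = 0kB"


def kb_to_gib(kb):
    """Convert kB (as printed by the kernel, i.e. KiB) to GiB."""
    return kb / (1024 * 1024)


ram_kb = ram_pages * PAGE_KB
share = anon_rss_kb / ram_kb

print(f"RAM:     {kb_to_gib(ram_kb):.1f} GiB")
print(f"kfdtest: {kb_to_gib(anon_rss_kb):.1f} GiB resident ({share:.0%} of RAM)")
print(f"swap:    {swap_total_kb} kB")

# The "-14" in "Failed to get user pages: -14" is a negated errno value.
err = 14
print(f"errno {err}: {errno.errorcode[err]} ({os.strerror(err)})")
```

With no swap configured, a single process holding that share of RAM as anonymous memory leaves very little for the rest of the session, which fits the OOM killer stepping in.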

KFD limits memory usage by ROCm applications to try and prevent putting so much memory pressure on the system that the OOM killer has to step in. However, we do want applications to be able to use most system memory. So this limit is quite high. Most of the time we test on headless compute-servers. Your situation on a workstation with a display server and other applications running is probably quite different.

It may be worth spending more time to understand the problem. Ideally, the OOM killer should only kill kfdtest and leave the rest of the system running and responsive.

In the meantime, you can blacklist the problematic test in kfdtest.exclude so it doesn't crash your system every time you run the tests.

tpkessler commented 2 years ago

> Most of the time we test on headless compute-servers.

I see. That may very well explain the failure of certain tests on my machine. In case it matters: I'm running sway on Wayland. EDIT: It also crashes when run without a display server.

> you can blacklist the problematic test

I did so (by passing them through the -e flag). Besides KFDMemoryTest.LargestSysBufferTest, KFDMemoryTest.BigSysBufferStressTest and KFDQMTest.CreateQueueStressSingleThreaded also crash the system: the display manager is killed and I have to log in again.
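For anyone who wants to skip the same tests when calling the kfdtest binary directly, the exclusion can be expressed as a Google Test negative filter, which is the form the test log prints in its "Google Test filter" line. The sketch below is an illustration only: `build_gtest_filter`, `run_kfdtest` and the `./kfdtest` path are hypothetical names, and it bypasses the wrapper script, so the default exclusions shown in the log would have to be added to the list by hand.

```python
import subprocess

# The three tests reported in this thread as crashing the session.
CRASHING_TESTS = [
    "KFDMemoryTest.LargestSysBufferTest",
    "KFDMemoryTest.BigSysBufferStressTest",
    "KFDQMTest.CreateQueueStressSingleThreaded",
]


def build_gtest_filter(excluded):
    """Return a --gtest_filter argument that skips the given tests.

    In Google Test filter syntax a leading '-' starts the list of
    excluded patterns, and patterns are separated by ':'.
    """
    if not excluded:
        return "--gtest_filter=*"
    return "--gtest_filter=-" + ":".join(excluded)


def run_kfdtest(binary="./kfdtest", excluded=CRASHING_TESTS):
    """Run kfdtest with the exclusions applied.

    'binary' is a hypothetical path; point it at your own build directory.
    """
    return subprocess.run([binary, build_gtest_filter(excluded)], check=False)


if __name__ == "__main__":
    # Only print the argument here; call run_kfdtest() to execute the tests.
    print(build_gtest_filter(CRASHING_TESTS))
```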

One performance test in KFDQMTest.BasicCuMaskingEven was outside the 15% margin. In total, 115 tests passed on my system and one failed.

fxkamd commented 1 year ago

If CreateQueueStressSingleThreaded causes a crash, the problem is probably that graphics command submissions are timing out because compute is causing too much stress. CreateQueueStressSingleThreaded doesn't use a lot of the compute units, its memory usage shouldn't be extreme, and it shouldn't affect the execution of the graphics queue directly. It does cause lots of TLB and cache flushes and maybe DMA engine load from page table management and memory initialization.
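One way to look for such timeouts is to filter the kernel log for amdgpu messages after a crash. The helper below is a minimal sketch: `amdgpu_lines` is a hypothetical name, and the keyword pattern is a guess at what relevant messages might contain, not a list of actual amdgpu message strings.

```python
import re


def amdgpu_lines(log_lines, pattern=r"timeout|timed out|reset|fail"):
    """Return kernel log lines that mention amdgpu and match `pattern`.

    The default pattern is an assumption; widen or narrow it as needed.
    """
    wanted = re.compile(pattern, re.IGNORECASE)
    return [
        line.rstrip("\n")
        for line in log_lines
        if "amdgpu" in line and wanted.search(line)
    ]


# Example with two lines taken from the kernel log in this report.
sample = [
    "amdgpu: init_user_pages: Failed to get user pages: -14\n",
    "Out of memory: Killed process 47768 (kfdtest)\n",
]
for hit in amdgpu_lines(sample):
    print(hit)
```

In practice the input would be the lines of `journalctl -k` (or `dmesg`) output captured right after the session dies.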

tpkessler commented 1 year ago

Thanks for your detailed explanation! So you think it's not HSA-related but caused by amdgpu?

fxkamd commented 1 year ago

KFD is technically part of amdgpu.ko. KFD shares the GPU compute resources and VRAM with graphics. So it is possible that using the GPU for compute affects graphics usage negatively. So if you see crashes or hangs in graphics applications while running kfdtest, I would say that HSA is causing the problems.

tpkessler commented 1 year ago

How can I help you to fix these problems? Do you need better logs?

schung-amd commented 2 months ago

Hi @tpkessler, sorry for the delay. Are you still experiencing this issue?

schung-amd commented 1 month ago

Closing this for now. Feel free to comment if you are still experiencing this issue or want further guidance, and we can reopen it.