ROCm / clr

[Issue]: clr-rocm-6.0.2/rocclr/os/os_posix.cpp:321: static void amd::Os::currentStackInfo(unsigned char**, size_t*): Assertion `Os::currentStackPtr() >= *base - *size && Os::currentStackPtr() < *base && "just checking"' failed. #61

Open: darkbasic opened this issue 4 months ago

darkbasic commented 4 months ago

Problem Description

I'm on Gentoo Linux ppc64le (4K page size) with linux-6.7.6. The GPU is an AMD RX 570 (mesa 24.0.1) and LLVM is 17.0.6. I managed to build rocm-opencl-runtime-6.0.2 successfully, but only with the -DNO_WARN_X86_INTRINSICS compile flag; without it the build fails. Full build log without -DNO_WARN_X86_INTRINSICS: rocm-opencl-runtime-6.0.2.build.log

I've also been carrying this patch since v5, which used to fix the tests:

--- ./opencl/tests/ocltst/module/perf/OCLPerfKernelThroughput.h.orig    2024-02-26 09:53:53.925778934 +0100
+++ ./opencl/tests/ocltst/module/perf/OCLPerfKernelThroughput.h 2024-02-26 09:54:09.165774504 +0100
@@ -45,7 +45,7 @@
 #define UNSIGNED_LARGE_INT unsigned long long
 #define MAX_LOOP_ITER 10
 typedef cl_float4 float4;
-typedef void (*CPUKernel)(__m128 *, __m128 *, unsigned int);
+typedef void (*CPUKernel)(__ibm128 *, __ibm128 *, unsigned int);

 class OCLPerfKernelThroughput : public OCLTestImp {
  public:
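
As a point of comparison with the patch above, here is a minimal sketch of a typedef that stays a four-float vector on both x86 and POWER. It assumes GCC or Clang (it relies on their `vector_size` extension), and `HostFloat4` and `scaleByTwo` are hypothetical names used only for illustration; they are not identifiers from the clr sources. In GCC on POWER, `__ibm128` is a scalar extended-precision type rather than a packed vector, so a vector typedef may be closer to what `__m128` expresses.

```cpp
#include <cassert>

// Hypothetical portability shim; HostFloat4 is an illustrative name.
#if defined(__x86_64__) || defined(__i386__)
#include <xmmintrin.h>
typedef __m128 HostFloat4;  // native SSE type on x86
#else
// GCC/Clang vector extension: four packed floats, 16 bytes in total.
typedef float HostFloat4 __attribute__((vector_size(16)));
#endif

static_assert(sizeof(HostFloat4) == 16, "HostFloat4 must hold four floats");

// Same shape as the CPUKernel typedef in the patched header.
typedef void (*CPUKernel)(HostFloat4*, HostFloat4*, unsigned int);

// Example kernel matching that signature: dst[i] = 2 * src[i], lane-wise.
void scaleByTwo(HostFloat4* dst, HostFloat4* src, unsigned int count) {
  assert(dst != nullptr && src != nullptr);
  for (unsigned int i = 0; i < count; ++i) {
    dst[i] = src[i] + src[i];  // vector types support element-wise '+'
  }
}
```

Whether this is suitable for the ocltst perf test depends on how the test fills and reads the buffers, which this sketch does not cover.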

Unfortunately, both clinfo and rocminfo still fail at runtime, just as they did with 5.4.3:

talos2 ~ # clinfo 
clinfo: /var/tmp/portage/dev-libs/rocm-opencl-runtime-6.0.2/work/clr-rocm-6.0.2/rocclr/os/os_posix.cpp:321: static void amd::Os::currentStackInfo(unsigned char**, size_t*): Assertion `Os::currentStackPtr() >= *base - *size && Os::currentStackPtr() < *base && "just checking"' failed.
Aborted (core dumped)
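
For context, the failed assertion checks that the current stack pointer lies in the half-open range [base - size, base). The sketch below shows one common way to obtain such bounds on glibc and run the same kind of check. It is an assumption-laden illustration, not the rocclr implementation: `StackBounds`, `queryStackBounds` and `currentFrameInsideStack` are hypothetical names, and the address of a local variable stands in for the real stack pointer.

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // pthread_getattr_np is a GNU extension
#endif
#include <pthread.h>
#include <cassert>
#include <cstddef>

struct StackBounds {
  unsigned char* low;   // lowest address of the stack region (base - size)
  unsigned char* high;  // one past the highest address (the "base")
};

// Query the calling thread's stack bounds; returns false on any failure.
bool queryStackBounds(StackBounds* out) {
  assert(out != nullptr);
  pthread_attr_t attr;
  if (pthread_getattr_np(pthread_self(), &attr) != 0) return false;
  void* addr = nullptr;
  size_t size = 0;
  // pthread_attr_getstack reports the LOWEST address plus the size.
  int rc = pthread_attr_getstack(&attr, &addr, &size);
  pthread_attr_destroy(&attr);
  if (rc != 0) return false;
  out->low = static_cast<unsigned char*>(addr);
  out->high = out->low + size;
  return true;
}

// Approximates the assertion: is an address in the current frame
// inside [low, high)?
bool currentFrameInsideStack() {
  StackBounds bounds;
  if (!queryStackBounds(&bounds)) return false;
  unsigned char marker = 0;      // lives in this function's frame
  unsigned char* sp = &marker;   // stand-in for the stack pointer
  return sp >= bounds.low && sp < bounds.high;
}
```

Calling `currentFrameInsideStack()` from the main thread and from a worker thread on the ppc64le machine could help narrow down whether the reported bounds or the stack-pointer read is off, though that is only a guess at where to look.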

clinfo: /var/tmp/portage/dev-libs/rocm-opencl-runtime-6.0.2/work/clr-rocm-6.0.2/rocclr/os/os_posix.cpp:321: static void amd::Os::currentStackInfo(unsigned char**, size_t*): Assertion `Os::currentStackPtr() >= *base - *size && Os::currentStackPtr() < *base && "just checking"' failed.

Program received signal SIGABRT, Aborted.
0x00003ffff7ca819c in ?? () from /usr/lib64/libc.so.6
(gdb) backtrace
#0  0x00003ffff7ca819c in ?? () from /usr/lib64/libc.so.6
#1  0x00003ffff7c4525c in raise () from /usr/lib64/libc.so.6
#2  0x00003ffff7c2543c in abort () from /usr/lib64/libc.so.6
#3  0x00003ffff7c39398 in ?? () from /usr/lib64/libc.so.6
#4  0x00003ffff7c39444 in __assert_fail () from /usr/lib64/libc.so.6
#5  0x00003ffff78cd504 in amd::Os::currentStackInfo (base=base@entry=0x100073630, size=size@entry=0x100073638) at /var/tmp/portage/dev-libs/rocm-opencl-runtime-6.0.2/work/clr-rocm-6.0.2/rocclr/os/os_posix.cpp:321
#6  0x00003ffff78fbd98 in amd::HostThread::HostThread (this=0x1000735d0) at /var/tmp/portage/dev-libs/rocm-opencl-runtime-6.0.2/work/clr-rocm-6.0.2/rocclr/thread/thread.cpp:34
#7  0x00003ffff78fbe8c in amd::Thread::init () at /var/tmp/portage/dev-libs/rocm-opencl-runtime-6.0.2/work/clr-rocm-6.0.2/rocclr/thread/thread.cpp:170
#8  0x00003ffff78ccae8 in amd::Os::init () at /var/tmp/portage/dev-libs/rocm-opencl-runtime-6.0.2/work/clr-rocm-6.0.2/rocclr/os/os_posix.cpp:170
#9  amd::Os::init () at /var/tmp/portage/dev-libs/rocm-opencl-runtime-6.0.2/work/clr-rocm-6.0.2/rocclr/os/os_posix.cpp:155
#10 0x00003ffff783d0b8 in amd::init () at /var/tmp/portage/dev-libs/rocm-opencl-runtime-6.0.2/work/clr-rocm-6.0.2/rocclr/os/os_posix.cpp:136
#11 0x00003ffff7fa5dfc in ?? () from /lib64/ld64.so.2
#12 0x00003ffff7fb9f18 in ?? () from /lib64/ld64.so.2
#13 0x00003ffff7f9f420 in _dl_catch_exception () from /lib64/ld64.so.2
#14 0x00003ffff7fba0d8 in ?? () from /lib64/ld64.so.2
#15 0x00003ffff7f9f37c in _dl_catch_exception () from /lib64/ld64.so.2
#16 0x00003ffff7fbb97c in ?? () from /lib64/ld64.so.2
#17 0x00003ffff7c9ed24 in ?? () from /usr/lib64/libc.so.6
#18 0x00003ffff7f9f37c in _dl_catch_exception () from /lib64/ld64.so.2
#19 0x00003ffff7f9f4fc in ?? () from /lib64/ld64.so.2
#20 0x00003ffff7c9e5f8 in ?? () from /usr/lib64/libc.so.6
#21 0x00003ffff7c9ee34 in dlopen () from /usr/lib64/libc.so.6
#22 0x00003ffff7f408a0 in ?? () from /usr/lib64/libOpenCL.so.1
#23 0x00003ffff7f3419c in ?? () from /usr/lib64/libOpenCL.so.1
#24 0x00003ffff7f40228 in ?? () from /usr/lib64/libOpenCL.so.1
#25 0x00003ffff7f404e4 in ?? () from /usr/lib64/libOpenCL.so.1
#26 0x00003ffff7cacf40 in ?? () from /usr/lib64/libc.so.6
#27 0x00003ffff7f40858 in ?? () from /usr/lib64/libOpenCL.so.1
#28 0x00003ffff7f34118 in ?? () from /usr/lib64/libOpenCL.so.1
#29 0x00003ffff7f36498 in clGetPlatformIDs () from /usr/lib64/libOpenCL.so.1
#30 0x0000000100008b58 in ?? ()
#31 0x00003ffff7c25c2c in ?? () from /usr/lib64/libc.so.6
#32 0x00003ffff7c25e6c in __libc_start_main () from /usr/lib64/libc.so.6
#33 0x0000000000000000 in ?? ()
talos2 ~ # rocminfo 
ROCk module is loaded
Segmentation fault (core dumped)

ROCk module is loaded

Program received signal SIGSEGV, Segmentation fault.
0x00003ffff7e5840c in rocr::os::callback (info=0x3fffffffda60, size=<optimized out>, data=0x3fffffffdb40) at /var/tmp/portage/dev-libs/rocr-runtime-6.0.2/work/ROCR-Runtime-rocm-6.0.2/src/core/util/lnx/os_linux.cpp:314
warning: 314    /var/tmp/portage/dev-libs/rocr-runtime-6.0.2/work/ROCR-Runtime-rocm-6.0.2/src/core/util/lnx/os_linux.cpp: No such file or directory
(gdb) backtrace
#0  0x00003ffff7e5840c in rocr::os::callback (info=0x3fffffffda60, size=<optimized out>, data=0x3fffffffdb40) at /var/tmp/portage/dev-libs/rocr-runtime-6.0.2/work/ROCR-Runtime-rocm-6.0.2/src/core/util/lnx/os_linux.cpp:314
#1  0x00003ffff77be50c in dl_iterate_phdr () from /usr/lib64/libc.so.6
#2  0x00003ffff7e58780 in rocr::os::GetLoadedToolsLib () at /var/tmp/portage/dev-libs/rocr-runtime-6.0.2/work/ROCR-Runtime-rocm-6.0.2/src/core/util/lnx/os_linux.cpp:332
#3  0x00003ffff7ebc3a8 in rocr::core::Runtime::LoadTools (this=this@entry=0x10003f1b0) at /var/tmp/portage/dev-libs/rocr-runtime-6.0.2/work/ROCR-Runtime-rocm-6.0.2/src/core/runtime/runtime.cpp:1745
#4  0x00003ffff7ebd460 in rocr::core::Runtime::Load (this=0x10003f1b0) at /var/tmp/portage/dev-libs/rocr-runtime-6.0.2/work/ROCR-Runtime-rocm-6.0.2/src/core/runtime/runtime.cpp:1539
#5  0x00003ffff7ebd688 in rocr::core::Runtime::Acquire () at /var/tmp/portage/dev-libs/rocr-runtime-6.0.2/work/ROCR-Runtime-rocm-6.0.2/src/core/runtime/runtime.cpp:116
#6  0x00003ffff7e8e1e8 in rocr::HSA::hsa_init () at /var/tmp/portage/dev-libs/rocr-runtime-6.0.2/work/ROCR-Runtime-rocm-6.0.2/src/core/runtime/hsa.cpp:206
#7  0x00003ffff7ed42fc in hsa_init () at /var/tmp/portage/dev-libs/rocr-runtime-6.0.2/work/ROCR-Runtime-rocm-6.0.2/src/core/common/hsa_table_interface.cpp:68
#8  0x00000001000027cc in ?? ()
#9  0x00003ffff7625c2c in ?? () from /usr/lib64/libc.so.6
#10 0x00003ffff7625e6c in __libc_start_main () from /usr/lib64/libc.so.6
#11 0x0000000000000000 in ?? ()
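
The rocminfo crash happens inside a callback passed to `dl_iterate_phdr`. As a reference for how that glibc API is normally used, here is a minimal, defensive sketch that lists the loaded objects. It is not the ROCR-Runtime code and makes no claim about the cause of the segfault; `collectObjectName` and `listLoadedObjects` are hypothetical names.

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // dl_iterate_phdr is declared under _GNU_SOURCE
#endif
#include <link.h>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Invoked once per loaded object; `data` is the pointer given to
// dl_iterate_phdr by the caller.
static int collectObjectName(struct dl_phdr_info* info, size_t /*size*/,
                             void* data) {
  assert(data != nullptr);
  auto* names = static_cast<std::vector<std::string>*>(data);
  // The main program is typically reported with an empty name;
  // guard against a null pointer as well.
  const char* name = info->dlpi_name;
  bool hasName = (name != nullptr && name[0] != '\0');
  names->emplace_back(hasName ? name : "<main program>");
  return 0;  // a non-zero return value would stop the iteration
}

// Returns the names of all objects currently loaded in the process.
std::vector<std::string> listLoadedObjects() {
  std::vector<std::string> names;
  dl_iterate_phdr(collectObjectName, &names);
  return names;
}
```

Running something like this on the affected system would at least show whether plain iteration over the loaded objects works there.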

Operating System

Gentoo Linux ppc64le (4K page size)

CPU

IBM Power 9

GPU

AMD RX 570

ROCm Version

ROCm 6.0.2

cjatin commented 4 months ago

AFAIK HIP is not tested on the POWER architecture and is written with x86_64 in mind, so getting this to work might require more than just fixing compilation errors from missing intrinsics.

The GPU you have is also not supported on ROCm 6.0; see the system requirements:

https://rocm.docs.amd.com/projects/install-on-linux/en/latest/reference/system-requirements.html

darkbasic commented 4 months ago

> AFAIK HIP is not tested on the POWER architecture and is written with x86_64 in mind, so getting this to work might require more than just fixing compilation errors from missing intrinsics.

Early versions of ROCm claimed to support ppc64le. Also, Adam Tran from AMD said it should work starting from 6.0.2, which is why I re-tested it.

> The GPU you have is also not supported on ROCm 6.0

Yeah, I know, but at least the OpenCL part works on x86_64 (or at least it did the last time I tested it).

darkbasic commented 4 months ago

I've found a similar error for the RX 6900 XT on x86_64: https://github.com/Mozilla-Ocho/llamafile/issues/214

Is it possible that ROCm has regressed and the RX 570 no longer works on x86_64 either? Can someone confirm? I'm sure OpenCL used to work, but a couple of years have passed since I last tested it on x86_64.