ROCm / ROCm-OpenCL-Runtime

ROCm OpenCL Runtime

ROCclr-rocm-5.4.3/os/os_posix.cpp:305: static void amd::Os::currentStackInfo(unsigned char**, size_t*): Assertion `Os::currentStackPtr() >= *base - *size && Os::currentStackPtr() < *base && "just checking"' failed #158

Open darkbasic opened 1 year ago

darkbasic commented 1 year ago
niko@talos2 ~ $ clinfo
clinfo: /var/tmp/portage/dev-libs/rocm-opencl-runtime-5.4.3/work/ROCclr-rocm-5.4.3/os/os_posix.cpp:305: static void amd::Os::currentStackInfo(unsigned char**, size_t*): Assertion `Os::currentStackPtr() >= *base - *size && Os::currentStackPtr() < *base && "just checking"' failed.
Aborted (core dumped)

I'm on Gentoo Linux ppc64le (4K page size) using linux-6.1.12. GPU is AMD RX 570 (mesa git master). LLVM is 15.0.7.

rocm-opencl-runtime-5.4.3 compiles fine but as soon as I run clinfo it crashes.

The core dump is of little use, since gdb cannot resolve any of the libc frames:

Mon 2023-02-20 15:01:01 CET   212914 1000 1000 SIGABRT present  /usr/bin/clinfo                                                                              1.5M
talos2 ~ # coredumpctl gdb 212914
           PID: 212914 (clinfo)
           UID: 1000 (niko)
           GID: 1000 (niko)
        Signal: 6 (ABRT)
     Timestamp: Mon 2023-02-20 15:01:01 CET (46s ago)
  Command Line: clinfo
    Executable: /usr/bin/clinfo
 Control Group: /user.slice/user-1000.slice/user@1000.service/session.slice/vte-spawn-4515b441-519b-4357-9405-43fc61d5f6db.scope
          Unit: user@1000.service
     User Unit: vte-spawn-4515b441-519b-4357-9405-43fc61d5f6db.scope
         Slice: user-1000.slice
     Owner UID: 1000 (niko)
       Boot ID: 0dca6c1f75ea46d7b02761482c0ec1d6
    Machine ID: b3e834569b8ff461391f5ac061feb773
      Hostname: talos2
       Storage: /var/lib/systemd/coredump/core.clinfo.1000.0dca6c1f75ea46d7b02761482c0ec1d6.212914.1676901661000000.zst (present)
  Size on Disk: 1.5M
       Message: Process 212914 (clinfo) of user 1000 dumped core.

GNU gdb (Gentoo 12.1 vanilla) 12.1
Copyright (C) 2022 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "powerpc64le-unknown-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://bugs.gentoo.org/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /usr/bin/clinfo...
(No debugging symbols found in /usr/bin/clinfo)
[New LWP 212914]
[New LWP 212918]
[New LWP 212915]
[New LWP 212916]
[New LWP 212917]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib64/libthread_db.so.1".
Core was generated by `clinfo '.
Program terminated with signal SIGABRT, Aborted.
#0  0x00003fff8c7e603c in ?? () from /usr/lib64/libc.so.6
[Current thread is 1 (Thread 0x3fff8ca58020 (LWP 212914))]
(gdb) info threads
  Id   Target Id                          Frame 
* 1    Thread 0x3fff8ca58020 (LWP 212914) 0x00003fff8c7e603c in ?? () from /usr/lib64/libc.so.6
  2    Thread 0x3fff7abab120 (LWP 212918) 0x00003fff8c7ddf84 in ?? () from /usr/lib64/libc.so.6
  3    Thread 0x3fff7c4ef120 (LWP 212915) 0x00003fff8c7ddf84 in ?? () from /usr/lib64/libc.so.6
  4    Thread 0x3fff7bbad120 (LWP 212916) 0x00003fff8c7ddf84 in ?? () from /usr/lib64/libc.so.6
  5    Thread 0x3fff7b3ac120 (LWP 212917) 0x00003fff8c7ddf84 in ?? () from /usr/lib64/libc.so.6
(gdb) thread 1
[Switching to thread 1 (Thread 0x3fff8ca58020 (LWP 212914))]
#0  0x00003fff8c7e603c in ?? () from /usr/lib64/libc.so.6
(gdb) thread 2
[Switching to thread 2 (Thread 0x3fff7abab120 (LWP 212918))]
#0  0x00003fff8c7ddf84 in ?? () from /usr/lib64/libc.so.6
(gdb) thread 3
[Switching to thread 3 (Thread 0x3fff7c4ef120 (LWP 212915))]
#0  0x00003fff8c7ddf84 in ?? () from /usr/lib64/libc.so.6
(gdb) thread 4
[Switching to thread 4 (Thread 0x3fff7bbad120 (LWP 212916))]
#0  0x00003fff8c7ddf84 in ?? () from /usr/lib64/libc.so.6
(gdb) thread 5
[Switching to thread 5 (Thread 0x3fff7b3ac120 (LWP 212917))]
#0  0x00003fff8c7ddf84 in ?? () from /usr/lib64/libc.so.6
(gdb) quit

I'm using dev-libs/rocm-opencl-runtime-5.4.3, with dev-libs/rocr-runtime, dev-libs/rocm-comgr, dev-libs/rocm-device-libs, dev-util/rocm-cmake and dev-libs/roct-thunk-interface all at 5.4.3 as well.

I've compiled dev-libs/rocm-opencl-runtime and dev-libs/rocr-runtime with debug symbols:

FEATURES="${FEATURES} nostrip"
CFLAGS="${CFLAGS} -ggdb3 -Wall"
CXXFLAGS="${CFLAGS}"

and I've also enabled the debug USE flag, which turns on assertions and other debug code paths.

I've also tried ROCm-OpenCL-Runtime from git master, but it fails with the very same assertion at runtime.
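To narrow down whether the failing condition is specific to ROCclr, the same kind of range check can be tried in isolation. The following is a minimal standalone sketch, not ROCclr's code: it assumes glibc's `pthread_getattr_np` extension is available, it approximates the stack pointer with the address of a local variable, and the names `StackInfo`, `queryStackInfo` and `stackPointerInRange` are hypothetical.

```cpp
// Standalone sketch, NOT taken from ROCclr: ask glibc for the calling
// thread's stack bounds and check that a local variable lies inside them.
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // pthread_getattr_np is a glibc extension
#endif
#include <pthread.h>

#include <cassert>
#include <cstddef>

// Hypothetical helper type: 'base' is the highest stack address,
// assuming the stack grows towards lower addresses.
struct StackInfo {
  unsigned char* base;
  size_t size;
};

// Fills 'info' with the bounds reported for the calling thread.
// Returns false if glibc cannot provide them.
bool queryStackInfo(StackInfo* info) {
  pthread_attr_t attr;
  if (pthread_getattr_np(pthread_self(), &attr) != 0) return false;

  void* lowest = nullptr;
  size_t size = 0;
  const int rc = pthread_attr_getstack(&attr, &lowest, &size);
  pthread_attr_destroy(&attr);
  if (rc != 0) return false;

  // pthread_attr_getstack reports the lowest address; convert to the top.
  info->base = static_cast<unsigned char*>(lowest) + size;
  info->size = size;
  return true;
}

// Same shape as the condition in the failing assertion, with the address
// of a local variable standing in for the real stack pointer.
bool stackPointerInRange(const StackInfo& info) {
  unsigned char marker = 0;
  unsigned char* sp = &marker;
  return sp >= info.base - info.size && sp < info.base;
}
```

Calling `queryStackInfo` and then `stackPointerInRange` from `main` on an affected machine, and printing `base` and `size`, would show whether the bounds reported for the thread already disagree with where its stack actually is.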

darkbasic commented 1 year ago

I noticed that compiling dev-libs/rocm-opencl-runtime-5.4.3 with the test USE flag fails:

[122/224] /usr/bin/powerpc64le-unknown-linux-gnu-g++ -DCL_TARGET_OPENCL_VERSION=220 -DEMU_ENV=1 -DUSE_OPENGL=1 -Doclperf_EXPORTS -I/var/tmp/portage/dev-libs/rocm-opencl-runtime-5.4.3/work/ROCm-OpenCL-Runtime-rocm-5.4.3/khronos/headers/opencl2.2 -I/var/tmp/portage/dev-libs/rocm-opencl-runtime-5.4.3/work/ROCm-OpenCL-Runtime-rocm-5.4.3/tests/ocltst/include -I/var/tmp/portage/dev-libs/rocm-opencl-runtime-5.4.3/work/ROCm-OpenCL-Runtime-rocm-5.4.3/tests/ocltst/module/common -I/var/tmp/portage/dev-libs/rocm-opencl-runtime-5.4.3/work/ROCm-OpenCL-Runtime-rocm-5.4.3/tests/ocltst/module/include -I/var/tmp/portage/dev-libs/rocm-opencl-runtime-5.4.3/work/ROCm-OpenCL-Runtime-rocm-5.4.3/amdocl  -O2 -pipe -mcpu=power9 -mtune=power9 -ggdb3 -Wall -fPIC -std=c++14 -MD -MT tests/ocltst/module/perf/CMakeFiles/oclperf.dir/OCLPerfKernelThroughput.cpp.o -MF tests/ocltst/module/perf/CMakeFiles/oclperf.dir/OCLPerfKernelThroughput.cpp.o.d -o tests/ocltst/module/perf/CMakeFiles/oclperf.dir/OCLPerfKernelThroughput.cpp.o -c /var/tmp/portage/dev-libs/rocm-opencl-runtime-5.4.3/work/ROCm-OpenCL-Runtime-rocm-5.4.3/tests/ocltst/module/perf/OCLPerfKernelThroughput.cpp
FAILED: tests/ocltst/module/perf/CMakeFiles/oclperf.dir/OCLPerfKernelThroughput.cpp.o 
/usr/bin/powerpc64le-unknown-linux-gnu-g++ -DCL_TARGET_OPENCL_VERSION=220 -DEMU_ENV=1 -DUSE_OPENGL=1 -Doclperf_EXPORTS -I/var/tmp/portage/dev-libs/rocm-opencl-runtime-5.4.3/work/ROCm-OpenCL-Runtime-rocm-5.4.3/khronos/headers/opencl2.2 -I/var/tmp/portage/dev-libs/rocm-opencl-runtime-5.4.3/work/ROCm-OpenCL-Runtime-rocm-5.4.3/tests/ocltst/include -I/var/tmp/portage/dev-libs/rocm-opencl-runtime-5.4.3/work/ROCm-OpenCL-Runtime-rocm-5.4.3/tests/ocltst/module/common -I/var/tmp/portage/dev-libs/rocm-opencl-runtime-5.4.3/work/ROCm-OpenCL-Runtime-rocm-5.4.3/tests/ocltst/module/include -I/var/tmp/portage/dev-libs/rocm-opencl-runtime-5.4.3/work/ROCm-OpenCL-Runtime-rocm-5.4.3/amdocl  -O2 -pipe -mcpu=power9 -mtune=power9 -ggdb3 -Wall -fPIC -std=c++14 -MD -MT tests/ocltst/module/perf/CMakeFiles/oclperf.dir/OCLPerfKernelThroughput.cpp.o -MF tests/ocltst/module/perf/CMakeFiles/oclperf.dir/OCLPerfKernelThroughput.cpp.o.d -o tests/ocltst/module/perf/CMakeFiles/oclperf.dir/OCLPerfKernelThroughput.cpp.o -c /var/tmp/portage/dev-libs/rocm-opencl-runtime-5.4.3/work/ROCm-OpenCL-Runtime-rocm-5.4.3/tests/ocltst/module/perf/OCLPerfKernelThroughput.cpp
In file included from /var/tmp/portage/dev-libs/rocm-opencl-runtime-5.4.3/work/ROCm-OpenCL-Runtime-rocm-5.4.3/tests/ocltst/module/perf/OCLPerfKernelThroughput.cpp:21:
/var/tmp/portage/dev-libs/rocm-opencl-runtime-5.4.3/work/ROCm-OpenCL-Runtime-rocm-5.4.3/tests/ocltst/module/perf/OCLPerfKernelThroughput.h:48:16: error: typedef ‘CPUKernel’ is initialized (use ‘decltype’ instead)
   48 | typedef void (*CPUKernel)(__m128 *, __m128 *, unsigned int);
      |                ^~~~~~~~~
/var/tmp/portage/dev-libs/rocm-opencl-runtime-5.4.3/work/ROCm-OpenCL-Runtime-rocm-5.4.3/tests/ocltst/module/perf/OCLPerfKernelThroughput.h:48:27: error: ‘__m128’ was not declared in this scope; did you mean ‘__ibm128’?
   48 | typedef void (*CPUKernel)(__m128 *, __m128 *, unsigned int);
      |                           ^~~~~~
      |                           __ibm128
/var/tmp/portage/dev-libs/rocm-opencl-runtime-5.4.3/work/ROCm-OpenCL-Runtime-rocm-5.4.3/tests/ocltst/module/perf/OCLPerfKernelThroughput.h:48:35: error: expected primary-expression before ‘,’ token
   48 | typedef void (*CPUKernel)(__m128 *, __m128 *, unsigned int);
      |                                   ^
/var/tmp/portage/dev-libs/rocm-opencl-runtime-5.4.3/work/ROCm-OpenCL-Runtime-rocm-5.4.3/tests/ocltst/module/perf/OCLPerfKernelThroughput.h:48:37: error: ‘__m128’ was not declared in this scope; did you mean ‘__ibm128’?
   48 | typedef void (*CPUKernel)(__m128 *, __m128 *, unsigned int);
      |                                     ^~~~~~
      |                                     __ibm128
/var/tmp/portage/dev-libs/rocm-opencl-runtime-5.4.3/work/ROCm-OpenCL-Runtime-rocm-5.4.3/tests/ocltst/module/perf/OCLPerfKernelThroughput.h:48:45: error: expected primary-expression before ‘,’ token
   48 | typedef void (*CPUKernel)(__m128 *, __m128 *, unsigned int);
      |                                             ^
/var/tmp/portage/dev-libs/rocm-opencl-runtime-5.4.3/work/ROCm-OpenCL-Runtime-rocm-5.4.3/tests/ocltst/module/perf/OCLPerfKernelThroughput.h:48:47: error: expected primary-expression before ‘unsigned’
   48 | typedef void (*CPUKernel)(__m128 *, __m128 *, unsigned int);
      |                                               ^~~~~~~~
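The errors above all come from `__m128`, an x86 SSE type that is not declared on ppc64le. One way such a declaration could be made portable is sketched below. This is an illustration under assumptions, not the content of any patch in this thread: the `Vec128` name, the fallback struct and the `copyKernel` example are hypothetical.

```cpp
// Hypothetical portability sketch for a 128-bit vector kernel signature.
#include <cassert>
#include <cstddef>
#include <cstring>

#if defined(__SSE__) || defined(_M_X64)
#include <xmmintrin.h>
typedef __m128 Vec128;  // native SSE type where it exists
#else
// Fallback for targets without SSE (e.g. ppc64le): same size and alignment.
struct alignas(16) Vec128 {
  float v[4];
};
#endif

static_assert(sizeof(Vec128) == 16, "Vec128 must be 128 bits wide");

// Same shape as the typedef in OCLPerfKernelThroughput.h, with the
// x86-only type replaced by the portable alias.
typedef void (*CPUKernel)(Vec128*, Vec128*, unsigned int);

// Trivial example kernel: copies 'count' vectors from src to dst.
void copyKernel(Vec128* dst, Vec128* src, unsigned int count) {
  std::memcpy(dst, src, sizeof(Vec128) * count);
}
```

With this, `CPUKernel k = copyKernel;` compiles on both x86 and non-x86 targets. Any kernel that uses SSE intrinsics internally would still need its own guarded implementation.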

darkbasic commented 1 year ago

rocm-opencl-runtime-tests-ppc64.patch.txt

The attached patch fixes compilation of the tests, but they then fail at runtime with the same assertion as clinfo:

PORTAGE_USERNAME=niko PORTAGE_GRPNAME=niko OCLGL_DISPLAY=${DISPLAY} OCLGL_XAUTHORITY=${XAUTHORITY} FEATURES=test USE=test emerge -v --oneshot rocm-opencl-runtime

>>> Test phase: dev-libs/rocm-opencl-runtime-5.4.3

 * Running oclgl test under DISPLAY :0 ...
OpenGL vendor string: AMD
Built for Emulation Environment
ocltst: /var/tmp/portage/dev-libs/rocm-opencl-runtime-5.4.3/work/ROCclr-rocm-5.4.3/os/os_posix.cpp:305: static void amd::Os::currentStackInfo(unsigned char**, size_t*): Assertion `Os::currentStackPtr() >= *base - *size && Os::currentStackPtr() < *base && "just checking"' failed.
/var/tmp/portage/dev-libs/rocm-opencl-runtime-5.4.3/temp/environment: line 2213:    38 Aborted                 (core dumped) ./ocltst -m $(realpath liboclgl.so) -A ogl.exclude

Ashark commented 1 year ago

I get this error when running DaVinci Resolve (the assertion is on the third-to-last line):

[andrey@unihost DaVinci Resolve]$ LC_ALL=C ROC_ENABLE_PRE_VEGA=1 wine Resolve.exe
0084:fixme:hid:handle_IRP_MN_QUERY_ID Unhandled type 00000005
0084:fixme:hid:handle_IRP_MN_QUERY_ID Unhandled type 00000005
0084:fixme:hid:handle_IRP_MN_QUERY_ID Unhandled type 00000005
0084:fixme:hid:handle_IRP_MN_QUERY_ID Unhandled type 00000005
0084:fixme:wineusb:query_id Unhandled ID query type 0x5.
010c:fixme:actctx:parse_depend_manifests Could not find dependent assembly L"SMDK-VC140-x64-4_21_0" (4.21.0.159)
010c:err:winediag:load_odbc failed to open library "libodbc.so": libodbc.so: cannot open shared object file: No such file or directory
010c:fixme:reg:NtNotifyChangeMultipleKeys Unimplemented optional parameter
010c:fixme:reg:NtNotifyChangeMultipleKeys Unimplemented optional parameter
010c:fixme:reg:NtNotifyChangeMultipleKeys Unimplemented optional parameter
010c:fixme:reg:NtNotifyChangeMultipleKeys Unimplemented optional parameter
0110:fixme:combase:RoActivateInstance (00007FFFFFCEFB70, 00007FFFFFCEFA78): semi-stub
0110:fixme:combase:RoGetActivationFactory (L"Windows.Management.Deployment.PackageManager", {00000035-0000-0000-c000-000000000046}, 00007FFFFFCEF978): semi-stub
0110:err:combase:RoGetActivationFactory Failed to find library for L"Windows.Management.Deployment.PackageManager"
010c:fixme:msvcp:_Locinfo__Locinfo_ctor_cat_cstr (000000000021FC20 1 C) semi-stub
010c:fixme:msvcp:_Locinfo__Locinfo_ctor_cat_cstr (000000000021FC80 1 C) semi-stub
ActCCMessage Already in Table: Code= c005, Mode= 13, Level=  1, CmdKey= -1, Option= 0
ActCCMessage Already in Table: Code= c006, Mode= 13, Level=  1, CmdKey= -1, Option= 0
ActCCMessage Already in Table: Code= c007, Mode= 13, Level=  1, CmdKey= -1, Option= 0
ActCCMessage Already in Table: Code= 2282, Mode=  0, Level=  0, CmdKey= 8, Option= 0
PnlMsgActionStringAdapter Already in Table: Code= 615e, Mode=  0, Level=  0, CmdKey= -1, Option= 0
010c:fixme:msvcp:_Locinfo__Locinfo_ctor_cat_cstr (000000000021F840 1 C) semi-stub
010c:fixme:msvcp:_Locinfo__Locinfo_ctor_cat_cstr (000000000021F480 1 C) semi-stub
18.5.0b.0016 Windows/MSVC x86_64
Main thread starts: 0000010C
QCoreApplication::applicationDirPath: Please instantiate the QApplication object first
[0x0000010c] | Undefined            | INFO  | 2023-05-01 16:42:56,948 | --------------------------------------------------------------------------------
010c:fixme:file:NtLockFile I/O completion on lock not implemented yet
[0x0000010c] | Undefined            | INFO  | 2023-05-01 16:42:56,948 | Loaded log config from C:\users\andrey\AppData\Roaming\Blackmagic Design\DaVinci Resolve\Preferences\log-conf.xml
[0x0000010c] | Undefined            | INFO  | 2023-05-01 16:42:56,948 | --------------------------------------------------------------------------------
010c:fixme:msvcp:_Locinfo__Locinfo_ctor_cat_cstr (000000000021BCF0 1 C) semi-stub
010c:fixme:msvcp:_Locinfo__Locinfo_ctor_cat_cstr (000000000021BCD0 1 C) semi-stub
010c:fixme:ntdll:NtQuerySystemInformation info_class SYSTEM_PERFORMANCE_INFORMATION
m Files\Blackmagic Design\DaVinci Resolve\Resolve.exe: /usr/src/debug/rocm-opencl-runtime/ROCclr-rocm-5.4.3/os/os_posix.cpp:305: static void amd::Os::currentStackInfo(unsigned char**, size_t*): Assertion `Os::currentStackPtr() >= *base - *size && Os::currentStackPtr() < *base && "just checking"' failed.
010c:err:seh:call_stack_handlers invalid frame 000000000011F270 (0000000000122000-0000000000220000)
010c:err:seh:NtRaiseException Exception frame is not in stack limits => unable to dispatch exception.

leavelet commented 1 year ago

Same problem on loongarch64.