Closed: tflink closed this issue 1 year ago
Hmm.. I'm not super familiar with this component, but it looks like hipRTC is trying to call a null function pointer in libamd_comgr.so. Is libamd_comgr.so present in the default library search path when this is run? Is it loaded, according to gdb's info sharedlibrary command when the crash occurs?
libamd_comgr.so is in /usr/lib64, which is part of the default search path on this system:
# ll /usr/lib64/libamd_comgr*
lrwxrwxrwx. 1 root root 17 May 24 18:00 /usr/lib64/libamd_comgr.so -> libamd_comgr.so.2
lrwxrwxrwx. 1 root root 21 May 24 18:00 /usr/lib64/libamd_comgr.so.2 -> libamd_comgr.so.2.5.0
-rwxr-xr-x. 1 root root 9484672 May 24 18:00 /usr/lib64/libamd_comgr.so.2.5.0
Is it loaded, according to gdb's info sharedlibrary command when the crash occurs?
(gdb) info sharedlibrary
From                To                  Syms Read   Shared Object Library
0x00007ffff7fcb000  0x00007ffff7ff0eb5  Yes         /lib64/ld-linux-x86-64.so.2
0x00007ffff7e78970  0x00007ffff7f8207b  Yes         /lib64/libsqlite3.so.0
0x00007ffff7d8c6a0  0x00007ffff7dea598  Yes         /lib64/libhiprtc.so.5
0x00007ffff661e230  0x00007ffff6960468  Yes         /lib64/libamdhip64.so.5
0x00007ffff62a4590  0x00007ffff63c5d22  Yes         /lib64/libstdc++.so.6
0x00007ffff7cb63d0  0x00007ffff7d2a8f8  Yes         /lib64/libm.so.6
0x00007ffff65df6d0  0x00007ffff65f9f75  Yes         /lib64/libgcc_s.so.1
0x00007ffff5e26780  0x00007ffff5f83d8d  Yes         /lib64/libc.so.6
0x00007ffff65c5630  0x00007ffff65d317b  Yes         /lib64/libz.so.1
0x00007ffff540e4d0  0x00007ffff5463f72  Yes         /lib64/libamd_comgr.so.2
0x00007ffff5017070  0x00007ffff50d2ac9  Yes         /lib64/libhsa-runtime64.so.1
0x00007ffff65b7c10  0x00007ffff65bd035  Yes         /lib64/libnuma.so.1
0x00007ffff4c4e090  0x00007ffff4e2056a  Yes         /lib64/liblldELF.so.16
0x00007ffff6591010  0x00007ffff65aa32f  Yes         /lib64/liblldCommon.so.16
0x00007ffff1367510  0x00007ffff3f33417  Yes         /lib64/libclang-cpp.so.16
0x00007fffe9ec0840  0x00007fffed568622  Yes         /lib64/libLLVM-16.so
0x00007ffff6562b10  0x00007ffff6574950  Yes         /lib64/libhsakmt.so.1
0x00007ffff65447b0  0x00007ffff6556495  Yes         /lib64/libelf.so.1
0x00007ffff652eb90  0x00007ffff6539681  Yes         /lib64/libdrm.so.2
0x00007ffff651f5b0  0x00007ffff652514a  Yes         /lib64/libffi.so.8
0x00007ffff64ea8d0  0x00007ffff6509bf4  Yes         /lib64/libedit.so.0
0x00007ffff64ba670  0x00007ffff64cc8a4  Yes         /lib64/libtinfo.so.6
0x00007ffff64a1830  0x00007ffff64a61c1  Yes         /lib64/libdrm_amdgpu.so.1
0x00007ffff6149e40  0x00007ffff61f01f2  Yes         /lib64/libzstd.so.1
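As a cross-check outside gdb, the same question can be answered from the process's memory map. This is only a minimal sketch, assuming a Linux /proc filesystem; lib_mapped is a hypothetical helper name and is not part of ROCm or gdb.

```shell
# Hypothetical helper (not part of ROCm or gdb): succeeds when the maps file
# given as $1 lists a mapped object whose path contains "/<name>", where
# <name> is $2. The maps file is normally /proc/<pid>/maps of the process.
lib_mapped() {
    maps_file=$1
    lib_name=$2
    # -F: match the name literally, so the dots in "libfoo.so" are not wildcards.
    grep -F -q -- "/$lib_name" "$maps_file"
}
```

With the crashed process still stopped in gdb, info proc prints its PID. From a second shell, calling lib_mapped with /proc/<pid>/maps and libamd_comgr.so then returns success if the library is mapped into that process.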
@cgmb FYI
@evetsso, if it helps, these are complete steps to reproduce the issue in a docker image:
$ docker run -it --rm --device=/dev/dri --device=/dev/kfd --security-opt seccomp=unconfined fedora:rawhide
dnf upgrade
dnf install rocm-hip-devel rocrand-devel sqlite-devel rocm-comgr-devel hsakmt-devel rocm-runtime-devel git vim cmake lld clang clang-libs clang-resource-filesystem clang-tools-extra llvm gdb
git clone https://github.com/ROCmSoftwarePlatform/rocFFT.git
cd rocFFT
git checkout release/rocm-rel-5.5
mkdir build
cd build
cmake .. -DCMAKE_CXX_COMPILER=hipcc -DCMAKE_C_COMPILER=hipcc -DSQLITE_USE_SYSTEM_PACKAGE=ON -DCMAKE_INSTALL_PREFIX=/usr -DCMAKE_BUILD_TYPE=Debug
VERBOSE=1 make -j16
You can then get a bit closer to the problem with:
cd /root/rocFFT/build/library/src
gdb -ex r --args ./rocfft_aot_helper "" /root/rocFFT/build/library/src/rocfft_kernel_cache.db /root/rocFFT/build/library/src/rocfft_rtc_helper gfx906
It can also help to install debug packages:
# enable the debug repo by changing 'enabled = 0' to 'enabled = 1'
vim /etc/yum.repos.d/fedora-rawhide.repo
# install the debug packages
dnf install clang-libs-debuginfo elfutils-libelf-debuginfo glibc-debuginfo hsakmt-debuginfo libdrm-debuginfo libedit-debuginfo libffi-debuginfo libgcc-debuginfo libstdc++-debuginfo libzstd-debuginfo lld-libs-debuginfo llvm-libs-debuginfo numactl-libs-debuginfo rocm-comgr-debuginfo rocm-hip-debuginfo rocm-runtime-debuginfo sqlite-libs-debuginfo zlib-debuginfo ncurses-libs-debuginfo
# fix a missing directory or gdb will complain
mkdir /usr/src/debug/rocclr-5.5.1-9.fc39.x86_64/rocclr/cmake
There's too much that has been optimized out for me to understand what is wrong and I'm not very familiar with hiprtc.
Just as a data point, Debian is using rocFFT from 5.5.1, built with HIP from 5.2.3 and LLVM 15, and has not encountered this problem. I therefore suspect this is related to a change in a low-level component like HIP, comgr, or clang.
Interesting.. I just tried @cgmb's repro steps (i.e. with the latest rawhide as of today) and get different behaviour now that ROCm 5.6 has been released:
[100%] Compile default kernels and solution-map kernels into shipped cache file
terminate called after throwing an instance of 'std::runtime_error'
what(): In file included from <built-in>:1:
In file included from /builddir/build/BUILD/clr-rocm-5.6.0/redhat-linux-build/hipamd/src/hiprtc/hip_rtc_gen/hipRTC_header.h:5:
In file included from /builddir/build/BUILD/clr-rocm-5.6.0/HIP-rocm-5.6.0/include/hip/hip_runtime.h:62:
In file included from /builddir/build/BUILD/clr-rocm-5.6.0/hipamd/include/hip/amd_detail/amd_hip_runtime.h:112:
In file included from /builddir/build/BUILD/clr-rocm-5.6.0/hipamd/include/hip/amd_detail/amd_hip_atomic.h:25:
/builddir/build/BUILD/clr-rocm-5.6.0/hipamd/include/hip/amd_detail/amd_device_functions.h:82:52: error: redefinition of '__ffsll'
__attribute__((device)) static inline unsigned int __ffsll(uint64_t input) {
^
/builddir/build/BUILD/clr-rocm-5.6.0/hipamd/include/hip/amd_detail/amd_device_functions.h:70:52: note: previous definition is here
__attribute__((device)) static inline unsigned int __ffsll(unsigned long long int input) {
^
1 error generated when compiling for gfx906.
make[2]: *** [library/src/CMakeFiles/rocfft_kernel_cache_target.dir/build.make:74: library/src/rocfft_kernel_cache.db] Aborted (core dumped)
make[1]: *** [CMakeFiles/Makefile2:512: library/src/CMakeFiles/rocfft_kernel_cache_target.dir/all] Error 2
make: *** [Makefile:156: all] Error 2
So hipRTC in ROCm 5.6 is no longer trying to follow a null function pointer at least. Again, I'm not really familiar with hipRTC, but I also don't see any obvious changes in the clr commit log between ROCm 5.5 and 5.6 that would explain this.
This new error seems to be Fedora-specific, and caused by https://src.fedoraproject.org/rpms/rocclr/blob/rawhide/f/0001-add-uint64_t-variant-for-__ffsll.patch - ideally the reason for that patch has gone away with ROCm 5.6 and it can just be reverted. But I'm not familiar enough with the original problem to be sure.
Hi @tflink, do you have any update? Has the behaviour changed for you with ROCm 5.6?
Apologies for the delay, I've been messing with how I get notifications and managed to miss a few from github.
I see the same behavior you do when I build 5.6.0. I started patching out the kernel cache during the build when I learned that's how the Debian package is handling the issue.
I rebuilt rocclr without the patch you linked to, and while the runtime_error goes away, the build just hangs after [100%] Built target rocfft, with rocfft_aot_helper and rocfft_rtc_helper taking up almost all available CPU. I let this sit in that state for about 15 minutes before killing the processes manually under the assumption that it's not going to finish. I think I saw the same behavior on RHEL 9.2 when I tried to build rocfft using the official AMD repos for the dependencies. I'll attempt to reproduce there to make sure.
If I patch out the kernel cache parts in the same way that the Debian patch does, rocfft builds and doesn't hang at the end of the build process.
Is the best way to help you reproduce issues with container images?
Also, I spoke with the author of the uint64 patch you linked to (@trixrt) and as far as we know, it's still needed for blender.
I see the same behavior you do when I build 5.6.0. I started patching out the kernel cache during the build when I learned that's how the Debian package is handling the issue.
I do not encounter this issue on Debian. rocfft 5.5.0-1 built successfully and did not include that patch. I only patched it out of the build because I had not yet figured out where the cache should be installed according to Debian policy, and it seemed wasteful to build the cache if it wasn't going to be installed.
Good to know, thanks
I'm getting the same issue with hanging at the end of the rocfft build on RHEL 9.2 using the AMD-supplied packages for deps. Am I just being impatient? I left this build alone for longer, though: rocfft_aot_helper and rocfft_rtc_helper have the CPU pretty much pegged and it's been in this state for 45 minutes now.
The end of the build does take a very long time, as that's when we're compiling all of the kernels that we want to distribute with the library, for all enabled architectures. The optional ROCFFT_BUILD_KERNEL_CACHE_PATH CMake parameter that was the subject of #430 can help somewhat, by defining a place where these kernels are persistently stored between builds so that they can be reused. But it doesn't reduce the initial amount of work that the build would want to do.
rocFFT will be perfectly functional if the cache is not built or distributed with the library. But that means that all kernels will need to be compiled at runtime, since rocFFT will not find any that are distributed with it. Creation of FFT plans will take longer, but the plans will still work.
A middle ground might be to remove architectures from AMDGPU_TARGETS. Fewer architectures in that list means fewer kernels that need to be compiled at the end of the build. rocFFT will still work on an architecture that is not in this list, because it will still compile kernels on-demand at runtime.
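The two options above could be combined at configure time roughly as follows. This is a sketch under stated assumptions, not a tested recipe: gfx906 is simply the target used in the repro earlier in this thread, the cache location is a hypothetical path, and the other flags are copied from the repro steps. Whether ROCFFT_BUILD_KERNEL_CACHE_PATH expects a directory or a file is not stated here, so check rocFFT's build documentation before relying on it.

```shell
# Assumption: gfx906 is the only architecture that needs precompiled kernels.
# Several architectures would be passed as a semicolon-separated CMake list.
targets="gfx906"

# Assumption: hypothetical location that persists between builds.
kernel_cache_path="/var/tmp/rocfft-kernel-cache"

# Compiler and sqlite flags are copied from the repro steps in this thread.
cmake_cmd="cmake .. -DCMAKE_CXX_COMPILER=hipcc -DCMAKE_C_COMPILER=hipcc"
cmake_cmd="$cmake_cmd -DSQLITE_USE_SYSTEM_PACKAGE=ON"
cmake_cmd="$cmake_cmd -DAMDGPU_TARGETS=$targets"
cmake_cmd="$cmake_cmd -DROCFFT_BUILD_KERNEL_CACHE_PATH=$kernel_cache_path"

# Print the command for review; run it from the build directory when ready.
echo "$cmake_cmd"
```

Printing the command instead of running it keeps the sketch harmless on a machine without ROCm installed. Paste the output into the build directory once the values look right.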
Yeah, I didn't kill the build on RHEL and it turns out that it took about 1.5 hours to do that last part. I'll redo the Fedora build locally and let it run. Unless something strange happens, it looks like we're going to have to deal with that patch and find some sort of fix.
After reverting that patch, rocFFT builds and works fine on this system. It turns out that the uint64_t-variant patch was not only breaking the compile but was apparently also the root cause of an issue that was keeping the library from working when compiled w/o cached kernels.
Thanks for the help with debugging this.
Tagging @Mystro256 as he has worked on packaging many of the dependencies here.
The Issue
Important note: I'm working to package ROCm in Fedora, so I'm building upon the bits we already have packaged instead of using AMD's prebuilt packages or building everything in /opt/rocm.
Before I get to the packaging part, I'm just trying to build rocFFT on my local machine. As part of this, I'm doing the following
The output I get on screen ends with
Using system gdb (not ROCgdb) to get the backtrace: