ROCm / rocFFT

Next generation FFT implementation for ROCm
https://rocm.docs.amd.com/projects/rocFFT/en/latest/

segfault during build of rocFFT on Fedora #422

Closed tflink closed 1 year ago

tflink commented 1 year ago

What is the expected behavior

What actually happens

How to reproduce

Environment

Hardware description
GPU Advanced Micro Devices, Inc. [AMD/ATI] Vega 20 [Radeon Pro VII/Radeon Instinct MI50 32GB] (rev 06)
CPU AMD Ryzen 7 5700X 8-Core Processor
Software version
ROCK no binary driver, kernel 6.4.0-0.rc6.20230614gitb6dad5178cea (fedora package)
ROCR v5.5.0 (fedora package)
HCC v5.5.1 (fedora package)
Library v16.2 (fedora package built from fork)

Tagging @Mystro256 as he has worked on packaging many of the dependencies here.

The Issue

Important note: I'm working to package ROCm in Fedora so I'm building upon the bits we already have packaged instead of using AMD's prebuilt packages or building everything in /opt/rocm.

Before I get to the packaging part, I'm just trying to build rocFFT on my local machine. As part of this, I'm doing the following:

git checkout release/rocm-rel-5.5
mkdir build
cd build
cmake .. -DCMAKE_CXX_COMPILER=hipcc -DCMAKE_C_COMPILER=hipcc -DSQLITE_USE_SYSTEM_PACKAGE=ON -DCMAKE_INSTALL_PREFIX=/usr -DCMAKE_BUILD_TYPE=Debug
make -d

The output I get on screen ends with

Reading makefile 'library/src/CMakeFiles/rocfft_kernel_cache_target.dir/build.make'...
Reading makefile 'library/src/CMakeFiles/rocfft_kernel_cache_target.dir/compiler_depend.make' (search path) (no ~ expansion)...
Reading makefile 'library/src/CMakeFiles/rocfft_kernel_cache_target.dir/progress.make' (search path) (no ~ expansion)...
Updating makefiles....
 Considering target file 'library/src/CMakeFiles/rocfft_kernel_cache_target.dir/build.make'.
  Looking for an implicit rule for 'library/src/CMakeFiles/rocfft_kernel_cache_target.dir/build.make'.
  No implicit rule found for 'library/src/CMakeFiles/rocfft_kernel_cache_target.dir/build.make'.
 Finished prerequisites of target file 'library/src/CMakeFiles/rocfft_kernel_cache_target.dir/build.make'.
 No need to remake target 'library/src/CMakeFiles/rocfft_kernel_cache_target.dir/build.make'.
 Considering target file 'library/src/CMakeFiles/rocfft_kernel_cache_target.dir/compiler_depend.make'.
  Looking for an implicit rule for 'library/src/CMakeFiles/rocfft_kernel_cache_target.dir/compiler_depend.make'.
  No implicit rule found for 'library/src/CMakeFiles/rocfft_kernel_cache_target.dir/compiler_depend.make'.
 Finished prerequisites of target file 'library/src/CMakeFiles/rocfft_kernel_cache_target.dir/compiler_depend.make'.
 No need to remake target 'library/src/CMakeFiles/rocfft_kernel_cache_target.dir/compiler_depend.make'.
 Considering target file 'library/src/CMakeFiles/rocfft_kernel_cache_target.dir/progress.make'.
  Looking for an implicit rule for 'library/src/CMakeFiles/rocfft_kernel_cache_target.dir/progress.make'.
  No implicit rule found for 'library/src/CMakeFiles/rocfft_kernel_cache_target.dir/progress.make'.
 Finished prerequisites of target file 'library/src/CMakeFiles/rocfft_kernel_cache_target.dir/progress.make'.
 No need to remake target 'library/src/CMakeFiles/rocfft_kernel_cache_target.dir/progress.make'.
Updating goal targets....
Considering target file 'library/src/CMakeFiles/rocfft_kernel_cache_target.dir/build'.
 File 'library/src/CMakeFiles/rocfft_kernel_cache_target.dir/build' does not exist.
 Considering target file 'rocfft_kernel_cache_target'.
  File 'rocfft_kernel_cache_target' does not exist.
  Considering target file 'library/src/CMakeFiles/rocfft_kernel_cache_target'.
   File 'library/src/CMakeFiles/rocfft_kernel_cache_target' does not exist.
   Looking for an implicit rule for 'library/src/CMakeFiles/rocfft_kernel_cache_target'.
   No implicit rule found for 'library/src/CMakeFiles/rocfft_kernel_cache_target'.
   Considering target file 'library/src/rocfft_kernel_cache.db'.
    File 'library/src/rocfft_kernel_cache.db' does not exist.
    Considering target file 'library/src/rocfft_rtc_helper'.
     Looking for an implicit rule for 'library/src/rocfft_rtc_helper'.
     No implicit rule found for 'library/src/rocfft_rtc_helper'.
    Finished prerequisites of target file 'library/src/rocfft_rtc_helper'.
    No need to remake target 'library/src/rocfft_rtc_helper'.
    Considering target file 'library/src/rocfft_aot_helper'.
     Looking for an implicit rule for 'library/src/rocfft_aot_helper'.
     No implicit rule found for 'library/src/rocfft_aot_helper'.
    Finished prerequisites of target file 'library/src/rocfft_aot_helper'.
    No need to remake target 'library/src/rocfft_aot_helper'.
   Finished prerequisites of target file 'library/src/rocfft_kernel_cache.db'.
   Must remake target 'library/src/rocfft_kernel_cache.db'.
library/src/CMakeFiles/rocfft_kernel_cache_target.dir/build.make:73: update target 'library/src/rocfft_kernel_cache.db' due to: target does not exist
/usr/bin/cmake -E cmake_echo_color "--switch=" --blue --bold --progress-dir=/home/tflink/rocm/rocFFT/build/CMakeFiles --progress-num=48 "Compile kernels into shipped cache file"
Putting child 0x55b75d6d6d20 (library/src/rocfft_kernel_cache.db) PID 2800333 on the chain.
Live child 0x55b75d6d6d20 (library/src/rocfft_kernel_cache.db) PID 2800333
[100%] Compile kernels into shipped cache file
Reaping winning child 0x55b75d6d6d20 PID 2800333
cd /home/tflink/rocm/rocFFT/build/library/src && ./rocfft_aot_helper "" /home/tflink/rocm/rocFFT/build/library/src/rocfft_kernel_cache.db /home/tflink/rocm/rocFFT/build/library/src/rocfft_rtc_helper gfx906 gfx908 gfx90a gfx1030 gfx1100 gfx1101 gfx1102
Live child 0x55b75d6d6d20 (library/src/rocfft_kernel_cache.db) PID 2800334
Reaping losing child 0x55b75d6d6d20 PID 2800334
make[2]: *** [library/src/CMakeFiles/rocfft_kernel_cache_target.dir/build.make:74: library/src/rocfft_kernel_cache.db] Segmentation fault (core dumped)
Removing child 0x55b75d6d6d20 PID 2800334 from chain.
Reaping losing child 0x556ff224f9c0 PID 2800332
make[1]: *** [CMakeFiles/Makefile2:455: library/src/CMakeFiles/rocfft_kernel_cache_target.dir/all] Error 2
Removing child 0x556ff224f9c0 PID 2800332 from chain.
Reaping losing child 0x5604d1b40ba0 PID 2795876
make: *** [Makefile:156: all] Error 2
Removing child 0x5604d1b40ba0 PID 2795876 from chain.

Using system gdb (not ROCgdb) to get the backtrace:

Thread 16 "rocfft_aot_help" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffe23f16c0 (LWP 2800618)]
0x0000000000000000 in ?? ()
(gdb) bt
#0  0x0000000000000000 in ?? ()
#1  0x00007ffff7de1e7c in amd::Comgr::create_data_set (data_set=<optimized out>, data_set=<optimized out>)
    at /usr/src/debug/rocclr-5.5.1-8.fc39.x86_64/rocclr/cmake/../device/comgrctx.hpp:197
#2  hiprtc::RTCProgram::RTCProgram (this=<optimized out>, name=..., this=<optimized out>, name=...) at /usr/src/debug/rocclr-5.5.1-8.fc39.x86_64/hipamd/src/hiprtc/hiprtcInternal.cpp:38
#3  0x00007ffff7dcd75b in hiprtc::RTCCompileProgram::RTCCompileProgram (name_="rocfft_rtc.cu", this=0x7fffd0245ff0)
    at /usr/src/debug/rocclr-5.5.1-8.fc39.x86_64/hipamd/src/hiprtc/hiprtcInternal.cpp:98
#4  hiprtcCreateProgram (prog=0x7fffe23f05c0,
    src=0x7fffd0135f80 "#define ROCFFT_CALLBACKS_ENABLED\n\n// Copyright (C) 2016 - 2022 Advanced Micro Devices, Inc. All rights reserved.\n//\n// Permission is hereby granted, free of charge, to any person obtaining a copy\n// o"..., name=0x66a385 "rocfft_rtc.cu", numHeaders=0, headers=0x0, headerNames=0x0)
    at /usr/src/debug/rocclr-5.5.1-8.fc39.x86_64/hipamd/src/hiprtc/hiprtc.cpp:86
#5  0x00000000005d9aa0 in compile_inprocess (
    kernel_src="#define ROCFFT_CALLBACKS_ENABLED\n\n// Copyright (C) 2016 - 2022 Advanced Micro Devices, Inc. All rights reserved.\n//\n// Permission is hereby granted, free of charge, to any person obtaining a copy\n// o"..., gpu_arch="gfx906") at /home/tflink/rocm/rocFFT/library/src/rtc_compile.cpp:30
#6  0x000000000041163d in cached_compile (kernel_name="fft_rtc_fwd_len512_sp_op_CI_CI_sbrc_erc_z_xy_aligned_CB", gpu_arch_with_flags="gfx906", generate_src=..., generator_sum=...)
    at /home/tflink/rocm/rocFFT/library/src/rtc_cache.cpp:496
#7  0x000000000040dbe2 in main::{lambda()#1}::operator()() const (this=<optimized out>) at /home/tflink/rocm/rocFFT/library/src/rocfft_aot_helper.cpp:469
#8  std::__invoke_impl<void, main::{lambda()#1}>(std::__invoke_other, main::{lambda()#1}&&) (__f=...)
    at /usr/bin/../lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/invoke.h:61
#9  std::__invoke<main::{lambda()#1}>(main::{lambda()#1}&&) (__fn=...) at /usr/bin/../lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/invoke.h:96
#10 std::thread::_Invoker<std::tuple<main::{lambda()#1}> >::_M_invoke<0ul>(std::_Index_tuple<0ul>) (this=<optimized out>)
    at /usr/bin/../lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/std_thread.h:292
#11 std::thread::_Invoker<std::tuple<main::{lambda()#1}> >::operator()() (this=<optimized out>)
    at /usr/bin/../lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/std_thread.h:299
#12 std::thread::_State_impl<std::thread::_Invoker<std::tuple<main::{lambda()#1}> > >::_M_run() (this=0x868030)
    at /usr/bin/../lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/std_thread.h:244
#13 0x00007ffff62e31d3 in std::execute_native_thread_routine (__p=0x868030) at ../../../../../libstdc++-v3/src/c++11/thread.cc:104
#14 0x00007ffff5e8d777 in start_thread (arg=<optimized out>) at pthread_create.c:444
#15 0x00007ffff5f1449c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78

evetsso commented 1 year ago

Hmm.. I'm not super familiar with this component, but it looks like hipRTC is trying to call a null function pointer in libamd_comgr.so. Is libamd_comgr.so present in the default library search path when this is run? Is it loaded, according to gdb's info sharedlibrary command when the crash occurs?
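The first question can also be checked from a shell before reaching for gdb. The following is a minimal sketch, not something posted in the thread: the helper name has_dynamic_symbol is hypothetical, the library path is the Fedora location shown later in this issue, and amd_comgr_create_data_set is assumed to be the public comgr entry point behind the create_data_set frame in the backtrace. It only shows whether the library on disk exports the symbol; it says nothing about why the pointer was null inside hipRTC.

```shell
#!/bin/sh
# Sketch: report whether a shared library exists and exports a dynamic symbol.
# Return codes: 0 = exported, 1 = library present but symbol missing,
#               2 = library file or nm not available.
has_dynamic_symbol() {
    [ -e "$1" ] || return 2
    command -v nm >/dev/null 2>&1 || return 2
    # -D reads the dynamic symbol table; --defined-only skips imported symbols
    if nm -D --defined-only "$1" | grep -qw "$2"; then
        return 0
    fi
    return 1
}

# Path and symbol are assumptions based on this thread; adjust for your system.
rc=0
has_dynamic_symbol /usr/lib64/libamd_comgr.so.2 amd_comgr_create_data_set || rc=$?
case $rc in
    0) echo "symbol exported" ;;
    1) echo "library present, symbol missing" ;;
    *) echo "library or nm not available" ;;
esac
```

A result of 0 only rules out a missing export; the loaded-library check with gdb's info sharedlibrary is still needed to see what the crashing process actually mapped.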

tflink commented 1 year ago

libamd_comgr.so is in /usr/lib64 which is part of the default search path on this system

# ll /usr/lib64/libamd_comgr*
lrwxrwxrwx. 1 root root      17 May 24 18:00 /usr/lib64/libamd_comgr.so -> libamd_comgr.so.2
lrwxrwxrwx. 1 root root      21 May 24 18:00 /usr/lib64/libamd_comgr.so.2 -> libamd_comgr.so.2.5.0
-rwxr-xr-x. 1 root root 9484672 May 24 18:00 /usr/lib64/libamd_comgr.so.2.5.0

> Is it loaded, according to gdb's info sharedlibrary command when the crash occurs?

(gdb) info sharedlibrary
From                To                  Syms Read   Shared Object Library
0x00007ffff7fcb000  0x00007ffff7ff0eb5  Yes         /lib64/ld-linux-x86-64.so.2
0x00007ffff7e78970  0x00007ffff7f8207b  Yes         /lib64/libsqlite3.so.0
0x00007ffff7d8c6a0  0x00007ffff7dea598  Yes         /lib64/libhiprtc.so.5
0x00007ffff661e230  0x00007ffff6960468  Yes         /lib64/libamdhip64.so.5
0x00007ffff62a4590  0x00007ffff63c5d22  Yes         /lib64/libstdc++.so.6
0x00007ffff7cb63d0  0x00007ffff7d2a8f8  Yes         /lib64/libm.so.6
0x00007ffff65df6d0  0x00007ffff65f9f75  Yes         /lib64/libgcc_s.so.1
0x00007ffff5e26780  0x00007ffff5f83d8d  Yes         /lib64/libc.so.6
0x00007ffff65c5630  0x00007ffff65d317b  Yes         /lib64/libz.so.1
0x00007ffff540e4d0  0x00007ffff5463f72  Yes         /lib64/libamd_comgr.so.2
0x00007ffff5017070  0x00007ffff50d2ac9  Yes         /lib64/libhsa-runtime64.so.1
0x00007ffff65b7c10  0x00007ffff65bd035  Yes         /lib64/libnuma.so.1
0x00007ffff4c4e090  0x00007ffff4e2056a  Yes         /lib64/liblldELF.so.16
0x00007ffff6591010  0x00007ffff65aa32f  Yes         /lib64/liblldCommon.so.16
0x00007ffff1367510  0x00007ffff3f33417  Yes         /lib64/libclang-cpp.so.16
0x00007fffe9ec0840  0x00007fffed568622  Yes         /lib64/libLLVM-16.so
0x00007ffff6562b10  0x00007ffff6574950  Yes         /lib64/libhsakmt.so.1
0x00007ffff65447b0  0x00007ffff6556495  Yes         /lib64/libelf.so.1
0x00007ffff652eb90  0x00007ffff6539681  Yes         /lib64/libdrm.so.2
0x00007ffff651f5b0  0x00007ffff652514a  Yes         /lib64/libffi.so.8
0x00007ffff64ea8d0  0x00007ffff6509bf4  Yes         /lib64/libedit.so.0
0x00007ffff64ba670  0x00007ffff64cc8a4  Yes         /lib64/libtinfo.so.6
0x00007ffff64a1830  0x00007ffff64a61c1  Yes         /lib64/libdrm_amdgpu.so.1
0x00007ffff6149e40  0x00007ffff61f01f2  Yes         /lib64/libzstd.so.1

Mystro256 commented 1 year ago

@cgmb FYI

cgmb commented 1 year ago

@evetsso, if it helps, these are complete steps to reproduce the issue in a docker image:

$ docker run -it --rm --device=/dev/dri --device=/dev/kfd --security-opt seccomp=unconfined fedora:rawhide
dnf upgrade
dnf install rocm-hip-devel rocrand-devel sqlite-devel rocm-comgr-devel hsakmt-devel rocm-runtime-devel git vim cmake lld clang clang-libs clang-resource-filesystem clang-tools-extra llvm gdb
git clone https://github.com/ROCmSoftwarePlatform/rocFFT.git
cd rocFFT
git checkout release/rocm-rel-5.5
mkdir build
cd build
cmake .. -DCMAKE_CXX_COMPILER=hipcc -DCMAKE_C_COMPILER=hipcc -DSQLITE_USE_SYSTEM_PACKAGE=ON -DCMAKE_INSTALL_PREFIX=/usr -DCMAKE_BUILD_TYPE=Debug
VERBOSE=1 make -j16

You can then get a bit closer to the problem with:

cd /root/rocFFT/build/library/src
gdb -ex r --args ./rocfft_aot_helper "" /root/rocFFT/build/library/src/rocfft_kernel_cache.db /root/rocFFT/build/library/src/rocfft_rtc_helper gfx906 
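For attaching output to a report, the same run can be captured non-interactively. This is a sketch under a few assumptions: it is run from build/library/src, the log file name gdb-backtrace.log is arbitrary, and relative paths via $PWD stand in for the absolute paths above. The gdb options used (-batch, -ex, --args) are standard; the guard simply skips the run when the helper or gdb is not present.

```shell
#!/bin/sh
# Sketch: capture the crash backtrace and loaded libraries in one batch run.
helper=./rocfft_aot_helper
log=gdb-backtrace.log        # arbitrary file name
result=skipped

if [ -x "$helper" ] && command -v gdb >/dev/null 2>&1; then
    # -batch makes gdb exit after the last -ex command has run.
    # After the SIGSEGV stops the program, bt prints the faulting thread's stack.
    gdb -batch -ex run -ex bt -ex 'info sharedlibrary' \
        --args "$helper" "" "$PWD/rocfft_kernel_cache.db" \
        "$PWD/rocfft_rtc_helper" gfx906 > "$log" 2>&1 || true
    result=captured
fi
echo "backtrace capture: $result"
```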

It can also help to install debug packages:

# enable the debug repo by changing 'enabled = 0' to 'enabled = 1'
vim /etc/yum.repos.d/fedora-rawhide.repo

# install the debug packages
dnf install clang-libs-debuginfo elfutils-libelf-debuginfo glibc-debuginfo hsakmt-debuginfo libdrm-debuginfo libedit-debuginfo libffi-debuginfo libgcc-debuginfo libstdc++-debuginfo libzstd-debuginfo lld-libs-debuginfo llvm-libs-debuginfo numactl-libs-debuginfo rocm-comgr-debuginfo rocm-hip-debuginfo rocm-runtime-debuginfo sqlite-libs-debuginfo zlib-debuginfo ncurses-libs-debuginfo

# fix a missing directory or gdb will complain
mkdir /usr/src/debug/rocclr-5.5.1-9.fc39.x86_64/rocclr/cmake

There's too much that has been optimized out for me to understand what is wrong and I'm not very familiar with hiprtc.

cgmb commented 1 year ago

Just as a data point, Debian is using rocFFT from 5.5.1, built with HIP from 5.2.3 and LLVM 15, and has not encountered this problem. I suspect this might therefore be related to a change in a low-level component like HIP, comgr, or clang.

evetsso commented 1 year ago

Interesting.. I just tried @cgmb's repro steps (i.e. with the latest rawhide as of today) and get different behaviour now that ROCm 5.6 has been released:

[100%] Compile default kernels and solution-map kernels into shipped cache file
terminate called after throwing an instance of 'std::runtime_error'
  what():  In file included from <built-in>:1:
In file included from /builddir/build/BUILD/clr-rocm-5.6.0/redhat-linux-build/hipamd/src/hiprtc/hip_rtc_gen/hipRTC_header.h:5:
In file included from /builddir/build/BUILD/clr-rocm-5.6.0/HIP-rocm-5.6.0/include/hip/hip_runtime.h:62:
In file included from /builddir/build/BUILD/clr-rocm-5.6.0/hipamd/include/hip/amd_detail/amd_hip_runtime.h:112:
In file included from /builddir/build/BUILD/clr-rocm-5.6.0/hipamd/include/hip/amd_detail/amd_hip_atomic.h:25:
/builddir/build/BUILD/clr-rocm-5.6.0/hipamd/include/hip/amd_detail/amd_device_functions.h:82:52: error: redefinition of '__ffsll'
__attribute__((device)) static inline unsigned int __ffsll(uint64_t input) {
                                                   ^
/builddir/build/BUILD/clr-rocm-5.6.0/hipamd/include/hip/amd_detail/amd_device_functions.h:70:52: note: previous definition is here
__attribute__((device)) static inline unsigned int __ffsll(unsigned long long int input) {
                                                   ^
1 error generated when compiling for gfx906.

make[2]: *** [library/src/CMakeFiles/rocfft_kernel_cache_target.dir/build.make:74: library/src/rocfft_kernel_cache.db] Aborted (core dumped)
make[1]: *** [CMakeFiles/Makefile2:512: library/src/CMakeFiles/rocfft_kernel_cache_target.dir/all] Error 2
make: *** [Makefile:156: all] Error 2

So hipRTC in ROCm 5.6 is no longer trying to follow a null function pointer at least. Again, I'm not really familiar with hipRTC, but I also don't see any obvious changes in the clr commit log between ROCm 5.5 and 5.6 that would explain this.

This new error seems to be Fedora-specific, and caused by https://src.fedoraproject.org/rpms/rocclr/blob/rawhide/f/0001-add-uint64_t-variant-for-__ffsll.patch - ideally the reason for that patch has gone away with ROCm 5.6 and it can just be reverted. But I'm not familiar enough with the original problem to be sure.

evetsso commented 1 year ago

Hi @tflink, do you have any update? Has the behaviour changed for you with ROCm 5.6?

tflink commented 1 year ago

Apologies for the delay, I've been messing with how I get notifications and managed to miss a few from github.

I see the same behavior you do when I build 5.6.0. I started patching out the kernel cache during the build when I learned that's how the debian package is handling the issue.

I rebuilt rocclr without the patch you linked to. The runtime_error goes away, but the build then appears to hang after [100%] Built target rocfft, with rocfft_aot_help and rocfft_rtc_help taking up almost all available CPU. I let it sit in that state for about 15 minutes before killing the processes manually, under the assumption that it was not going to finish. I think I saw the same behavior on RHEL 9.2 when I tried to build rocfft using the official AMD repos for the dependencies. I'll attempt to reproduce there to make sure.

If I patch out the kernel cache parts in the same way that the debian patch does, rocfft builds and doesn't hang at the end of the build process.

Is the best way to help you reproduce issues with container images?

tflink commented 1 year ago

Also, I spoke with the author of the uint64 patch you linked to (@trixrt) and as far as we know, it's still needed for blender.

cgmb commented 1 year ago

> I see the same behavior you do when I build 5.6.0. I started patching out the kernel cache during the build when I learned that's how the debian package is handling the issue.

I do not encounter this issue on Debian. rocfft 5.5.0-1 built successfully and did not include that patch. I only patched it out of the build because I had not yet figured out where the cache should be installed according to Debian policy, and it seemed wasteful to build the cache if it wasn't going to be installed.

tflink commented 1 year ago

> I do not encounter this issue on Debian. rocfft 5.5.0-1 built successfully and did not include that patch. I only patched it out of the build because I had not yet figured out where the cache should be installed according to Debian policy, and it seemed wasteful to build the cache if it wasn't going to be installed.

Good to know, thanks

tflink commented 1 year ago

I'm getting the same issue with hanging at the end of the rocfft build on RHEL 9.2 using the AMD supplied packages for deps. Am I just being impatient? I left this build alone for longer, though. rocfft_aot_help and rocfft_rtc_help have the CPU pretty much pegged and it's been in this state for 45 minutes now.

evetsso commented 1 year ago

The end of the build does take a very long time, as that's when we're compiling all of the kernels that we want to distribute with the library, for all enabled architectures. The optional ROCFFT_BUILD_KERNEL_CACHE_PATH CMake parameter that was the subject of #430 can help somewhat, by defining a place where these kernels are persistently stored between builds so that they can be reused. But it doesn't reduce the initial amount of work that the build would want to do.

rocFFT will be perfectly functional if the cache is not built or distributed with the library. But that means that all kernels will need to be compiled at runtime, since rocFFT will not find any that are distributed with it. Creation of FFT plans will take longer, but the plans will still work.

A middle ground might be to remove architectures from AMDGPU_TARGETS. Fewer architectures in that list means fewer kernels that need to be compiled at the end of the build. rocFFT will still work on an architecture that is not in this list, because it will still compile kernels on-demand at runtime.
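As a rough illustration of the two options above, the configure step from earlier in this issue could be extended as follows. This is a sketch, not a recommended configuration: the single target gfx906 matches the GPU reported in this issue, the cache directory is a hypothetical location, and passing AMDGPU_TARGETS with -D is an assumption about how the list is set. The script only prints the command so it can be reviewed before running it from the build directory.

```shell
#!/bin/sh
# Sketch: assemble a rocFFT configure command that limits the architectures
# compiled into the shipped kernel cache and keeps compiled kernels between builds.
targets="gfx906"                                       # GPU from this issue
cache_dir="${HOME:-/tmp}/rocfft-kernel-build-cache"    # hypothetical directory

cmake_cmd="cmake .. -DCMAKE_CXX_COMPILER=hipcc -DCMAKE_C_COMPILER=hipcc"
cmake_cmd="$cmake_cmd -DSQLITE_USE_SYSTEM_PACKAGE=ON -DCMAKE_BUILD_TYPE=Debug"
# Fewer targets means fewer kernels to compile at the end of the build.
cmake_cmd="$cmake_cmd -DAMDGPU_TARGETS=$targets"
# Persistent storage lets later builds reuse kernels compiled by earlier ones.
cmake_cmd="$cmake_cmd -DROCFFT_BUILD_KERNEL_CACHE_PATH=$cache_dir"

# Printed for review only; nothing is configured or built here.
echo "$cmake_cmd"
```

Neither option shortens the first build for a given architecture; the cache path only pays off on rebuilds.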

tflink commented 1 year ago

Yeah, I didn't kill the build on RHEL and it turns out that it took about 1.5 hours to do that last part. I'll redo the fedora build locally and let it run. Unless something strange happens, it looks like we're going to have to deal with that patch and find some sort of fix.

tflink commented 1 year ago

After reverting that patch, rocFFT builds and works fine on this system. It turns out that the uint64_t-variant patch was not only breaking the compile but was apparently also the root cause of an issue that kept the library from working when compiled w/o cached kernels.

Thanks for the help with debugging this.