ROCm / rocFFT

Next generation FFT implementation for ROCm
https://rocm.docs.amd.com/projects/rocFFT/en/latest/

Multiple plan create-execute-destroy cycles lead to segfault #28

Closed by sklam 7 years ago

sklam commented 7 years ago

What is the expected behavior

What actually happens

How to reproduce

Environment

Hardware description
GPU gfx803
CPU Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
Software version
ROCm v1.3
HCC clang version 3.5.0 (based on HCC 0.10.16464-0779319-06c8b76 LLVM 3.5.0svn)

bragadeesh commented 7 years ago

Thanks for reporting this; I will look into it. Are you building the 'release' or 'debug' version of the library? On my side, I built the debug version of the library and compiled the sample code with hipcc with debug info turned on, and I am not seeing any segfault.

sklam commented 7 years ago

I built with cmake -DBUILD_LIBRARY=ON -DCMAKE_BUILD_TYPE=RELEASE -DHIP_ROOT=/opt/rocm/hip .. && make, but it still produces the debug library librocfft-hcc-d.so.

sklam commented 7 years ago

Building the test.cpp from my gist with /opt/rocm/bin/hipcc -g -Ilibrary/include test.cpp -Lbuild/library-build/src -lrocfft-hcc-d also results in segfault.

GDB shows the following backtrace:

Program received signal SIGSEGV, Segmentation fault.
std::vector<unsigned long, std::allocator<unsigned long> >::size (this=0x10)
    at /usr/lib/gcc/x86_64-linux-gnu/4.8/../../../../include/c++/4.8/bits/stl_vector.h:646
646       { return size_type(this->_M_impl._M_finish - this->_M_impl._M_start); }
(gdb) bt
#0  std::vector<unsigned long, std::allocator<unsigned long> >::size (this=0x10)
    at /usr/lib/gcc/x86_64-linux-gnu/4.8/../../../../include/c++/4.8/bits/stl_vector.h:646
#1  0x00000000004bc43f in PrintNode (execPlan=...) at /home/amd_user/rocFFT/library/src/plan.cpp:1957
#2  0x00000000004c9ef9 in rocfft_execute (plan=0xea5c70, in_buffer=0x7fffffffe548, out_buffer=0x0, info=0x0)
    at /home/amd_user/rocFFT/library/src/transform.cpp:50
#3  0x00000000006c1be3 in work () at test.cpp:38
#4  0x00000000006c1e58 in main () at test.cpp:62

I am consistently seeing the this pointer for the std::vector becoming 0x10.
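A small this value such as 0x10 is often what you get when a member is reached through a null object pointer: the reported address is just the member's offset inside the object. Whether that is what happens inside PrintNode is an assumption here, since the thread does not show the node layout. The sketch below uses a hypothetical struct (HypotheticalNode, with a VectorStandIn in place of std::vector) only to show how an offset of 0x10 can arise.

```cpp
#include <cstddef>
#include <cstdint>

// Stand-in for a std::vector member; the real rocFFT node type is not
// shown in this thread, so every name here is hypothetical.
struct VectorStandIn {
    std::uint64_t* start;
    std::uint64_t* finish;
};

struct HypotheticalNode {
    std::uint64_t first;   // occupies bytes 0..7
    std::uint64_t second;  // occupies bytes 8..15
    VectorStandIn length;  // begins at byte 16 (0x10)
};

// If a HypotheticalNode* were null, the address passed as `this` to a
// member function of `length` would be 0 + offsetof(...), i.e. 0x10.
std::size_t address_of_member_through_null() {
    return offsetof(HypotheticalNode, length);
}
```

Under that reading, the pointer to check in the debugger is the one that owns the vector (for example the node reached from execPlan in frame #1), not the vector itself.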

tingxingdong commented 7 years ago

Upgrade to ROCm 1.3.1 first: just run "sudo apt-get install rocm", then reboot the machine.

bragadeesh commented 7 years ago

This turns out to be a library bug. I have checked in a fix on the develop branch (commit 9ab5505ab6eed4eef170c9b87417298d4cd21c99); please try it and let me know if it fixes the problem on your end.

pavanky commented 7 years ago

@bragadeesh A small suggestion,

If the result of repo.planUnique.find(*plan) is stored, you will not need to do a second lookup in the else branch. You can simply do execLookup[plan] = iter->second;
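The suggestion above can be sketched as follows. This is only an illustration of the store-the-iterator pattern: the names planUnique and execLookup come from the comment, but Plan, ExecPlan, Repo, createPlan, and the builds counter are hypothetical stand-ins, and the real rocFFT containers and types may differ.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-ins for the library's plan types.
struct Plan {
    int id;
    bool operator<(const Plan& other) const { return id < other.id; }
};
struct ExecPlan {
    std::string name;
};

struct Repo {
    std::map<Plan, ExecPlan> planUnique;         // one ExecPlan per distinct plan
    std::map<const Plan*, ExecPlan> execLookup;  // plan handle -> ExecPlan
    int builds = 0;                              // counts how often an ExecPlan is built

    void createPlan(const Plan* plan) {
        // Single lookup; keep the iterator instead of searching again later.
        auto iter = planUnique.find(*plan);
        if (iter == planUnique.end()) {
            ExecPlan built{"exec-" + std::to_string(plan->id)};
            ++builds;
            planUnique.emplace(*plan, built);
            execLookup[plan] = built;
        } else {
            // Reuse the stored iterator: no second find() or operator[] on planUnique.
            execLookup[plan] = iter->second;
        }
    }
};
```

With this shape, creating two handles for an identical plan builds the ExecPlan once and the second call only copies iter->second into execLookup.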

kknox commented 7 years ago

@sklam With regard to the debug/release build issue, I am working on a significant build refactoring. I'm working in rocBLAS first, but will then port the changes down into rocFFT, which should fix a few issues.

tingxingdong commented 7 years ago

Hi all,

rocFFT can now be built as a shared library. See the third commit of my pending pull request:

https://github.com/RadeonOpenCompute/rocFFT/pull/29/commits/df94b6f2901f83500286c9b05fea7b6291bbd4b4

bragadeesh commented 7 years ago

@pavanky Sure, I have included that change.