CharlesShang / TFFRCNN

FastER RCNN built on tensorflow
MIT License

Undefined symbol: _ZTIN10tensorflow8OpKernelE #108

Open GaryWooCN opened 6 years ago

GaryWooCN commented 6 years ago

Hi, I am running the master branch and hit the following error during training. Could anyone help with this? Thanks.

File "./faster_rcnn/../lib/roi_pooling_layer/roi_pooling_op.py", line 5, in <module>
    _roi_pooling_module = tf.load_op_library(filename)
File "/usr/local/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/load_library.py", line 56, in load_op_library
    lib_handle = py_tf.TF_LoadLibrary(library_filename, status)
File "/usr/local/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/errors_impl.py", line 473, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.NotFoundError: ./faster_rcnn/../lib/roi_pooling_layer/roi_pooling.so: undefined symbol: _ZTIN10tensorflow8OpKernelE
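For readers unfamiliar with mangled C++ names: the symbol in the error is a typeinfo symbol for a class in the `tensorflow` namespace, so the loader is failing to resolve something that TensorFlow itself is expected to provide. The `c++filt` tool decodes such names. The sketch below hand-decodes only this one narrow form (`_ZTIN<len><name>...E`) to show how the name is laid out; `demangle_typeinfo` is a helper name made up for this illustration, not part of any library.

```python
def demangle_typeinfo(symbol):
    """Decode a symbol of the form _ZTIN<len><id><len><id>...E.

    Only this narrow Itanium C++ ABI form is handled; use a real
    demangler such as c++filt for anything else.
    """
    prefix = "_ZTIN"  # _Z = mangled name, TI = typeinfo, N = nested name
    if not (symbol.startswith(prefix) and symbol.endswith("E")):
        raise ValueError("unsupported symbol form: %r" % symbol)
    body = symbol[len(prefix):-1]
    parts = []
    i = 0
    while i < len(body):
        # Each component is a decimal length followed by that many characters.
        j = i
        while j < len(body) and body[j].isdigit():
            j += 1
        if j == i:
            raise ValueError("expected a length prefix at offset %d" % i)
        length = int(body[i:j])
        name = body[j:j + length]
        if len(name) != length:
            raise ValueError("truncated component at offset %d" % i)
        parts.append(name)
        i = j + length
    if not parts:
        raise ValueError("empty nested name")
    return "typeinfo for " + "::".join(parts)


print(demangle_typeinfo("_ZTIN10tensorflow8OpKernelE"))
```

The decoded name contains no `std::string`, so this particular symbol is mangled the same way under either value of `_GLIBCXX_USE_CXX11_ABI`.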

My ./lib/make.sh is as follows:

#!/usr/bin/env bash
TF_INC=$(python -c 'import tensorflow as tf; print(tf.sysconfig.get_include())')
echo $TF_INC

CUDA_PATH=/usr/local/cuda-8.0/

cd roi_pooling_layer

nvcc -std=c++11 -c -o roi_pooling_op.cu.o roi_pooling_op_gpu.cu.cc \
    -I $TF_INC -D GOOGLE_CUDA=1 -x cu -Xcompiler -fPIC -arch=sm_60

# if you install tf using already-built binary, or gcc version 4.x, uncomment the two lines below
g++ -std=c++11 -shared -D_GLIBCXX_USE_CXX11_ABI=0 -o roi_pooling.so roi_pooling_op.cc \
    roi_pooling_op.cu.o -I $TF_INC -fPIC -lcudart -L $CUDA_PATH/lib64

# for gcc5-built tf
# g++ -std=c++11 -shared -D_GLIBCXX_USE_CXX11_ABI=1 -o roi_pooling.so roi_pooling_op.cc \
#     roi_pooling_op.cu.o -I $TF_INC -fPIC -lcudart -L $CUDA_PATH/lib64

cd ..

# add building psroi_pooling layer
cd psroi_pooling_layer
nvcc -std=c++11 -c -o psroi_pooling_op.cu.o psroi_pooling_op_gpu.cu.cc \
    -I $TF_INC -D GOOGLE_CUDA=1 -x cu -Xcompiler -fPIC -arch=sm_60

# g++ -std=c++11 -shared -o psroi_pooling.so psroi_pooling_op.cc \
#     psroi_pooling_op.cu.o -I $TF_INC -fPIC -lcudart -L $CUDA_PATH/lib64

# if you install tf using already-built binary, or gcc version 4.x, uncomment the two lines below
g++ -std=c++11 -shared -D_GLIBCXX_USE_CXX11_ABI=0 -o psroi_pooling.so psroi_pooling_op.cc \
    psroi_pooling_op.cu.o -I $TF_INC -fPIC -lcudart -L $CUDA_PATH/lib64

cd ..

freeksg66 commented 6 years ago

I fixed it by following this issue: https://github.com/tensorflow/tensorflow/issues/13607

Kongsea commented 6 years ago

I encountered exactly this error too. Have you solved it now?

Kongsea commented 6 years ago

I downloaded roi_pooling.so from https://github.com/CharlesShang/TFFRCNN/blob/roi_pooling/lib/roi_pooling_layer/roi_pooling.so and used it to replace my own compiled roi_pooling.so, as @CharlesShang suggested. It then failed with a different error: tensorflow.python.framework.errors_impl.NotFoundError: faster_rcnn/../lib/roi_pooling_layer/roi_pooling.so: invalid ELF header

Kongsea commented 6 years ago

I finally downgraded TensorFlow from 1.4 to 1.3 and added -D_GLIBCXX_USE_CXX11_ABI=0, which solved the problem.

yh284914425 commented 6 years ago

Where should I add -D_GLIBCXX_USE_CXX11_ABI=0? I use tensorflow_gpu-1.4.0-cp27-none-linux_x86_64.whl and my gcc version is 5.4.0. My ./lib/make.sh is as follows. How should the file be modified? Can you help me? Thanks.

#!/usr/bin/env bash
TF_INC=$(python -c 'import tensorflow as tf; print(tf.sysconfig.get_include())')
echo $TF_INC

CUDA_PATH=/usr/local/cuda/

cd roi_pooling_layer

nvcc -std=c++11 -c -o roi_pooling_op.cu.o roi_pooling_op_gpu.cu.cc \
    -I $TF_INC -D GOOGLE_CUDA=1 -x cu -Xcompiler -fPIC -arch=sm_52

# if you install tf using already-built binary, or gcc version 4.x, uncomment the two lines below
# g++ -std=c++11 -shared -D_GLIBCXX_USE_CXX11_ABI=0 -o roi_pooling.so roi_pooling_op.cc \
#     roi_pooling_op.cu.o -I $TF_INC -fPIC -lcudart -L $CUDA_PATH/lib64

# for gcc5-built tf
g++ -std=c++11 -shared -o roi_pooling.so roi_pooling_op.cc \
    roi_pooling_op.cu.o -I $TF_INC -D GOOGLE_CUDA=1 -fPIC $CXXFLAGS -D_GLIBCXX_USE_CXX11_ABI=0 \
    -lcudart -L $CUDA_PATH/lib64
cd ..

# add building psroi_pooling layer
cd psroi_pooling_layer
nvcc -std=c++11 -c -o psroi_pooling_op.cu.o psroi_pooling_op_gpu.cu.cc \
    -I $TF_INC -D GOOGLE_CUDA=1 -x cu -Xcompiler -fPIC -arch=sm_52

g++ -std=c++11 -shared -o psroi_pooling.so psroi_pooling_op.cc \
    psroi_pooling_op.cu.o -I $TF_INC -fPIC -lcudart -L $CUDA_PATH/lib64

# if you install tf using already-built binary, or gcc version 4.x, uncomment the two lines below
# g++ -std=c++11 -shared -D_GLIBCXX_USE_CXX11_ABI=0 -o psroi_pooling.so psroi_pooling_op.cc \
#     psroi_pooling_op.cu.o -I $TF_INC -fPIC -lcudart -L $CUDA_PATH/lib64

cd ..

@Kongsea

yh284914425 commented 6 years ago

I don't know which lines need to be commented out and which need to be modified. Please help me, @Kongsea.

Kongsea commented 6 years ago

Downgrade your tensorflow to r1.3.

Then try changing this line:

g++ -std=c++11 -shared -o roi_pooling.so roi_pooling_op.cc roi_pooling_op.cu.o -I $TF_INC -D GOOGLE_CUDA=1 -fPIC $CXXFLAGS -D_GLIBCXX_USE_CXX11_ABI=0 -lcudart -L $CUDA_PATH/lib64

to

g++ -std=c++11 -shared -D_GLIBCXX_USE_CXX11_ABI=0 -o roi_pooling.so roi_pooling_op.cc roi_pooling_op.cu.o -I $TF_INC -D GOOGLE_CUDA=1 -fPIC $CXXFLAGS -D_GLIBCXX_USE_CXX11_ABI=0 -lcudart -L $CUDA_PATH/lib64
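On which value to pass for `-D_GLIBCXX_USE_CXX11_ABI`: it should match the value the installed TensorFlow was built with. In TensorFlow versions that provide `tf.sysconfig.get_compile_flags()`, the returned list includes that define, so it can be read instead of guessed. Whether the function exists in the 1.3/1.4 releases discussed here is not verified in this thread, so treat that as an assumption. The helper names and the sample flag list below are illustrative only.

```python
import re

# Matches the libstdc++ dual-ABI define and captures its value.
ABI_DEFINE = re.compile(r"^-D_GLIBCXX_USE_CXX11_ABI=([01])$")


def abi_from_compile_flags(flags):
    """Return 0 or 1 from a list of compiler flags, or None if not present."""
    for flag in flags:
        match = ABI_DEFINE.match(flag)
        if match:
            return int(match.group(1))
    return None


def tensorflow_abi():
    """Ask the installed TensorFlow which ABI value it expects.

    Assumes a TensorFlow version that has tf.sysconfig.get_compile_flags();
    the import is done lazily so the parser above works without TensorFlow.
    """
    import tensorflow as tf
    return abi_from_compile_flags(tf.sysconfig.get_compile_flags())


# Hypothetical flag list, standing in for what TensorFlow might return.
sample_flags = ["-I/path/to/tensorflow/include", "-D_GLIBCXX_USE_CXX11_ABI=0"]
print(abi_from_compile_flags(sample_flags))
```

If `tensorflow_abi()` returns 0, the g++ lines in make.sh should carry `-D_GLIBCXX_USE_CXX11_ABI=0`; if it returns 1, they should carry `=1`.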

selinachenxi commented 6 years ago

You don't have to downgrade to 1.3. I am using 1.4 with gcc 5.4. In make.sh, add TF_LIB=$(python -c 'import tensorflow as tf; print(tf.sysconfig.get_lib())') at the beginning, then add -L $TF_LIB -ltensorflow_framework after -L $CUDA_PATH/lib64 and rebuild. It works.
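The change described here amounts to adding one library search path and one library to the g++ link line. As a rough sketch, the function below assembles such a command as an argument list; `link_command` and the example paths are hypothetical, and in practice the two TensorFlow directories would come from `tf.sysconfig.get_include()` and `tf.sysconfig.get_lib()`, as in the thread.

```python
def link_command(output, sources, tf_inc, tf_lib, cuda_lib, abi=None):
    """Build a g++ argument list that also links libtensorflow_framework.

    `abi` is optional: pass 0 or 1 to add -D_GLIBCXX_USE_CXX11_ABI=<abi>,
    or leave it as None to omit the define.
    """
    cmd = ["g++", "-std=c++11", "-shared"]
    if abi is not None:
        cmd.append("-D_GLIBCXX_USE_CXX11_ABI=%d" % abi)
    cmd += ["-o", output]
    cmd += list(sources)
    cmd += ["-I", tf_inc, "-fPIC", "-lcudart", "-L", cuda_lib]
    # The addition from the comment above: search TF's lib dir and link
    # against the framework library that holds the core TensorFlow symbols.
    cmd += ["-L", tf_lib, "-ltensorflow_framework"]
    return cmd


# Example with placeholder paths (not taken from any real installation).
cmd = link_command(
    "roi_pooling.so",
    ["roi_pooling_op.cc", "roi_pooling_op.cu.o"],
    tf_inc="/path/to/tensorflow/include",
    tf_lib="/path/to/tensorflow",
    cuda_lib="/usr/local/cuda/lib64",
    abi=0,
)
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.check_call` from inside `lib/roi_pooling_layer`.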

zhangweilion commented 6 years ago

@selinachenxi
My TensorFlow is 1.4 and my gcc is 5.4. I modified make.sh as shown below, and it doesn't work:

TF_INC=$(python -c 'import tensorflow as tf; print(tf.sysconfig.get_include())')
# adding by zw
TF_LIB=$(python -c 'import tensorflow as tf; print(tf.sysconfig.get_lib())')
# end adding by zw

CUDA_PATH=/usr/local/cuda/
CXXFLAGS=''

if [[ "$OSTYPE" =~ ^darwin ]]; then
    CXXFLAGS+='-undefined dynamic_lookup'
fi

cd roi_pooling_layer

if [ -d "$CUDA_PATH" ]; then
    nvcc -std=c++11 -c -o roi_pooling_op.cu.o roi_pooling_op_gpu.cu.cc \
        -I $TF_INC -D GOOGLE_CUDA=1 -x cu -Xcompiler -fPIC $CXXFLAGS \
        -arch=sm_37

    g++ -std=c++11 -shared -o roi_pooling.so roi_pooling_op.cc \
        roi_pooling_op.cu.o -I $TF_INC -D GOOGLE_CUDA=1 -fPIC $CXXFLAGS \
        -lcudart -L $TF_LIB -ltensorflow_framework -L $CUDA_PATH/lib64
else
    g++ -std=c++11 -shared -o roi_pooling.so roi_pooling_op.cc \
        -I $TF_INC -fPIC $CXXFLAGS
fi

cd ..

Kongsea commented 6 years ago

This bash script works:

#!/usr/bin/env bash
TF_INC=$(python -c 'import tensorflow as tf; print(tf.sysconfig.get_include())')
TF_LIB=$(python -c 'import tensorflow as tf; print(tf.sysconfig.get_lib())')
echo $TF_INC

CUDA_PATH=/usr/local/cuda/

cd roi_pooling_layer

nvcc -std=c++11 -c -o roi_pooling_op.cu.o roi_pooling_op_gpu.cu.cc \
    -I $TF_INC -D GOOGLE_CUDA=1 -x cu -Xcompiler -fPIC -arch=sm_61

g++ -std=c++11 -shared -D_GLIBCXX_USE_CXX11_ABI=0 -o roi_pooling.so roi_pooling_op.cc \
    roi_pooling_op.cu.o -I $TF_INC -fPIC -lcudart -L $CUDA_PATH/lib64 -L $TF_LIB -ltensorflow_framework

cd ..

cd psroi_pooling_layer

nvcc -std=c++11 -c -o psroi_pooling_op.cu.o psroi_pooling_op_gpu.cu.cc \
    -I $TF_INC -D GOOGLE_CUDA=1 -x cu -Xcompiler -fPIC -arch=sm_61

g++ -std=c++11 -shared -D_GLIBCXX_USE_CXX11_ABI=0 -o psroi_pooling.so psroi_pooling_op.cc \
    psroi_pooling_op.cu.o -I $TF_INC -fPIC -lcudart -L $CUDA_PATH/lib64 -L $TF_LIB -ltensorflow_framework

cd ..
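After rebuilding, one way to confirm that the extra `-ltensorflow_framework` took effect is to look at the dynamic section of the resulting .so. The sketch below parses the `(NEEDED)` entries printed by `readelf -d`; the helper names are made up for illustration, the sample text is hand-written in readelf's usual layout and is not output from a real build, and `links_tensorflow_framework` assumes `readelf` is on the PATH.

```python
import re
import subprocess

# A NEEDED entry in `readelf -d` output looks like:
#  0x0000000000000001 (NEEDED)   Shared library: [libfoo.so]
NEEDED = re.compile(r"\(NEEDED\)\s+Shared library: \[([^\]]+)\]")


def needed_libraries(readelf_output):
    """Return the library names listed as NEEDED in `readelf -d` output."""
    return NEEDED.findall(readelf_output)


def links_tensorflow_framework(so_path):
    """Run `readelf -d` on `so_path` and check for libtensorflow_framework."""
    output = subprocess.check_output(["readelf", "-d", so_path]).decode()
    return any(
        name.startswith("libtensorflow_framework.so")
        for name in needed_libraries(output)
    )


# Hand-written sample in readelf's layout, for illustration only.
sample = """\
 0x0000000000000001 (NEEDED)             Shared library: [libcudart.so.8.0]
 0x0000000000000001 (NEEDED)             Shared library: [libtensorflow_framework.so]
 0x0000000000000001 (NEEDED)             Shared library: [libstdc++.so.6]
"""
print(needed_libraries(sample))
```

If `libtensorflow_framework.so` is missing from the list for roi_pooling.so, the link flags did not take effect and the undefined-symbol error is likely to persist.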

xmeng525 commented 5 years ago

I had a similar problem because of the namespace. I changed my "new_op.cu.cc" from

namespace tensorflow{
// my code
}

to

using namespace tensorflow;
// my code

and it is fixed.

vllsm commented 5 years ago

> You don't have to downgrade to 1.3. I am using 1.4 with gcc 5.4. In make.sh, add TF_LIB=$(python -c 'import tensorflow as tf; print(tf.sysconfig.get_lib())') at the beginning, then add -L $TF_LIB -ltensorflow_framework after -L $CUDA_PATH/lib64 and rebuild. It works.

Thanks so much!

leavewave commented 5 years ago

> I had a similar problem because of the namespace. I changed my "new_op.cu.cc" from
>
> namespace tensorflow{
> // my code
> }
>
> to
>
> using namespace tensorflow;
> // my code
>
> and it is fixed.

Hi, where is this file? I cannot find it.

helinwang commented 3 years ago

I ran into a similar issue. I had manually compiled TF and then tried to load a separately built TF operator library. The problem was that the two *.so files were compiled with different ABIs. The fix for me was compiling my custom TF with --cxxopt="-D_GLIBCXX_USE_CXX11_ABI=0".

E.g.,

bazel build --cxxopt="-D_GLIBCXX_USE_CXX11_ABI=0" --config=v2 --copt=-mavx --copt=-msse4.2 //tensorflow/tools/pip_package:build_pip_package

ArmageddonKnight commented 2 years ago

Adding this linking option works for me: -Wl,--no-as-needed

Reference: https://stackoverflow.com/questions/48189818/undefined-symbol-ztin10tensorflow8opkernele

FeiDao7943 commented 2 years ago

I avoided this issue by changing the versions of g++, gcc, TF, and CUDA. It works on both Colab and physical machines. You can try the environment below; it may not seem well-founded, but it is effective.

Ubuntu 18.04.5 LTS
tensorflow-gpu==1.13.1
numpy==1.16.0 (this might be the key)
gcc (Ubuntu 5.5.0-12ubuntu1) 5.5.0
g++ (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
CUDA 10.0

Also, "-D_GLIBCXX_USE_CXX11_ABI=0" in "tf_xxxx_compile.sh" should be deleted.

Brunda02 commented 1 year ago

My current environment is tensorflow-gpu==1.13.1, gcc==7.5.0, CUDA==10.0, and I am getting the same error. Can anyone suggest which environment I should use?

FeiDao7943 commented 1 year ago

@Brunda02: I hope this list is useful for you, especially where it differs from your setup. This environment was tested on Google Colab and on my PC; I am not sure it will work on other machines.

Ubuntu 18.04.5 LTS
tensorflow-gpu==1.13.1
numpy==1.16.0 (this might be the key)
gcc (Ubuntu 5.5.0-12ubuntu1) 5.5.0
g++ (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
CUDA 10.0

Also, "-D_GLIBCXX_USE_CXX11_ABI=0" in "tf_xxxx_compile.sh" should be deleted.

Brunda02 commented 1 year ago

@FeiDao7943 What is tf_xxxx_compile.sh?

FeiDao7943 commented 1 year ago

@Brunda02 tf_xxxx_compile.sh refers to 3 files in total. Under ./frustum-pointnets-master/models/tf_ops/ there are 3 folders, and each folder contains exactly one .sh file, named tf_xxxx_compile.sh, where xxxx is the name of the folder.

"-D_GLIBCXX_USE_CXX11_ABI=0" in each tf_xxxx_compile.sh should be deleted; if it is not there, ignore this step.