charlesq34 / pointnet2

PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space

Segmentation Fault while training Semantic Scene Parsing (scannet) #178

Open mradwan80 opened 4 years ago

mradwan80 commented 4 years ago

I have Ubuntu 19.10, python 2.7.17, tensorflow 1.14.0, and cuda 10.2. When I run train.py, I get a segmentation fault message:

python train.py
pid: 5492
WARNING:tensorflow:From /home/mradwan/other-projects/pointnet2/scannet/pointnet2_sem_seg.py:12: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.

Tensor("Placeholder_3:0", shape=(), dtype=bool, device=/device:GPU:0)
WARNING:tensorflow:From train.py:91: The name tf.train.exponential_decay is deprecated. Please use tf.compat.v1.train.exponential_decay instead.

WARNING:tensorflow:From train.py:111: The name tf.summary.scalar is deprecated. Please use tf.compat.v1.summary.scalar instead.

--- Get model and loss
WARNING:tensorflow:From /home/mradwan/other-projects/pointnet2/utils/pointnet_util.py:107: The name tf.variable_scope is deprecated. Please use tf.compat.v1.variable_scope instead.

Segmentation fault (core dumped)

Extra details:

The contents of the .sh files I used to compile the TF operators are:

tf_interpolate_compile.sh:

g++ -std=c++11 tf_interpolate.cpp -o tf_interpolate_so.so -shared -fPIC -I /usr/local/lib/python2.7/dist-packages/tensorflow/include -I /usr/local/cuda-10.2/include -I /usr/local/lib/python2.7/dist-packages/tensorflow/include/external/nsync/public -lcudart -L /usr/local/cuda-10.2/lib64/ -L/usr/local/lib/python2.7/dist-packages/tensorflow -ltensorflow_framework -O2 -D_GLIBCXX_USE_CXX11_ABI=0

tf_sampling_compile.sh:

/usr/local/cuda-10.2/bin/nvcc tf_sampling_g.cu -o tf_sampling_g.cu.o -c -O2 -DGOOGLE_CUDA=1 -x cu -Xcompiler -fPIC

g++ -std=c++11 tf_sampling.cpp tf_sampling_g.cu.o -o tf_sampling_so.so -shared -fPIC -I /usr/local/lib/python2.7/dist-packages/tensorflow/include -I /usr/local/cuda-10.2/include -I /usr/local/lib/python2.7/dist-packages/tensorflow/include/external/nsync/public -lcudart -L /usr/local/cuda-10.2/lib64/ -L/usr/local/lib/python2.7/dist-packages/tensorflow -ltensorflow_framework -O2 -D_GLIBCXX_USE_CXX11_ABI=0

tf_grouping_compile.sh:

/usr/local/cuda-10.2/bin/nvcc tf_grouping_g.cu -o tf_grouping_g.cu.o -c -O2 -DGOOGLE_CUDA=1 -x cu -Xcompiler -fPIC

g++ -std=c++11 tf_grouping.cpp tf_grouping_g.cu.o -o tf_grouping_so.so -shared -fPIC -I /usr/local/lib/python2.7/dist-packages/tensorflow/include -I /usr/local/cuda-10.2/include -I /usr/local/lib/python2.7/dist-packages/tensorflow/include/external/nsync/public -lcudart -L /usr/local/cuda-10.2/lib64/ -L/usr/local/lib/python2.7/dist-packages/tensorflow -ltensorflow_framework -O2 -D_GLIBCXX_USE_CXX11_ABI=0
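As a side note, the hardcoded include and library paths in these scripts are a common source of ABI and version mismatches. A sketch of an alternative (not what I actually ran): TF 1.14 can report its own compile and link flags via `tf.sysconfig`, so the scripts could derive them instead of hardcoding the dist-packages paths. The CUDA paths below are still assumptions matching my install:

```shell
# Sketch: derive compile/link flags from the installed TensorFlow itself
# (tf.sysconfig.get_compile_flags/get_link_flags exist in TF >= 1.14)
# instead of hardcoding dist-packages paths.
TF_CFLAGS=$(python -c 'import tensorflow as tf; print(" ".join(tf.sysconfig.get_compile_flags()))')
TF_LFLAGS=$(python -c 'import tensorflow as tf; print(" ".join(tf.sysconfig.get_link_flags()))')

g++ -std=c++11 tf_sampling.cpp tf_sampling_g.cu.o -o tf_sampling_so.so \
    -shared -fPIC ${TF_CFLAGS} ${TF_LFLAGS} \
    -I /usr/local/cuda-10.2/include -L /usr/local/cuda-10.2/lib64/ -lcudart -O2
```

This keeps the `-D_GLIBCXX_USE_CXX11_ABI` setting consistent with whatever the installed TF wheel was built with, which is one of the usual causes of segfaults when loading custom ops.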

I also needed to make some changes to get the compilation to work:

1. Copied /usr/local/lib/python2.7/dist-packages/tensorflow/libtensorflow_framework.so.1 to /usr/local/lib/python2.7/dist-packages/tensorflow/libtensorflow_framework.so.

2. Ran:

pip install -U scikit-learn scipy matplotlib

3. Copied the files plyfile.py, plyfile.pyc, eulerangles.py, and eulerangles.pyc from pointnet/utils to pointnet2/scannet.

4. Changed default='model' to default='pointnet2_sem_seg' in the following line of train.py, and copied pointnet2_sem_seg.py and pointnet2_sem_seg.pyc from pointnet2/models to pointnet2/scannet:

parser.add_argument('--model', default='model', help='Model name [default: model]')

5. Added sys.path.append(os.path.join(ROOT_DIR, 'models')) in train.py, after sys.path.append(os.path.join(ROOT_DIR, 'utils')).
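For the libtensorflow_framework workaround in step 1, a symlink is the more common fix than a copy, since the linker flag -ltensorflow_framework just looks for the unversioned name. A minimal demonstration of the idea in a scratch directory (the real path, /usr/local/lib/python2.7/dist-packages/tensorflow, is specific to my setup):

```shell
# Demonstrate the symlink fix in a scratch directory; substitute the real
# tensorflow package directory on an actual install.
TF_LIB=$(mktemp -d)
touch "$TF_LIB/libtensorflow_framework.so.1"       # stand-in for the shipped library
ln -s "$TF_LIB/libtensorflow_framework.so.1" "$TF_LIB/libtensorflow_framework.so"
readlink "$TF_LIB/libtensorflow_framework.so"      # the unversioned name now resolves to the .so.1
```

A symlink also survives a TF reinstall more gracefully than a stale copy of the old library, which can itself cause crashes at load time.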

These steps got the TF operators compiled and train.py running, but then I get the segmentation fault. Does anyone have an idea why this happens?

mradwan80 commented 4 years ago

@charlesq34 @ericyi @suhaochina @rqi-nuro

Houssembenmid commented 3 years ago

Have you found a solution? I have the same issue.

Stanfording commented 3 years ago

Me too!

Stanfording commented 3 years ago

@charlesq34 @ericyi @suhaochina @rqi-nuro

It's possible that the memory needed for the PointNet++ architecture is larger than what your GPU has. You can try building the same thing on Google Colab and see if it works there.