charlesq34 / pointnet

PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation

Tensorflow kernel killed automatically when running train.py in sem_seg #208

Open arunumd opened 4 years ago

arunumd commented 4 years ago

I am trying to run the semantic segmentation training script. I downloaded the data files using download_data.sh and then ran train.py as described in the README, but train.py fails with the following message:

Colocation Debug Info:
Colocation group had the following types and supported devices: 
Root Member(assigned_device_name_index_=-1 requested_device_name_='/device:GPU:0' assigned_device_name_='' resource_device_name_='/device:GPU:0' supported_device_types_=[CPU] possible_devices_=[]
Const: GPU CPU XLA_CPU XLA_GPU 
AssignAdd: CPU 
Identity: GPU CPU XLA_CPU XLA_GPU 
VariableV2: CPU 
Assign: CPU 

Colocation members, user-requested devices, and framework assigned devices, if any:
  Variable (VariableV2) /device:GPU:0
  Variable/Assign (Assign) /device:GPU:0
  Variable/read (Identity) /device:GPU:0
  Adam/value (Const) /device:GPU:0
  Adam (AssignAdd) /device:GPU:0
  save/Assign (Assign) /device:GPU:0
.
.
.
.
.
.
**** EPOCH 000 ****
----
Current batch/total batch num: 0/845
2019-10-10 15:21:49.046829: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10
2019-10-10 15:21:49.425931: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
Killed

It seems that TensorFlow uses up all available RAM and the kernel then gets killed, every time. Has anybody faced the same problem?
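A bare `Killed` line with no Python traceback usually means the process received SIGKILL, which on Linux is often the out-of-memory killer acting on host RAM rather than GPU memory. A minimal sketch for checking this, assuming a Unix system: the helper `peak_rss_mib` is a hypothetical name, not part of the repository, and reports the peak resident memory of the current Python process using the standard `resource` module. It could be called before and after the data-loading step in train.py to see how much host memory that step takes.

```python
import resource
import sys


def peak_rss_mib():
    """Return the peak resident set size of this process in MiB.

    Hypothetical helper for illustration; not part of the pointnet repo.
    """
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux but in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


if __name__ == "__main__":
    print("peak RSS so far: %.1f MiB" % peak_rss_mib())
```

If the reported value approaches the machine's physical RAM just before the process dies, host memory is the likely bottleneck.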

arunumd commented 4 years ago

Update:

I followed the suggestions given on this page and tried the following to solve the problem:

  1. Changed the batch size from 24 to smaller values (8, 2, 1, etc.)
  2. Changed the default optimizer from Adam to momentum

However, none of these changes solved the problem: the kernel is still killed. Any help will be appreciated.
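One possible reason that a smaller batch size makes no difference is that the training set is loaded into host memory in full before the first batch runs, so its footprint does not depend on the batch size. The rough estimate below is a sketch under stated assumptions, not a confirmed diagnosis: it assumes the `0/845` count in the log was produced with the default batch size of 24, and that each sample holds 4096 points with 9 float32 channels (these two values are assumptions, not taken from the log). The function name `dataset_bytes` is hypothetical.

```python
def dataset_bytes(num_samples, num_point, channels, bytes_per_value=4):
    """Size in bytes of a dense (num_samples, num_point, channels) array."""
    return num_samples * num_point * channels * bytes_per_value


BATCHES = 845      # from the log line "Current batch/total batch num: 0/845"
BATCH_SIZE = 24    # default batch size mentioned in the update above
NUM_POINT = 4096   # assumption: points per sample
CHANNELS = 9       # assumption: features per point, stored as float32

# Lower bound on training samples: a partial final batch is not counted.
num_samples = BATCHES * BATCH_SIZE
total = dataset_bytes(num_samples, NUM_POINT, CHANNELS)

print("training samples (at least): %d" % num_samples)
print("approximate array size: %.2f GiB" % (total / 2.0 ** 30))
```

Under these assumptions the training array alone is close to 3 GiB, and any temporary copies made while concatenating or splitting the data would add to that, which could be enough to exhaust RAM on a machine with limited memory.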