QingyongHu / SensatUrban

🔥Urban-scale point cloud dataset (CVPR 2021 & IJCV 2022)
MIT License

Training stopped suddenly #27

Open Takenokono opened 3 years ago

Takenokono commented 3 years ago

Hi, I have a problem when I try to train the network. During training, processing stopped suddenly and I got the message below.

Has anyone had the same problem or solved it? Thanks.

(SensatUrban) [test]% python main_SensatUrban.py --mode train --gpu 0
birmingham_block_0_KDTree.pkl 50.8 MB loaded in 0.1s
birmingham_block_1_KDTree.pkl 135.1 MB loaded in 0.2s
birmingham_block_10_KDTree.pkl 28.4 MB loaded in 0.1s
birmingham_block_11_KDTree.pkl 5.7 MB loaded in 0.0s
birmingham_block_12_KDTree.pkl 26.0 MB loaded in 0.0s
birmingham_block_13_KDTree.pkl 5.3 MB loaded in 0.0s
birmingham_block_2_KDTree.pkl 96.5 MB loaded in 0.1s
birmingham_block_3_KDTree.pkl 164.2 MB loaded in 0.3s
birmingham_block_4_KDTree.pkl 186.0 MB loaded in 0.4s
birmingham_block_5_KDTree.pkl 183.9 MB loaded in 0.3s
birmingham_block_6_KDTree.pkl 39.8 MB loaded in 0.1s
birmingham_block_7_KDTree.pkl 45.6 MB loaded in 0.1s
birmingham_block_8_KDTree.pkl 161.3 MB loaded in 0.2s
birmingham_block_9_KDTree.pkl 167.8 MB loaded in 0.3s
cambridge_block_0_KDTree.pkl 0.6 MB loaded in 0.0s
cambridge_block_1_KDTree.pkl 0.5 MB loaded in 0.0s
cambridge_block_10_KDTree.pkl 185.9 MB loaded in 0.3s
cambridge_block_12_KDTree.pkl 244.0 MB loaded in 0.5s
cambridge_block_14_KDTree.pkl 287.3 MB loaded in 0.6s
cambridge_block_15_KDTree.pkl 268.0 MB loaded in 0.4s
cambridge_block_16_KDTree.pkl 259.5 MB loaded in 0.3s
cambridge_block_17_KDTree.pkl 44.5 MB loaded in 0.1s
cambridge_block_18_KDTree.pkl 68.9 MB loaded in 0.1s
cambridge_block_19_KDTree.pkl 188.2 MB loaded in 0.4s
cambridge_block_2_KDTree.pkl 255.5 MB loaded in 0.5s
cambridge_block_20_KDTree.pkl 270.1 MB loaded in 0.5s
cambridge_block_21_KDTree.pkl 227.3 MB loaded in 0.5s
cambridge_block_22_KDTree.pkl 204.2 MB loaded in 0.3s
cambridge_block_23_KDTree.pkl 17.6 MB loaded in 0.0s
cambridge_block_25_KDTree.pkl 44.7 MB loaded in 0.1s
cambridge_block_26_KDTree.pkl 200.1 MB loaded in 0.4s
cambridge_block_27_KDTree.pkl 236.9 MB loaded in 0.3s
cambridge_block_28_KDTree.pkl 120.8 MB loaded in 0.2s
cambridge_block_3_KDTree.pkl 237.3 MB loaded in 0.5s
cambridge_block_32_KDTree.pkl 5.5 MB loaded in 0.0s
cambridge_block_33_KDTree.pkl 62.9 MB loaded in 0.1s
cambridge_block_34_KDTree.pkl 5.7 MB loaded in 0.0s
cambridge_block_4_KDTree.pkl 25.4 MB loaded in 0.0s
cambridge_block_6_KDTree.pkl 204.6 MB loaded in 0.4s
cambridge_block_7_KDTree.pkl 295.8 MB loaded in 0.4s
cambridge_block_9_KDTree.pkl 296.7 MB loaded in 0.6s

Preparing reprojected indices for testing
birmingham_block_1 done in 0.2s
birmingham_block_2 done in 0.1s
birmingham_block_5 done in 0.3s
birmingham_block_8 done in 0.2s
cambridge_block_10 done in 0.3s
cambridge_block_15 done in 0.4s
cambridge_block_16 done in 0.4s
cambridge_block_22 done in 0.3s
cambridge_block_27 done in 0.3s
cambridge_block_7 done in 0.4s
Initiating input pipelines
WARNING:tensorflow:From /home/takeda/.pyenv/versions/SensatUrban/lib/python3.5/site-packages/tensorflow/python/data/ops/dataset_ops.py:494: py_func (from tensorflow.python.ops.script_ops) is deprecated and will be removed in a future version.
Instructions for updating:
tf.py_func is deprecated in TF V2. Instead, there are two
    options available in V2.
    - tf.py_function takes a python function which manipulates tf eager
    tensors instead of numpy arrays. It's easy to convert a tf eager tensor to
    an ndarray (just call tensor.numpy()) but having access to eager tensors
    means `tf.py_function`s can use accelerators such as GPUs as well as
    being differentiable using a gradient tape.
    - tf.numpy_function maintains the semantics of the deprecated tf.py_func
    (it is not differentiable, and manipulates numpy arrays). It drops the
    stateful argument making all functions stateful.

WARNING:tensorflow:From main_SensatUrban.py:234: The name tf.data.Iterator is deprecated. Please use tf.compat.v1.data.Iterator instead.

WARNING:tensorflow:From main_SensatUrban.py:234: DatasetV1.output_types (from tensorflow.python.data.ops.dataset_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.compat.v1.data.get_output_types(dataset)`.
WARNING:tensorflow:From main_SensatUrban.py:234: DatasetV1.output_shapes (from tensorflow.python.data.ops.dataset_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.compat.v1.data.get_output_shapes(dataset)`.
WARNING:tensorflow:From /home/takeda/.pyenv/versions/SensatUrban/lib/python3.5/site-packages/tensorflow/python/data/ops/iterator_ops.py:348: Iterator.output_types (from tensorflow.python.data.ops.iterator_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.compat.v1.data.get_output_types(iterator)`.
WARNING:tensorflow:From /home/takeda/.pyenv/versions/SensatUrban/lib/python3.5/site-packages/tensorflow/python/data/ops/iterator_ops.py:349: Iterator.output_shapes (from tensorflow.python.data.ops.iterator_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.compat.v1.data.get_output_shapes(iterator)`.
WARNING:tensorflow:From /home/takeda/.pyenv/versions/SensatUrban/lib/python3.5/site-packages/tensorflow/python/data/ops/iterator_ops.py:351: Iterator.output_classes (from tensorflow.python.data.ops.iterator_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.compat.v1.data.get_output_classes(iterator)`.
WARNING:tensorflow:From /media/takeda/0DC3-0000/SensatUrban/RandLANet.py:30: The name tf.variable_scope is deprecated. Please use tf.compat.v1.variable_scope instead.

WARNING:tensorflow:From /media/takeda/0DC3-0000/SensatUrban/RandLANet.py:43: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.

WARNING:tensorflow:From /media/takeda/0DC3-0000/SensatUrban/RandLANet.py:106: dense (from tensorflow.python.layers.core) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.dense instead.
WARNING:tensorflow:From /home/takeda/.pyenv/versions/SensatUrban/lib/python3.5/site-packages/tensorflow/python/ops/init_ops.py:1251: calling VarianceScaling.__init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
WARNING:tensorflow:From /media/takeda/0DC3-0000/SensatUrban/RandLANet.py:107: batch_normalization (from tensorflow.python.layers.normalization) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.BatchNormalization instead.  In particular, `tf.control_dependencies(tf.GraphKeys.UPDATE_OPS)` should not be used (consult the `tf.keras.layers.batch_normalization` documentation).
WARNING:tensorflow:From /media/takeda/0DC3-0000/SensatUrban/tf_util.py:49: The name tf.truncated_normal is deprecated. Please use tf.random.truncated_normal instead.

WARNING:tensorflow:From /media/takeda/0DC3-0000/SensatUrban/tf_util.py:54: The name tf.add_to_collection is deprecated. Please use tf.compat.v1.add_to_collection instead.

WARNING:tensorflow:From /media/takeda/0DC3-0000/SensatUrban/tf_util.py:22: The name tf.get_variable is deprecated. Please use tf.compat.v1.get_variable instead.

WARNING:tensorflow:From /home/takeda/.pyenv/versions/SensatUrban/lib/python3.5/site-packages/tensorflow/python/util/dispatch.py:180: batch_gather (from tensorflow.python.ops.array_ops) is deprecated and will be removed after 2017-10-25.
Instructions for updating:
`tf.batch_gather` is deprecated, please use `tf.gather` with `batch_dims` instead.
WARNING:tensorflow:From /media/takeda/0DC3-0000/SensatUrban/tf_util.py:572: calling dropout (from tensorflow.python.ops.nn_ops) with keep_prob is deprecated and will be removed in a future version.
Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.
WARNING:tensorflow:From /media/takeda/0DC3-0000/SensatUrban/RandLANet.py:66: add_dispatch_support.<locals>.wrapper (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
WARNING:tensorflow:From /media/takeda/0DC3-0000/SensatUrban/RandLANet.py:81: The name tf.train.AdamOptimizer is deprecated. Please use tf.compat.v1.train.AdamOptimizer instead.

WARNING:tensorflow:From /media/takeda/0DC3-0000/SensatUrban/RandLANet.py:89: The name tf.summary.scalar is deprecated. Please use tf.compat.v1.summary.scalar instead.

WARNING:tensorflow:From /media/takeda/0DC3-0000/SensatUrban/RandLANet.py:94: The name tf.train.Saver is deprecated. Please use tf.compat.v1.train.Saver instead.

WARNING:tensorflow:From /media/takeda/0DC3-0000/SensatUrban/RandLANet.py:98: The name tf.summary.merge_all is deprecated. Please use tf.compat.v1.summary.merge_all instead.

WARNING:tensorflow:From /media/takeda/0DC3-0000/SensatUrban/RandLANet.py:99: The name tf.summary.FileWriter is deprecated. Please use tf.compat.v1.summary.FileWriter instead.

****EPOCH 0****
Step 00000050 L_out=6.323 Acc=0.36 ---  636.80 ms/batch
Step 00000100 L_out=5.793 Acc=0.55 ---  590.22 ms/batch
Step 00000150 L_out=5.148 Acc=0.52 ---  617.91 ms/batch
Step 00000200 L_out=3.836 Acc=0.52 ---  593.99 ms/batch
Step 00000250 L_out=2.575 Acc=0.50 ---  624.92 ms/batch
Step 00000300 L_out=4.704 Acc=0.53 ---  596.01 ms/batch
Step 00000350 L_out=2.949 Acc=0.49 ---  593.54 ms/batch
Step 00000400 L_out=4.432 Acc=0.41 ---  591.37 ms/batch
Step 00000450 L_out=3.479 Acc=0.68 ---  613.10 ms/batch
Step 00000500 L_out=2.372 Acc=0.70 ---  599.65 ms/batch
0 / 100
50 / 100
eval accuracy: 0.5389197649274553
mean IOU:0.11349602616156579
Mean IoU = 11.3%
--------------------------------------------------------------------------------------
11.35 | 38.58 46.72 40.62  0.02  0.00  0.00  0.00  3.18  0.00 18.42  0.00  0.00  0.00 
--------------------------------------------------------------------------------------

Best m_IoU of SensatUrban is: 11.350
****EPOCH 1****
Step 00000550 L_out=4.056 Acc=0.64 ---  593.91 ms/batch
Traceback (most recent call last):
  File "/home/takeda/.pyenv/versions/SensatUrban/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1356, in _do_call
    return fn(*args)
  File "/home/takeda/.pyenv/versions/SensatUrban/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1341, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/home/takeda/.pyenv/versions/SensatUrban/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1429, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: 2 root error(s) found.
  (0) Resource exhausted: OOM when allocating tensor with shape[262144,13] and type int64 on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
     [[{{node results/in_top_k/InTopKV2}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

     [[optimizer/gradients/layers/Encoder_layer_4shortcut/Conv2D_grad/tuple/control_dependency_1/_935]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

  (1) Resource exhausted: OOM when allocating tensor with shape[262144,13] and type int64 on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
     [[{{node results/in_top_k/InTopKV2}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

0 successful operations.
0 derived errors ignored.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "main_SensatUrban.py", line 265, in <module>
    model.train(dataset)
  File "/media/takeda/0DC3-0000/SensatUrban/RandLANet.py", line 158, in train
    _, _, summary, l_out, probs, labels, acc = self.sess.run(ops, {self.is_training: True})
  File "/home/takeda/.pyenv/versions/SensatUrban/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 950, in run
    run_metadata_ptr)
  File "/home/takeda/.pyenv/versions/SensatUrban/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1173, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/takeda/.pyenv/versions/SensatUrban/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1350, in _do_run
    run_metadata)
  File "/home/takeda/.pyenv/versions/SensatUrban/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1370, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: 2 root error(s) found.
  (0) Resource exhausted: OOM when allocating tensor with shape[262144,13] and type int64 on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
     [[node results/in_top_k/InTopKV2 (defined at /media/takeda/0DC3-0000/SensatUrban/RandLANet.py:85) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

     [[optimizer/gradients/layers/Encoder_layer_4shortcut/Conv2D_grad/tuple/control_dependency_1/_935]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

  (1) Resource exhausted: OOM when allocating tensor with shape[262144,13] and type int64 on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
     [[node results/in_top_k/InTopKV2 (defined at /media/takeda/0DC3-0000/SensatUrban/RandLANet.py:85) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

0 successful operations.
0 derived errors ignored.

Errors may have originated from an input operation.
Input Source operations connected to node results/in_top_k/InTopKV2:
 loss/GatherV2_2 (defined at /media/takeda/0DC3-0000/SensatUrban/RandLANet.py:75)   
 loss/GatherV2 (defined at /media/takeda/0DC3-0000/SensatUrban/RandLANet.py:67)

Input Source operations connected to node results/in_top_k/InTopKV2:
 loss/GatherV2_2 (defined at /media/takeda/0DC3-0000/SensatUrban/RandLANet.py:75)   
 loss/GatherV2 (defined at /media/takeda/0DC3-0000/SensatUrban/RandLANet.py:67)

Original stack trace for 'results/in_top_k/InTopKV2':
  File "main_SensatUrban.py", line 264, in <module>
    model = Network(dataset, cfg)
  File "/media/takeda/0DC3-0000/SensatUrban/RandLANet.py", line 85, in __init__
    self.correct_prediction = tf.nn.in_top_k(valid_logits, valid_labels, 1)
  File "/home/takeda/.pyenv/versions/SensatUrban/lib/python3.5/site-packages/tensorflow/python/ops/nn_ops.py", line 4784, in in_top_k
    return gen_nn_ops.in_top_kv2(predictions, targets, k, name=name)
  File "/home/takeda/.pyenv/versions/SensatUrban/lib/python3.5/site-packages/tensorflow/python/ops/gen_nn_ops.py", line 5040, in in_top_kv2
    "InTopKV2", predictions=predictions, targets=targets, k=k, name=name)
  File "/home/takeda/.pyenv/versions/SensatUrban/lib/python3.5/site-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
    op_def=op_def)
  File "/home/takeda/.pyenv/versions/SensatUrban/lib/python3.5/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/home/takeda/.pyenv/versions/SensatUrban/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 3616, in create_op
    op_def=op_def)
  File "/home/takeda/.pyenv/versions/SensatUrban/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 2005, in __init__
    self._traceback = tf_stack.extract_stack()
Takenokono commented 3 years ago

Packages in use:

(SensatUrban) [test]% pip freeze
DEPRECATION: Python 3.5 reached the end of its life on September 13th, 2020. Please upgrade your Python as Python 3.5 is no longer maintained. pip 21.0 will drop support for Python 3.5 in January 2021. pip 21.0 will remove support for this functionality.
absl-py==0.13.0
astor==0.8.1
Cython==0.29.15
gast==0.2.2
google-pasta==0.2.0
grpcio==1.39.0
h5py==2.10.0
importlib-metadata==2.1.1
joblib==0.14.1
Keras-Applications==1.0.8
Keras-Preprocessing==1.1.2
Markdown==3.2.2
numpy==1.16.1
open3d-python==0.3.0.0
opt-einsum==3.3.0
pandas==0.25.3
protobuf==3.17.3
python-dateutil==2.8.2
pytz==2021.1
PyYAML==5.3.1
scikit-learn==0.21.3
scipy==1.4.1
six==1.16.0
tensorboard==1.14.0
tensorflow-estimator==1.14.0
tensorflow-gpu==1.14.0
termcolor==1.1.0
Werkzeug==1.0.1
wrapt==1.12.1
zipp==1.2.0
mpautzke commented 3 years ago

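Looks like you hit the GPU memory limit: the ResourceExhaustedError (OOM) in the traceback means TensorFlow could not allocate more memory on GPU:0.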
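
As a rough sanity check, here is a minimal sketch of how large the tensor named in the OOM message actually is. The shape [262144,13] and the int64 dtype are copied from the traceback. Reading 262144 as 4 x 65536 (batch size times points per cloud) is an assumption about the training config, not something the log states, and `tensor_bytes` is a hypothetical helper written for this illustration.

```python
# Back-of-the-envelope size of the tensor named in the OOM message.
# Shape and dtype are copied from the traceback; the batch/point split
# used further down is an assumption, not taken from the log.

DTYPE_BYTES = {"int64": 8, "float32": 4}


def tensor_bytes(shape, dtype):
    """Return the number of bytes a dense tensor of this shape and dtype occupies."""
    count = 1
    for dim in shape:
        count *= dim
    return count * DTYPE_BYTES[dtype]


MIB = 2 ** 20

failed = tensor_bytes((262144, 13), "int64")
print("failed allocation: {:.1f} MiB".format(failed / MIB))

# Assumption: 262144 rows = 4 clouds x 65536 points per batch.
for batch_size in (4, 2, 1):
    size = tensor_bytes((batch_size * 65536, 13), "int64")
    print("batch_size={}: {:.1f} MiB".format(batch_size, size / MIB))
```

The failed allocation is only about 26 MiB, which suggests the GPU was already almost full from the rest of the graph when this op ran, rather than this one tensor being too large. If the assumption about the shape holds, lowering the batch size or the number of points per cloud in the training config should reduce overall memory use roughly in proportion.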