Andreas-Pfeuffer / LSTM-ICNet

Tensorflow Implementation of "Semantic Segmentation of Video Sequences with Convolutional LSTMs" and "Separable Convolutional LSTMs for Faster Video Segmentation"
MIT License

Error during inference #4

Open gruossomonica opened 3 years ago

gruossomonica commented 3 years ago

Hi, thanks for your interesting work. I tried to run network inference with sh scripts/inference_LSTM_ICNet.sh. The Singularity container built correctly on Ubuntu 18.04, and I am using a GTX 1060 GPU with 6 GB of memory. Restoring the pre-trained model fails with the following error: tensorflow.python.framework.errors_impl.DataLossError: not an sstable (bad magic number)

Here is the full output from my terminal:

2020-11-20 12:20:53.647808: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.1
2020-11-20 12:20:56.292789: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.1
this dir: /media/monica/DATA/documenti/Monica/vos_approaches/LSTM-ICNet
add /media/monica/DATA/documenti/Monica/vos_approaches/LSTM-ICNet/src to PYTHONPATH
add /media/monica/DATA/documenti/Monica/vos_approaches/LSTM-ICNet/src/image_reader to PYTHONPATH
add /media/monica/DATA/documenti/Monica/vos_approaches/LSTM-ICNet/src/datasets to PYTHONPATH
add /media/monica/DATA/documenti/Monica/vos_approaches/LSTM-ICNet/models to PYTHONPATH
add /media/monica/DATA/documenti/Monica/vos_approaches/LSTM-ICNet/models/operations to PYTHONPATH
add /media/monica/DATA/documenti/Monica/vos_approaches/LSTM-ICNet/tools to PYTHONPATH
WARNING: Logging before flag parsing goes to stderr.
W1120 12:20:57.192409 140669042886464 lazy_loader.py:50] 
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.

('set input_size of dataset to:', [1024, 2048])
('set input_size of dataset to:', [1024, 2048])
set model variant to: end2end
2020-11-20 12:20:57.238773: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2299965000 Hz
2020-11-20 12:20:57.240917: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5585aef162d0 executing computations on platform Host. Devices:
2020-11-20 12:20:57.240941: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): <undefined>, <undefined>
2020-11-20 12:20:57.244859: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcuda.so.1
2020-11-20 12:20:57.301455: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-20 12:20:57.301803: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5585aef0f390 executing computations on platform CUDA. Devices:
2020-11-20 12:20:57.301825: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): GeForce GTX 1060, Compute Capability 6.1
2020-11-20 12:20:57.301983: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-20 12:20:57.302254: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties: 
name: GeForce GTX 1060 major: 6 minor: 1 memoryClockRate(GHz): 1.6705
pciBusID: 0000:01:00.0
2020-11-20 12:20:57.302277: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.1
2020-11-20 12:20:57.302385: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10
2020-11-20 12:20:57.302425: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcufft.so.10
2020-11-20 12:20:57.302470: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcurand.so.10
2020-11-20 12:20:57.338044: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusolver.so.10
2020-11-20 12:20:57.345000: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusparse.so.10
2020-11-20 12:20:57.345108: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2020-11-20 12:20:57.345201: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-20 12:20:57.345495: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-20 12:20:57.345714: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0
2020-11-20 12:20:59.340062: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-11-20 12:20:59.340101: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187]      0 
2020-11-20 12:20:59.340109: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0:   N 
2020-11-20 12:20:59.342055: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-20 12:20:59.342373: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-20 12:20:59.342622: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/device:GPU:0 with 5391 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1060, pci bus id: 0000:01:00.0, compute capability: 6.1)
found devices:  [u'/device:GPU:0']
batch_size /gpu:0 : 1
W1120 12:20:59.348171 140669042886464 deprecation.py:506] From /usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/init_ops.py:1251: calling __init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
time_sequence length: 4
W1120 12:21:02.311958 140669042886464 deprecation.py:323] From /media/monica/DATA/documenti/Monica/vos_approaches/LSTM-ICNet/models/network.py:357: dynamic_rnn (from tensorflow.python.ops.rnn) is deprecated and will be removed in a future version.
Instructions for updating:
Please use `keras.layers.RNN(cell)`, which is equivalent to this API
kernel:  [3, 3, 256, 512]
W1120 12:21:04.612608 140669042886464 deprecation.py:506] From /usr/local/lib/python2.7/dist-packages/tensorflow/python/autograph/impl/api.py:253: calling __init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
time_sequence length: 4
kernel:  [3, 3, 256, 512]
time_sequence length: 4
kernel:  [3, 3, 256, 512]
W1120 12:21:05.258019 140669042886464 deprecation.py:506] From /usr/local/lib/python2.7/dist-packages/tensorflow/python/util/dispatch.py:180: calling expand_dims (from tensorflow.python.ops.array_ops) with dim is deprecated and will be removed in a future version.
Instructions for updating:
Use the `axis` argument instead
W1120 12:21:05.266261 140669042886464 deprecation.py:323] From inference.py:218: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
W1120 12:21:05.320256 140669042886464 deprecation.py:323] From /usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/metrics_impl.py:1179: div (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Deprecated in favor of operator or tf.math.divide.
2020-11-20 12:21:05.368774: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-20 12:21:05.369116: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties: 
name: GeForce GTX 1060 major: 6 minor: 1 memoryClockRate(GHz): 1.6705
pciBusID: 0000:01:00.0
2020-11-20 12:21:05.369144: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.1
2020-11-20 12:21:05.369257: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10
2020-11-20 12:21:05.369280: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcufft.so.10
2020-11-20 12:21:05.369308: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcurand.so.10
2020-11-20 12:21:05.369355: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusolver.so.10
2020-11-20 12:21:05.369390: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusparse.so.10
2020-11-20 12:21:05.369413: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2020-11-20 12:21:05.369518: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-20 12:21:05.369837: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-20 12:21:05.370087: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0
2020-11-20 12:21:05.370113: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-11-20 12:21:05.370121: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187]      0 
2020-11-20 12:21:05.370127: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0:   N 
2020-11-20 12:21:05.370196: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-20 12:21:05.370457: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-20 12:21:05.370679: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 5391 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1060, pci bus id: 0000:01:00.0, compute capability: 6.1)
2020-11-20 12:21:05.912537: W tensorflow/compiler/jit/mark_for_compilation_pass.cc:1412] (One-time warning): Not using XLA:CPU for cluster because envvar TF_XLA_FLAGS=--tf_xla_cpu_global_jit was not set.  If you want XLA:CPU, either set that envvar, or use experimental_jit_scope to enable XLA:CPU.  To confirm that XLA is active, pass --vmodule=xla_compilation_cache=1 (as a proper command-line flag, not via TF_XLA_FLAGS) or set the envvar XLA_FLAGS=--xla_hlo_profile.
results/2020_02_07b_LSTM_ICNet_v5_cityscape_sequence_4_color_19_batch1_60k
W1120 12:21:06.259293 140669042886464 deprecation.py:323] From /usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py:1276: checkpoint_exists (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file APIs to check for files with this prefix.
2020-11-20 12:21:06.663228: W tensorflow/core/framework/op_kernel.cc:1502] OP_REQUIRES failed at save_restore_v2_ops.cc:184 : Data loss: not an sstable (bad magic number)
Traceback (most recent call last):
  File "inference.py", line 718, in <module>
    main()
  File "inference.py", line 712, in main
    print('mIoU: {}'.format(evaluate(config, evaluation_set=args.evaluation_set, plot_confusionMatrix=args.plot_confusionMatrix)))
  File "inference.py", line 263, in evaluate
    load(loader, sess, ckpt.model_checkpoint_path)
  File "inference.py", line 100, in load
    saver.restore(sess, ckpt_path)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1286, in restore
    {self.saver_def.filename_tensor_name: save_path})
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 950, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1173, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1350, in _do_run
    run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1370, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.DataLossError: not an sstable (bad magic number)
     [[node save/RestoreV2 (defined at inference.py:262) ]]

Original stack trace for u'save/RestoreV2':
  File "inference.py", line 718, in <module>
    main()
  File "inference.py", line 712, in main
    print('mIoU: {}'.format(evaluate(config, evaluation_set=args.evaluation_set, plot_confusionMatrix=args.plot_confusionMatrix)))
  File "inference.py", line 262, in evaluate
    loader = tf.compat.v1.train.Saver(var_list=tf.compat.v1.global_variables())
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 825, in __init__
    self.build()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 837, in build
    self._build(self._filename, build_save=True, build_restore=True)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 875, in _build
    build_restore=build_restore)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 508, in _build_internal
    restore_sequentially, reshape)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 328, in _AddRestoreOps
    restore_sequentially)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 575, in bulk_restore
    return io_ops.restore_v2(filename_tensor, names, slices, dtypes)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_io_ops.py", line 1696, in restore_v2
    name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 3616, in create_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2005, in __init__
    self._traceback = tf_stack.extract_stack()

Can anyone help me? What went wrong?

Andreas-Pfeuffer commented 3 years ago

Unfortunately, I cannot reproduce your error on my machine (GTX 1070 GPU with 8 GB of memory).

However, your GPU has comparatively little memory. A GPU with more memory might solve the problem. Alternatively, you can reduce the sequence length from 4 to 3 or 2.
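
One thing that can be checked independently of GPU memory: in the log above, the "not an sstable (bad magic number)" message comes from the save/RestoreV2 op, which reads the checkpoint from disk, so it may be worth confirming that the checkpoint files themselves look intact. The following is only a minimal sketch using the Python 3 standard library. The function name inspect_checkpoint_files is hypothetical and not part of this repository, and the Git LFS pointer check is an assumption about one way a download can end up incomplete, not a confirmed cause of this error.

```python
import glob
import os

# First line of a Git LFS pointer file (a small text stub stored in place of the real binary).
LFS_HEADER = b"version https://git-lfs.github.com/spec/v1"


def inspect_checkpoint_files(ckpt_prefix):
    """List every file that shares the checkpoint prefix.

    Returns a list of (path, size_in_bytes, looks_like_lfs_pointer) tuples,
    sorted by path. Hypothetical helper, for illustration only.
    """
    report = []
    # glob.escape keeps characters such as '[' in the path from being treated as patterns.
    for path in sorted(glob.glob(glob.escape(ckpt_prefix) + "*")):
        if not os.path.isfile(path):
            continue
        with open(path, "rb") as handle:
            head = handle.read(len(LFS_HEADER))
        report.append((path, os.path.getsize(path), head == LFS_HEADER))
    return report


def print_report(ckpt_prefix):
    """Print one line per file so suspicious entries are easy to spot."""
    files = inspect_checkpoint_files(ckpt_prefix)
    if not files:
        print("no files found for prefix:", ckpt_prefix)
    for path, size, is_pointer in files:
        note = "  <- looks like a Git LFS pointer, not real data" if is_pointer else ""
        print("{:>12d} bytes  {}{}".format(size, path, note))
```

To use it, pass the checkpoint prefix that inference.py resolves (the value of ckpt.model_checkpoint_path in the traceback) to print_report. Files that are missing, only a few hundred bytes long, or flagged as pointers would suggest re-downloading the pre-trained model before changing the GPU or the sequence length.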