Closed · 7017227 closed 6 years ago
When I test with the 'checkpoint' file it shows a data loss error, and when I test with the .meta or .data file it shows this error:
NotFoundError (see above for traceback): Unsuccessful TensorSliceReader constructor: Failed to find any matching files for out/lsp_alexnet_imagenet_small/checkpoint-100000.data
[[Node: save/RestoreV2_3 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/cpu:0"](_arg_save/Const_0_0, save/RestoreV2_3/tensor_names, save/RestoreV2_3/shape_and_slices)]]
[[Node: save/RestoreV2/_37 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/gpu:0", send_device="/job:localhost/replica:0/task:0/cpu:0", send_device_incarnation=1, tensor_name="edge_74_save/RestoreV2", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:0"]()]]
I will have to check this. Now I'm a bit busy with other projects.
Try changing the tf.train.Saver version.
How do I change the tf.Saver version?
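In case it helps, here is a minimal sketch of what I understand "changing the Saver version" to mean (assuming a TF build with the `compat.v1` layer; on old TF 1.x you would just use `import tensorflow as tf`):

```python
import tensorflow.compat.v1 as tf  # on plain TF 1.x: import tensorflow as tf

tf.disable_eager_execution()

# A dummy variable so the Saver has something to manage.
v = tf.get_variable("v", shape=[1])

# write_version picks the on-disk format the Saver *writes*:
# SaverDef.V1 produces a single checkpoint file, while SaverDef.V2 (the
# default since TF 0.12) produces a prefix plus .index / .data-00000-of-00001
# shards. Restore auto-detects the format, so a version mismatch usually
# means the restore path is wrong rather than the Saver version.
saver = tf.train.Saver(write_version=tf.train.SaverDef.V1)
```

Note that this only changes how new checkpoints are saved; it does not change how an existing checkpoint is read.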
I tried to test the trained LSP snapshots during training:
wonjinlee@alpha:~/deeppose/out/lsp_alexnet_imagenet_small$ ls
checkpoint
checkpoint-100000.data-00000-of-00001
checkpoint-100000.index
checkpoint-100000.meta
checkpoint-110000.data-00000-of-00001
checkpoint-110000.index
checkpoint-110000.meta
checkpoint-120000.data-00000-of-00001
checkpoint-120000.index
checkpoint-120000.meta
checkpoint-130000.data-00000-of-00001
checkpoint-130000.index
checkpoint-130000.meta
checkpoint-90000.data-00000-of-00001
checkpoint-90000.index
checkpoint-90000.meta
events.out.tfevents.1510238719.alpha
params.dump_171108_222950.txt
params.dump_171108_223930.txt
params.dump_171108_224108.txt
params.dump_171108_224641.txt
params.dump_171109_002231.txt
params.dump_171109_020558.txt
params.dump_171109_034216.txt
params.dump_171109_043955.txt
params.dump_171109_060922.txt
params.dump_171109_061701.txt
params.dump_171109_145127.txt
params.dump_171109_145344.txt
params.dump_171109_145635.txt
params.dump_171109_170839.txt
params.dump_171109_234514.txt
But it shows this error:
2017-11-10 17:42:08.970095: W tensorflow/core/framework/op_kernel.cc:1192] Data loss: Unable to open table file out/lsp_alexnet_imagenet_small/checkpoint: Data loss: not an sstable (bad magic number): perhaps your file is in a different file format and you need to use a different restore operator?
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1327, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1306, in _run_fn
    status, run_metadata)
  File "/usr/lib/python3.5/contextlib.py", line 66, in __exit__
    next(self.gen)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
    pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.DataLossError: Unable to open table file out/lsp_alexnet_imagenet_small/checkpoint: Data loss: not an sstable (bad magic number): perhaps your file is in a different file format and you need to use a different restore operator?
[[Node: save/RestoreV2_5 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/cpu:0"](_arg_save/Const_0_0, save/RestoreV2_5/tensor_names, save/RestoreV2_5/shape_and_slices)]]
[[Node: save/RestoreV2/_37 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/gpu:0", send_device="/job:localhost/replica:0/task:0/cpu:0", send_device_incarnation=1, tensor_name="edge_74_save/RestoreV2", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:0"]()]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "tests/test_snapshot.py", line 116, in <module>
    main(dataset_name, snapshot_path)
  File "tests/test_snapshot.py", line 79, in main
    test_net(test_dataset, test_iterator, dataset_name, snapshot_path)
  File "tests/test_snapshot.py", line 92, in test_net
    gpu_memory_fraction=0.32)  # Set how much GPU memory to reserve for the network
  File "/home/wonjinlee/deeppose/scripts/regressionnet.py", line 94, in create_regression_net
    saver.restore(net.sess, init_snapshot_path)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py", line 1560, in restore
    {self.saver_def.filename_tensor_name: save_path})
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 895, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1124, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1321, in _do_run
    options, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1340, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.DataLossError: Unable to open table file out/lsp_alexnet_imagenet_small/checkpoint: Data loss: not an sstable (bad magic number): perhaps your file is in a different file format and you need to use a different restore operator?
[[Node: save/RestoreV2_5 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/cpu:0"](_arg_save/Const_0_0, save/RestoreV2_5/tensor_names, save/RestoreV2_5/shape_and_slices)]]
[[Node: save/RestoreV2/_37 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/gpu:0", send_device="/job:localhost/replica:0/task:0/cpu:0", send_device_incarnation=1, tensor_name="edge_74_save/RestoreV2", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:0"]()]]
Caused by op 'save/RestoreV2_5', defined at:
  File "tests/test_snapshot.py", line 116, in <module>
    main(dataset_name, snapshot_path)
  File "tests/test_snapshot.py", line 79, in main
    test_net(test_dataset, test_iterator, dataset_name, snapshot_path)
  File "tests/test_snapshot.py", line 92, in test_net
    gpu_memory_fraction=0.32)  # Set how much GPU memory to reserve for the network
  File "/home/wonjinlee/deeppose/scripts/regressionnet.py", line 93, in create_regression_net
    saver = tf.train.Saver()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py", line 1140, in __init__
    self.build()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py", line 1172, in build
    filename=self._filename)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py", line 688, in build
    restore_sequentially, reshape)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py", line 407, in _AddRestoreOps
    tensors = self.restore_op(filename_tensor, saveable, preferred_shard)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py", line 247, in restore_op
    [spec.tensor.dtype])[0])
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/gen_io_ops.py", line 663, in restore_v2
    dtypes=dtypes, name=name)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/op_def_library.py", line 767, in apply_op
    op_def=op_def)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 2630, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 1204, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access
DataLossError (see above for traceback): Unable to open table file out/lsp_alexnet_imagenet_small/checkpoint: Data loss: not an sstable (bad magic number): perhaps your file is in a different file format and you need to use a different restore operator?
[[Node: save/RestoreV2_5 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/cpu:0"](_arg_save/Const_0_0, save/RestoreV2_5/tensor_names, save/RestoreV2_5/shape_and_slices)]]
[[Node: save/RestoreV2/_37 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/gpu:0", send_device="/job:localhost/replica:0/task:0/cpu:0", send_device_incarnation=1, tensor_name="edge_74_save/RestoreV2", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:0"]()]]
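For context (my understanding, not confirmed by the thread): the file literally named `checkpoint` is a small text-format CheckpointState file that only lists checkpoint prefixes; it contains no tensor data, which is why RestoreV2 rejects it with "not an sstable (bad magic number)". A stdlib sketch of what it typically contains and how the current prefix could be read back (the example contents are illustrative, not from this repository):

```python
import re

# Illustrative contents of the `checkpoint` state file (text format):
state_text = '''model_checkpoint_path: "checkpoint-130000"
all_model_checkpoint_paths: "checkpoint-90000"
all_model_checkpoint_paths: "checkpoint-130000"
'''

def current_prefix(text):
    """Return the prefix named by model_checkpoint_path, or None if absent."""
    m = re.search(r'model_checkpoint_path:\s*"([^"]+)"', text)
    return m.group(1) if m else None

print(current_prefix(state_text))  # → checkpoint-130000
```

In TensorFlow itself, tf.train.latest_checkpoint(checkpoint_dir) does this resolution for you and returns the newest prefix.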
Why does this kind of error happen? Does testing not work while training is still running? How can I resolve this error?
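As far as I can tell, saver.restore() must be given the checkpoint *prefix* (e.g. out/lsp_alexnet_imagenet_small/checkpoint-100000), not the `checkpoint` state file and not a .data/.index/.meta shard. A stdlib sketch (the helper name is mine; the suffix patterns are the usual V2 checkpoint naming) that normalizes whatever path is passed in:

```python
import re

def to_restore_prefix(path):
    """Strip V2 checkpoint shard suffixes so the result is a restore prefix."""
    # e.g. checkpoint-100000.data-00000-of-00001 -> checkpoint-100000
    return re.sub(r'\.(meta|index|data-\d{5}-of-\d{5})$', '', path)

print(to_restore_prefix("out/lsp_alexnet_imagenet_small/checkpoint-100000.data-00000-of-00001"))
# → out/lsp_alexnet_imagenet_small/checkpoint-100000
```

With that prefix, saver.restore(sess, prefix) should find the .index and .data files itself; alternatively, tf.train.latest_checkpoint("out/lsp_alexnet_imagenet_small") returns the newest valid prefix directly.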