eldar / pose-tensorflow

Human Pose estimation with TensorFlow framework
GNU Lesser General Public License v3.0

saveRestore issue? #17

Closed trops closed 5 years ago

trops commented 7 years ago

Hi, thank you for the awesome library. I have a question about an error I get occasionally when running the demo/singleperson.py file: every second or third run fails with the error below. I have read that it has to do with save/restore and changing variables, but I am not sure how to resolve it, or whether anyone else is having this issue. Thanks!

Here is the traceback:

Traceback (most recent call last):
  File "demo/detect.py", line 164, in <module>
    sess, inputs, outputs = predict.setup_pose_prediction(cfg)
  File "demo/../nnet/predict.py", line 17, in setup_pose_prediction
    restorer.restore(sess, cfg.init_weights)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py", line 1548, in restore
    {self.saver_def.filename_tensor_name: save_path})
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 789, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 997, in _run
    feed_dict_string, options, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1132, in _do_run
    target_list, options, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1152, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.FailedPreconditionError: models/mpii/mpii-single-resnet-101.index
     [[Node: save/RestoreV2_186 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/cpu:0"](_arg_save/Const_0_0, save/RestoreV2_186/tensor_names, save/RestoreV2_186/shape_and_slices)]]

Caused by op 'save/RestoreV2_186', defined at:
  File "demo/detect.py", line 164, in <module>
    sess, inputs, outputs = predict.setup_pose_prediction(cfg)
  File "demo/../nnet/predict.py", line 9, in setup_pose_prediction
    restorer = tf.train.Saver()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py", line 1139, in __init__
    self.build()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py", line 1170, in build
    restore_sequentially=self._restore_sequentially)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py", line 691, in build
    restore_sequentially, reshape)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py", line 407, in _AddRestoreOps
    tensors = self.restore_op(filename_tensor, saveable, preferred_shard)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py", line 247, in restore_op
    [spec.tensor.dtype])[0])
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/gen_io_ops.py", line 640, in restore_v2
    dtypes=dtypes, name=name)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/op_def_library.py", line 767, in apply_op
    op_def=op_def)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 2506, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 1269, in __init__
    self._traceback = _extract_stack()

FailedPreconditionError (see above for traceback): models/mpii/mpii-single-resnet-101.index
     [[Node: save/RestoreV2_186 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/cpu:0"](_arg_save/Const_0_0, save/RestoreV2_186/tensor_names, save/RestoreV2_186/shape_and_slices)]]

eldar commented 7 years ago

Hi! Thanks for reporting this. I have not seen this issue myself. A quick Google search suggests it can happen because of variable initialization. Here's the code in question: https://github.com/eldar/pose-tensorflow/blob/master/nnet/predict.py#L15-L21

According to this https://groups.google.com/a/tensorflow.org/forum/#!topic/discuss/D_rVoQStCJg and this https://stackoverflow.com/questions/33759623/tensorflow-how-to-save-restore-a-model/43784657#comment75010100_43784657, one should initialise first and then restore, which is what the code already does.

Another comment there says you don't need to initialize at all when you restore variables. You could try commenting out the initialisation (lines 17-18) and see if that fixes things for you. (I'm not sure that is the right behavior during training.)

trops commented 7 years ago

Thanks for the response! The issue is that if I run singleperson multiple times, the restore fails. Run #1 of demo/singleperson works great; then, after the program exits, I run it a second time on, say, another image, and I get the restore issue.

So maybe the session is saving the file and then can't restore it? I am experimenting with the code in predict.py to figure out what the session is doing, and even writing my own functions that work with sessions to determine what is happening. Super weird. This would be enormous to figure out.

trops commented 7 years ago

I read files from a directory and run the detection on each. The first full loop works great, and sometimes the second does too, but ultimately it fails with the following:

tensorflow.python.framework.errors_impl.FailedPreconditionError: /tf_files/pose_tensorflow/models/mpii/mpii-single-resnet-101.index
     [[Node: save/RestoreV2_235 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/cpu:0"](_arg_save/Const_0_0, save/RestoreV2_235/tensor_names, save/RestoreV2_235/shape_and_slices)]]

eldar commented 7 years ago

Wait, so you do this multiple times? Can you send me the exact code you have? You shouldn't load the model on each iteration, only once when the program starts. For each image you should execute only lines 19-34 of singleperson.py; everything before line 19 is model loading and setup that you do only once.
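
To illustrate the "load once, predict per image" structure described above, here is a minimal sketch. It does not use the real library: `StubModel` is a hypothetical stand-in for whatever `predict.setup_pose_prediction(cfg)` returns, and `run_on_directory` is an illustrative helper name, not part of pose-tensorflow. The only point is that the expensive setup sits outside the loop.

```python
import os


class StubModel:
    """Hypothetical stand-in for the session/inputs/outputs set up once at start-up."""

    load_count = 0  # counts how many times the (expensive) setup has run

    def __init__(self, weights_path):
        # In the real demo this is where the graph is built and the
        # checkpoint is restored; here we only record that it happened.
        StubModel.load_count += 1
        self.weights_path = weights_path

    def predict(self, image_path):
        # Placeholder for the per-image work (read image, run the session,
        # extract the pose). Returns a dummy string instead of a real pose.
        return "pose for " + os.path.basename(image_path)


def run_on_directory(image_dir, weights_path):
    """Load the model once, then run prediction on every file in image_dir."""
    model = StubModel(weights_path)  # setup: exactly once, before the loop
    results = {}
    for name in sorted(os.listdir(image_dir)):
        # Per-image step only; nothing here rebuilds or restores the model.
        results[name] = model.predict(os.path.join(image_dir, name))
    return results
```

In a real script, the constructor call would be replaced by the model setup from singleperson.py and the body of `predict` by the per-image lines; that mapping is an assumption based on the description above, not something checked against the repository.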

eldar commented 5 years ago

Closing because of the lack of activity for more than a year.