RachelKer closed 6 years ago
By default that script looks for a global_step tensor in the input checkpoint so we know what step number to use when saving the output weights. It looks like the target checkpoint you used did not include that tensor, and that caused the error.
You could just comment out all mentions of global_step in the convert script and it should work. The only disadvantage is that the weights will not be saved with an associated step number in the new checkpoint.
@chrisc36 how can I include this tensor in the checkpoint? Just removing global_step doesn't work for me.
I assume @dusk256 you're referring to this line (111):
saver.save(sess, join(save_dir, "checkpoint"), sess.run(global_step))
I replaced the sess.run(global_step) call there with an arbitrary value:
saver.save(sess, join(save_dir, "checkpoint"), 1)
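For anyone who would rather keep the real step number when it is available, here is a minimal sketch of a fallback, not the script's actual behaviour. It assumes you have the path of the input checkpoint at hand (called checkpoint_path here, a hypothetical name) and uses tf.train.list_variables and tf.train.load_variable to inspect it. pick_step and step_from_checkpoint are illustrative helpers, not part of document-qa.

```python
def pick_step(checkpoint_vars, load_value, default=0):
    """Return the saved global_step if the checkpoint has one, else `default`.

    checkpoint_vars: iterable of (name, shape) pairs, as returned by
                     tf.train.list_variables.
    load_value:      callable that maps a variable name to its stored value.
    """
    names = {name for name, _shape in checkpoint_vars}
    if "global_step" in names:
        return int(load_value("global_step"))
    return default


def step_from_checkpoint(checkpoint_path, default=0):
    # Imported lazily so pick_step can be used (and tested) without TensorFlow.
    import tensorflow as tf

    return pick_step(
        tf.train.list_variables(checkpoint_path),
        lambda name: tf.train.load_variable(checkpoint_path, name),
        default,
    )
```

The returned integer can then be passed as the third argument of saver.save in place of sess.run(global_step), since that argument accepts a plain number as well as a tensor.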
Thanks @chrisc36, I tested it and it works for me. I now have a slow converted CPU model on my laptop!
@chrisc36 I commented out all mentions of global_step in the code and got another error message:
File "/data/elmo/bi-att-flow-master/document-qa/docqa/scripts/convert_to_cpu.py", line 130, in convert
pickle.dump(model, f)
TypeError: can't pickle _thread.lock objects
This is in line 129. Why might this be the case? Thanks a lot for your help.
@RachelKer can you provide me with your email? I have some questions about converting models.
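Regarding the pickle error above: this TypeError generally means that something reachable from the object being pickled holds a threading lock. The traceback alone does not say which attribute of the model that is. Below is a minimal sketch for narrowing it down, assuming the object exposes its attributes through vars(). find_unpicklable and Dummy are hypothetical names used for illustration, not part of document-qa.

```python
import pickle
import threading


def find_unpicklable(obj):
    """Return the names of attributes in vars(obj) that fail to pickle."""
    bad = []
    for name, value in vars(obj).items():
        try:
            pickle.dumps(value)
        except Exception:
            # Any failure (TypeError, PicklingError, ...) marks the attribute.
            bad.append(name)
    return bad


class Dummy:
    """Stand-in object: one picklable attribute, one holding a lock."""

    def __init__(self):
        self.weights = [1.0, 2.0]
        self.lock = threading.Lock()
```

Calling find_unpicklable(model) just before the pickle.dump call should list the attributes to look at. The check only covers top-level attributes, so it may need to be repeated on whatever it reports.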
Hello,
First of all, thank you for publishing such complete code and for the help you still provide here. I trained your models on other datasets successfully on GPU with the 'ablate_squad.py' script, but when I try to convert them with 'convert.py' I get the following error:
2018-04-18 09:33:51.087843: W tensorflow/core/framework/op_kernel.cc:1202] OP_REQUIRES failed at save_restore_v2_ops.cc:184 : Not found: Key global_step not found in checkpoint
It happens somewhere here:
  File "docqa/scripts/convert_to_cpu.py", line 164, in <module>
    main()
  File "docqa/scripts/convert_to_cpu.py", line 161, in main
    convert(args.target_model, args.output_dir, args.best_weights)
  File "docqa/scripts/convert_to_cpu.py", line 48, in convert
    md.restore_checkpoint(sess)
  File "/home/rachel/BIDAF_Allen/document-qa/docqa/model_dir.py", line 88, in restore_checkpoint
    saver = tf.train.Saver(var_list)
I'm a bit at a loss about what to do, especially since I can load this model fine for other actions (evaluation, resuming training, etc.). Do you have any insight into this?
Thanks for your help,
Rachel