allenai / document-qa


Converting GPU models to CPU #29

Closed: RachelKer closed this issue 6 years ago

RachelKer commented 6 years ago

Hello,

First of all, thank you for publishing very complete code and for the help you still provide here. I trained your models on other datasets successfully on GPU with the 'ablate_squad.py' script, but when I try to convert them with 'convert_to_cpu.py' I get the following error:

2018-04-18 09:33:51.087843: W tensorflow/core/framework/op_kernel.cc:1202] OP_REQUIRES failed at save_restore_v2_ops.cc:184 : Not found: Key global_step not found in checkpoint

Happening somewhere here:

  File "docqa/scripts/convert_to_cpu.py", line 164, in <module>
    main()
  File "docqa/scripts/convert_to_cpu.py", line 161, in main
    convert(args.target_model, args.output_dir, args.best_weights)
  File "docqa/scripts/convert_to_cpu.py", line 48, in convert
    md.restore_checkpoint(sess)
  File "/home/rachel/BIDAF_Allen/document-qa/docqa/model_dir.py", line 88, in restore_checkpoint
    saver = tf.train.Saver(var_list)

I'm a bit at a loss about what to do, especially since I can load this model fine for other actions (evaluation, resuming training, and so on). Do you have any insight into this?

Thanks for your help,

Rachel

chrisc36 commented 6 years ago

By default that script looks for a global_step tensor in the input checkpoint so we know what step number to use when saving the output weights. It looks like the target checkpoint you used did not include that tensor, which caused the error.

You could just comment out all mentions of global_step in the convert script and it should work; the only disadvantage is that the weights will not be saved with an associated step number in the new checkpoint.
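
If you want to double-check whether global_step is really missing, here is a minimal sketch (not part of the repo) that inspects a checkpoint with the standard TF 1.x checkpoint reader; the path is a placeholder you would replace with your own model's save directory:

    import tensorflow as tf

    # Placeholder path: point this at your model's save directory.
    ckpt = tf.train.latest_checkpoint("path/to/model/save")
    reader = tf.train.NewCheckpointReader(ckpt)

    # True if the checkpoint carries a global_step tensor, False otherwise.
    print("has global_step:", reader.has_tensor("global_step"))

    # List everything that was saved, i.e. what a Saver can restore from it.
    for name, shape in sorted(reader.get_variable_to_shape_map().items()):
        print(name, shape)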

babych-d commented 6 years ago

@chrisc36 how can I include this tensor in the checkpoint? Just removing global_step doesn't work for me.

RachelKer commented 6 years ago

I assume, @dusk256, that you're referring to this line (111):

saver.save(sess, join(save_dir, "checkpoint"), sess.run(global_step))

I replaced the sess.run(global_step) call there with an arbitrary value:

saver.save(sess, join(save_dir, "checkpoint"), 1)

Thanks @chrisc36, I tested it and it does work for me. I now have a slow converted CPU model on my laptop!
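
If you would rather keep a meaningful step number than hard-code one, here is a small sketch (not the repo's code) that parses the trailing step out of the checkpoint prefix, relying on the TF convention of naming checkpoints prefix-<step>; the usage line below is hypothetical and assumes the script keeps the path it restored the weights from:

    import re
    from os.path import join

    def step_from_checkpoint(checkpoint_path):
        """Return the trailing step number of a checkpoint prefix, or 0 if absent."""
        match = re.search(r"-(\d+)$", checkpoint_path)
        return int(match.group(1)) if match else 0

    # Hypothetical usage, where checkpoint_to_restore is the path the script
    # restored from, e.g. ".../save/checkpoint-61000":
    # saver.save(sess, join(save_dir, "checkpoint"), step_from_checkpoint(checkpoint_to_restore))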

Anurag461 commented 6 years ago

@chrisc36 I commented out all mentions of global_step within the code and received another error message around line 129:

  File "/data/elmo/bi-att-flow-master/document-qa/docqa/scripts/convert_to_cpu.py", line 130, in convert
    pickle.dump(model, f)
  TypeError: can't pickle _thread.lock objects

Why might this be the case? Thanks a lot for your help. @RachelKer can you provide me with your email? I have some doubts regarding the conversion of models.
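
For context, that TypeError generally means the object being pickled holds an unpicklable handle such as a threading.Lock (a live tf.Session carries one internally). A minimal sketch, deliberately unrelated to the docqa code itself, showing the failure mode and the usual workaround of dropping the handle before pickling:

    import pickle
    import threading

    class Model:
        def __init__(self):
            self.weights = [1, 2, 3]
            self.lock = threading.Lock()  # unpicklable, like a live tf.Session handle

        def __getstate__(self):
            # Drop the unpicklable handle before pickling.
            state = self.__dict__.copy()
            state.pop("lock", None)
            return state

        def __setstate__(self, state):
            # Recreate the handle after unpickling.
            self.__dict__.update(state)
            self.lock = threading.Lock()

    m = Model()
    data = pickle.dumps(m)  # works; without __getstate__ this raises the same TypeError
    print(pickle.loads(data).weights)  # [1, 2, 3]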