THUNLP-MT / THUMT

An open-source neural machine translation toolkit developed by Tsinghua Natural Language Processing Group
BSD 3-Clause "New" or "Revised" License

How to restore a trained model and continue training from saved checkpoints (not using the method in translator.py)? #22

Closed: Epsilon-Lee closed this issue 6 years ago

Epsilon-Lee commented 6 years ago

I used the method suggested here by LingjiaDeng to restore a checkpoint from the train/eval folder. The exact code is below; I ran it in IPython:

import tensorflow as tf

# Rebuild the graph from the checkpoint's MetaGraph, then restore the weights
saver = tf.train.import_meta_graph('train/eval/model.ckpt-50000.meta')
sess = tf.Session()
saver.restore(sess, 'train/eval/model.ckpt-50000')

The error is as follows:

Caused by op u'create_train_op/gradients/parallel_1/transformer/Gather_grad/Shape', defined at:
  File "/usr/local/bin/ipython", line 11, in <module>
    sys.exit(start_ipython())
  File "/usr/local/lib/python2.7/dist-packages/IPython/__init__.py", line 119, in start_ipython
    return launch_new_instance(argv=argv, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/traitlets/config/application.py", line 658, in launch_instance
    app.start()
  File "/usr/local/lib/python2.7/dist-packages/IPython/terminal/ipapp.py", line 355, in start
    self.shell.mainloop()
  File "/usr/local/lib/python2.7/dist-packages/IPython/terminal/interactiveshell.py", line 493, in mainloop
    self.interact()
  File "/usr/local/lib/python2.7/dist-packages/IPython/terminal/interactiveshell.py", line 484, in interact
    self.run_cell(code, store_history=True)
  File "/usr/local/lib/python2.7/dist-packages/IPython/core/interactiveshell.py", line 2718, in run_cell
    interactivity=interactivity, compiler=compiler, result=result)
  File "/usr/local/lib/python2.7/dist-packages/IPython/core/interactiveshell.py", line 2822, in run_ast_nodes
    if self.run_code(code, result):
  File "/usr/local/lib/python2.7/dist-packages/IPython/core/interactiveshell.py", line 2882, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-2-9adf313a28f3>", line 1, in <module>
    saver = tf.train.import_meta_graph('train_bpe_nosrctgtpos/eval/model.ckpt-50000.meta')
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1810, in import_meta_graph
    **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/meta_graph.py", line 660, in import_scoped_meta_graph
    **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/meta_graph.py", line 660, in import_scoped_meta_graph
    producer_op_list=producer_op_list)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/importer.py", line 313, in import_graph_def
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2956, in create_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1470, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

InvalidArgumentError (see above for traceback): Cannot colocate nodes 'create_train_op/gradients/parallel_1/transformer/Gather_grad/Shape' and 'create_train_op/Adam/update_transformer/source_embedding/sub_3/x': Cannot merge devices with incompatible ids: '/device:GPU:1' and '/device:GPU:0'
         [[Node: create_train_op/gradients/parallel_1/transformer/Gather_grad/Shape = Const[_class=["loc:@parallel_1/transformer/Gather", "loc:@transformer/source_embedding"], dtype=DT_INT64, value=Tensor<type: int64 shape: [2] values: 34673 512>, _device="/device:GPU:1"]()]]

What does this error mean? Is there a more elegant way to resume training from pre-trained model parameters?

I found that if I train the model on a single GPU, so that parallel_model does no data parallelism, the checkpoint can be reloaded successfully with the code above. Is that the problem?

Thanks very much.

Playinf commented 6 years ago

According to the error message, I think you should set allow_soft_placement=True when creating tf.Session.
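
For reference, a minimal sketch of that fix applied to the snippet from the question (same checkpoint path); allow_soft_placement is a standard tf.ConfigProto option:

import tensorflow as tf

# allow_soft_placement lets TensorFlow move nodes that were pinned to a
# device the current run cannot satisfy (e.g. /device:GPU:1) onto an
# available device instead of raising the colocation error above.
config = tf.ConfigProto(allow_soft_placement=True)
saver = tf.train.import_meta_graph('train/eval/model.ckpt-50000.meta')
sess = tf.Session(config=config)
saver.restore(sess, 'train/eval/model.ckpt-50000')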

Epsilon-Lee commented 6 years ago

Thanks for your quick response; your solution resolved my problem immediately.

If you have time to answer, I have a few more questions, especially regarding parameter reloading :)

  1. During inference (in thumt/bin/translator.py), can we only initialize parameters from (variable name, value) pairs and assign them with an assign_op? Is there a more elegant way to reload a model at test time?
  2. Since I used to use PyTorch (always on a single GPU), I am curious how TensorFlow resolves GPU resource mismatches when reloading a model and resuming training. That is: can TensorFlow only continue training from a checkpoint on the same GPU resources (same GPU IDs), since the previously built train_op is device-aware? Or should the following always be re-executed to re-allocate computation to the newly given GPUs?
    # In trainer.py: re-allocate computation to the newly given GPU devices
    sharded_losses = parallel.parallel_model(
        model.get_training_func(initializer),
        features,
        params.device_list
    )

Many thanks for your patience!

Playinf commented 6 years ago
  1. Checkpoints can be loaded automatically by using MonitoredSession (see the sketch after this list). THUMT chose to use an assign_op because we need to support model ensembling during inference.
  2. We only need the saved parameters in order to restore training. GPU assignment is done at graph construction time, and a new graph is constructed whenever trainer.py is executed, so the checkpoint does not tie training to the original devices.
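
For illustration, a minimal sketch of the MonitoredSession approach; the loss and the checkpoint directory below are placeholders, not THUMT's actual trainer code:

import tensorflow as tf

# Build the graph for the current run; device placement is decided here,
# not read back from the checkpoint.
global_step = tf.train.get_or_create_global_step()
loss = tf.reduce_sum(tf.get_variable("w", shape=[2, 2]) ** 2)  # stand-in loss
train_op = tf.train.AdamOptimizer(1e-3).minimize(loss, global_step=global_step)

# MonitoredTrainingSession restores the latest checkpoint in checkpoint_dir
# (if one exists), resumes the global step, and saves new checkpoints as it goes.
hooks = [tf.train.StopAtStepHook(last_step=100)]
with tf.train.MonitoredTrainingSession(checkpoint_dir="train",
                                       hooks=hooks) as sess:
    while not sess.should_stop():
        sess.run(train_op)

Because only variable values are read from the checkpoint, the same weights can be restored into a graph built for a different device list.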