How to restore a trained model and continue training from saved checkpoints (not using the method in #22)

Epsilon-Lee commented 6 years ago

I used the method described here by LingjiaDeng to restore a checkpoint from the train/eval folder. The exact code, which I ran in IPython, is below:

saver = tf.train.import_meta_graph('train/eval/model.ckpt-50000.meta')
sess = tf.Session()
saver.restore(sess, 'train/eval/model.ckpt-50000')

The error is as follows:

What does this error mean? Is there a more elegant way to resume training from pre-trained model parameters?

I found that if I train the model on a single GPU, so that parallel_model does no data parallelism, the checkpoint can be reloaded successfully in the way shown above. Is that a problem?

Thanks very much.

Playinf commented 6 years ago

According to the error message, I think you should set allow_soft_placement=True when creating tf.Session.
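
A minimal sketch of that suggestion, assuming a TensorFlow build that exposes the 1.x API through `tensorflow.compat.v1`. The helper name `restore_session` is hypothetical and is not part of THUMT. With soft placement enabled, ops that were pinned to a device that does not exist in the current process are moved to an available device instead of raising an error.

```python
import tensorflow.compat.v1 as tf

tf.disable_eager_execution()  # use graph mode, as in TensorFlow 1.x


def restore_session(checkpoint_prefix):
    """Rebuild the graph from `<prefix>.meta` and restore its variables.

    Hypothetical helper for illustration; `checkpoint_prefix` is a path such
    as 'train/eval/model.ckpt-50000' (without the '.meta' suffix).
    """
    graph = tf.Graph()
    with graph.as_default():
        # The .meta file stores the graph, including the device assignments
        # that were made when the model was trained (e.g. several GPUs).
        saver = tf.train.import_meta_graph(checkpoint_prefix + ".meta")
        # Let TensorFlow fall back to an available device when the device
        # recorded in the graph cannot be used in this process.
        config = tf.ConfigProto(allow_soft_placement=True)
        sess = tf.Session(graph=graph, config=config)
        saver.restore(sess, checkpoint_prefix)
    return sess


# Usage (path taken from the question above):
# sess = restore_session("train/eval/model.ckpt-50000")
```

Another option that may help in the same situation is passing `clear_devices=True` to `tf.train.import_meta_graph`, which drops the stored device assignments at import time.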

Epsilon-Lee commented 6 years ago

Thanks for your quick response; your solution resolved my problem right away.

If you have time, I have a few more questions, especially regarding parameter reloading :)

  1. During inference (in thumt/bin/), can we only initialize parameters by taking (variable name, value) pairs and assigning them with an assign_op? Is there a more elegant way to reload a model at test time?
  2. Since I am used to PyTorch (always on a single GPU), I am curious how TensorFlow handles a GPU resource mismatch when reloading a model and resuming training. That is: can TensorFlow only continue training from a checkpoint on the same GPU resources (same GPU IDs), since the previously built train_op is resource-aware? Or should the following always be re-executed to make sure computation is re-allocated to the GPUs now available?
    # In to re-allocate computation to newly given GPU resource
    sharded_losses = parallel.parallel_model(

Many thanks for your patience!

Playinf commented 6 years ago
  1. Checkpoints can be loaded automatically by using MonitoredSession. THUMT uses assign_op instead because we need to support model ensembles during inference.
  2. We only need saved parameters in order to restore training. The GPU assignment is done by graph construction, and a new graph will be constructed when is executed.
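
Both points can be illustrated with a small sketch, assuming the 1.x API via `tensorflow.compat.v1`. The function `run_training` and its one-variable toy model are hypothetical stand-ins for THUMT's real graph construction. The graph is rebuilt on every call, so device placement always comes from the current run, while `MonitoredTrainingSession` restores only the variable values from the latest checkpoint in `checkpoint_dir`.

```python
import tensorflow.compat.v1 as tf

tf.disable_eager_execution()  # use graph mode, as in TensorFlow 1.x


def run_training(checkpoint_dir, num_steps):
    """Build a fresh graph, resume from `checkpoint_dir` if possible, train.

    Hypothetical toy example: a real model would build its (possibly
    multi-GPU) graph here, using whatever devices are available now.
    """
    graph = tf.Graph()
    with graph.as_default():
        global_step = tf.train.get_or_create_global_step()
        w = tf.get_variable("w", initializer=tf.constant(0.0))
        loss = tf.square(w - 1.0)
        train_op = tf.train.GradientDescentOptimizer(0.1).minimize(
            loss, global_step=global_step)

        config = tf.ConfigProto(allow_soft_placement=True)
        # If checkpoint_dir already holds a checkpoint, its variable values
        # are restored automatically; otherwise variables are initialized.
        # A final checkpoint is written when the session closes.
        with tf.train.MonitoredTrainingSession(
                checkpoint_dir=checkpoint_dir, config=config) as sess:
            for _ in range(num_steps):
                sess.run(train_op)
            last_step = sess.run(global_step)
    return int(last_step)


# Usage: the second call continues from the step reached by the first.
# run_training("/tmp/toy_ckpt", 3)
# run_training("/tmp/toy_ckpt", 2)
```

Because only variable names and values have to match between runs, the second run may use a different number of GPUs than the first, as long as the variables it creates have the same names and shapes.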