Epsilon-Lee closed this issue 6 years ago.
According to the error message, I think you should set `allow_soft_placement=True` when creating the `tf.Session`.
Thanks for your quick response; your solution quickly resolved my problem.
If you have time to answer, I have some more questions, especially regarding parameter reloading :)
Is `train_op` resource-aware? Or should the following code always be re-executed to ensure re-allocation of GPU resources?
```python
# In trainer.py, to re-allocate computation to a newly given GPU resource
sharded_losses = parallel.parallel_model(
    model.get_training_func(initializer),
    features,
    params.device_list
)
```
Many thanks indeed for your patience!
One option is `MonitoredSession`. THUMT chose to use an `assign_op` because we need to support model ensembling during inference. The parallel model is rebuilt each time `trainer.py` is executed.
I used the method here by LingjiaDeng to restore the checkpoint in the
train/eval
folder. The code is exactly as below; I ran it in IPython. The error is as follows:
What is this error? Is there a more elegant way to resume training from pre-trained model parameters?
I found that if I use a single GPU to train the model, so that there is no data parallelism in parallel_model, the checkpoint can be successfully reloaded in the way above. Is that a problem?
Thanks very much.