Closed mandar2812 closed 6 years ago
Just for reference, this was working until the snapshots of Nov 20, 2017.
Also, I am not noticing this issue with the PTB example, only with MNIST and CIFAR.
Is that while trying to use previously created checkpoints? If so, could you try clearing up the checkpoints directory and running again? After it starts saving new checkpoints could you try stopping and rerunning? It may be that old checkpoints are not compatible with the current model because I made some edits to the optimizers.
Yes, because you changed the optimisation class 😊. That's why I cleared the checkpoint cache and ran the CIFAR and MNIST experiments twice; the second time, they fail to load the checkpoint information. I will double-check this to be sure.
I am running this code within the DynaML REPL, which is itself a customised form of the Ammonite REPL. The problem occurs only when I copy-paste the MNIST or CIFAR scripts into the REPL twice in the same session. If I exit and restart the REPL, the models can resume from the checkpoints.
I think this might be some namespace issue in the REPL, but I am not sure whether it is a REPL problem or something in tf_scala.
This is not a big problem because there is a workaround, and in production environments I would likely run these scripts from the command line rather than in REPL mode. Still, prototyping would be less painful if restoration worked seamlessly, without having to restart the REPL session.
So I tried it once more and there is some weird behaviour:
The CIFAR example was trained until epoch 3000, with checkpoints every 100 epochs.
The second time I ran it, it restored the checkpoint and kept training the network.
2017-11-24 11:15:52.499 [main] INFO Learn / Hooks / TensorBoard - Launching TensorBoard in '127.0.0.1:8080' for log directory '/Users/mandar/tmp/cifar_summaries'.
2017-11-24 11:15:52.605 [main] INFO Variables / Saver - Restoring parameters from '/Users/mandar/tmp/cifar_summaries/model.ckpt-3000'.
2017-11-24 11:16:00.318 [main] INFO Learn / Hooks / Checkpoint Saver - Saving checkpoint for step 3001.
2017-11-24 11:16:00.318 [main] INFO Variables / Saver - Saving parameters to '/Users/mandar/tmp/cifar_summaries/model.ckpt'.
2017-11-24 11:16:24.280 [main] INFO Learn / Hooks / Checkpoint Saver - Saving checkpoint for step 3101.
2017-11-24 11:16:24.280 [main] INFO Variables / Saver - Saving parameters to '/Users/mandar/tmp/cifar_summaries/model.ckpt'.
2017-11-24 11:16:49.156 [main] INFO Learn / Hooks / Checkpoint Saver - Saving checkpoint for step 3201.
2017-11-24 11:16:49.156 [main] INFO Variables / Saver - Saving parameters to '/Users/mandar/tmp/cifar_summaries/model.ckpt'.
2017-11-24 11:17:14.843 [main] INFO Learn / Hooks / Checkpoint Saver - Saving checkpoint for step 3301.
2017-11-24 11:17:14.843 [main] INFO Variables / Saver - Saving parameters to '/Users/mandar/tmp/cifar_summaries/model.ckpt'.
2017-11-24 11:17:39.113 [main] INFO Learn / Hooks / Checkpoint Saver - Saving checkpoint for step 3401.
2017-11-24 11:17:39.113 [main] INFO Variables / Saver - Saving parameters to '/Users/mandar/tmp/cifar_summaries/model.ckpt'.
2017-11-24 11:18:05.884 [main] INFO Learn / Hooks / Checkpoint Saver - Saving checkpoint for step 3501.
2017-11-24 11:18:05.884 [main] INFO Variables / Saver - Saving parameters to '/Users/mandar/tmp/cifar_summaries/model.ckpt'.
2017-11-24 11:18:06.027 [main] INFO Learn / Hooks / Termination - Stop requested: Exceeded maximum number of steps.
2017-11-24 11:18:06.550 [main] INFO Learn / Hooks / Termination - Stop requested: Exceeded maximum number of steps.
Train accuracy = 0.46974F
Test accuracy = 0.4113F
I ran it again and it errored:
2017-11-24 11:30:54.532 [main] INFO Learn / Hooks / TensorBoard - Launching TensorBoard in '127.0.0.1:8080' for log directory '/Users/mandar/tmp/cifar_summaries'.
2017-11-24 11:30:54.554 [main] INFO Variables / Saver - Restoring parameters from '/Users/mandar/tmp/cifar_summaries/model.ckpt-3501'.
2017-11-24 11:30:54.580315: W tensorflow/core/framework/op_kernel.cc:1194] Not found: Key Layer_2/Layer_2_2/Weights not found in checkpoint
2017-11-24 11:30:54.580543: W tensorflow/core/framework/op_kernel.cc:1194] Not found: Key Conv2D_1/Conv2D_1_1/Weights not found in checkpoint
2017-11-24 11:30:54.581232: W tensorflow/core/framework/op_kernel.cc:1194] Not found: Key Bias_0/Bias_0_1/Bias not found in checkpoint
2017-11-24 11:30:54.583619: W tensorflow/core/framework/op_kernel.cc:1194] Not found: Key Conv2D_0/Conv2D_0_1/Weights not found in checkpoint
2017-11-24 11:30:54.583806: W tensorflow/core/framework/op_kernel.cc:1194] Not found: Key AdaGrad/Bias_1/Bias_1_1/Bias/AdaGrad not found in checkpoint
2017-11-24 11:30:54.584598: W tensorflow/core/framework/op_kernel.cc:1194] Not found: Key OutputLayer/OutputLayer_2/Weights not found in checkpoint
2017-11-24 11:30:54.585492: W tensorflow/core/framework/op_kernel.cc:1194] Not found: Key AdaGrad/Layer_2/Layer_2_2/Bias/AdaGrad not found in checkpoint
2017-11-24 11:30:54.586125: W tensorflow/core/framework/op_kernel.cc:1194] Not found: Key OutputLayer/OutputLayer_2/Bias not found in checkpoint
2017-11-24 11:30:54.587363: W tensorflow/core/framework/op_kernel.cc:1194] Not found: Key AdaGrad/Conv2D_1/Conv2D_1_1/Weights/AdaGrad not found in checkpoint
2017-11-24 11:30:54.588257: W tensorflow/core/framework/op_kernel.cc:1194] Not found: Key AdaGrad/Conv2D_0/Conv2D_0_1/Weights/AdaGrad not found in checkpoint
2017-11-24 11:30:54.588653: W tensorflow/core/framework/op_kernel.cc:1194] Not found: Key Layer_2/Layer_2_2/Bias not found in checkpoint
2017-11-24 11:30:54.591538: W tensorflow/core/framework/op_kernel.cc:1194] Not found: Key AdaGrad/OutputLayer/OutputLayer_2/Bias/AdaGrad not found in checkpoint
2017-11-24 11:30:54.591566: W tensorflow/core/framework/op_kernel.cc:1194] Not found: Key AdaGrad/Bias_0/Bias_0_1/Bias/AdaGrad not found in checkpoint
2017-11-24 11:30:54.593039: W tensorflow/core/framework/op_kernel.cc:1194] Not found: Key AdaGrad/Layer_2/Layer_2_2/Weights/AdaGrad not found in checkpoint
2017-11-24 11:30:54.593741: W tensorflow/core/framework/op_kernel.cc:1194] Not found: Key AdaGrad/OutputLayer/OutputLayer_2/Weights/AdaGrad not found in checkpoint
2017-11-24 11:30:54.593754: W tensorflow/core/framework/op_kernel.cc:1194] Not found: Key Bias_1/Bias_1_1/Bias not found in checkpoint
@mandar2812 Oh I see. This is actually expected. By copy-pasting the script twice you are adding ops to the same graph, and when a name is re-used the ops are automatically assigned unique names with IDs appended to them. The names in the graph therefore no longer match the names in the saved checkpoint. You can try wrapping the graph-creation code in a tf.createWith(graph = Graph()) { ... } scope, so that you always create the ops within a new graph. Could you please try this and let me know what happens? :)
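To make the naming behaviour concrete, here is a minimal self-contained sketch of the mechanism described above. Note that Graph, addOp, and buildModel below are illustrative stand-ins, not the tensorflow_scala API: the point is only that each graph keeps its own name counters, so re-running the model-building code in the same graph yields suffixed names (which no longer match the checkpoint keys), while a fresh graph reproduces the original names.

```scala
import scala.collection.mutable

// Illustrative sketch only, not tensorflow_scala internals: a graph keeps a
// per-name counter, and a re-used base name gets a numeric suffix appended.
final class Graph {
  private val counts = mutable.Map.empty[String, Int]

  // Returns `base` the first time it is requested, and `base_n` afterwards.
  def addOp(base: String): String = {
    val n = counts.getOrElse(base, 0)
    counts(base) = n + 1
    if (n == 0) base else s"${base}_$n"
  }
}

// Stand-in for pasting the model-building script into the REPL.
def buildModel(g: Graph): Seq[String] =
  Seq("Conv2D_0", "Layer_2", "OutputLayer").map(g.addOp)

val sameGraph = new Graph
val firstRun  = buildModel(sameGraph)   // names as saved in the checkpoint
val secondRun = buildModel(sameGraph)   // same graph: suffixed names
val freshRun  = buildModel(new Graph)   // fresh graph: original names again
```

Here secondRun produces names like "Layer_2_1" that do not exist in the checkpoint, which is exactly the "Key ... not found in checkpoint" failure, while freshRun matches firstRun, which is why the tf.createWith(graph = Graph()) { ... } workaround succeeds.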
@mandar2812 Did this work out?
@eaplatanios I will try this out soon and update this issue.
@eaplatanios After running the script once, if I want to run it again in the REPL, then this works:
tf.createWith(graph = Graph()) {
  val model = tf.learn.Model(input, layer, trainInput, trainingInputLayer, loss, optimizer)

  println("Training the linear regression model.")

  val summariesDir = java.nio.file.Paths.get((tempdir/"cifar_summaries").toString())

  val estimator = tf.learn.InMemoryEstimator(
    model,
    tf.learn.Configuration(Some(summariesDir)),
    tf.learn.StopCriteria(maxSteps = Some(100000)),
    Set(
      tf.learn.StepRateLogger(log = false, summaryDir = summariesDir, trigger = tf.learn.StepHookTrigger(100)),
      tf.learn.SummarySaver(summariesDir, tf.learn.StepHookTrigger(100)),
      tf.learn.CheckpointSaver(summariesDir, tf.learn.StepHookTrigger(100))),
    tensorBoardConfig = tf.learn.TensorBoardConfig(summariesDir, reloadInterval = 100))

  estimator.train(() => trainData, tf.learn.StopCriteria(maxSteps = Some(500)))
}
2017-11-29 18:24:59.058054: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1152] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: TITAN X (Pascal), pci bus id: 0000:03:00.0, compute capability: 6.1)
Training the linear regression model.
Thanks for the tip!
@mandar2812 No problem! I'm glad this helped. :)
Generally, even though I try to make this as transparent as possible in my API, it's good to be aware of which graph newly created ops belong to and what their names are.
It seems that after some recent updates, resuming training on MNIST/CIFAR no longer works: tensorflow_scala is not able to restore a previously trained model and continue training it.