eaplatanios / tensorflow_scala

TensorFlow API for the Scala Programming Language
http://platanios.org/tensorflow_scala/
Apache License 2.0

Errors when resuming training from saved checkpoint #56

Closed mandar2812 closed 6 years ago

mandar2812 commented 6 years ago

It seems that after some recent updates, resuming training on MNIST/CIFAR no longer works: tensorflow_scala is unable to restore a previously trained model from its checkpoint and continue training it.

2017-11-23 16:01:59.018850: W tensorflow/core/framework/op_kernel.cc:1194] Not found: Key Bias_0/Bias_0_2/Bias not found in checkpoint
2017-11-23 16:01:59.018947: W tensorflow/core/framework/op_kernel.cc:1194] Not found: Key AdaGrad/Layer_2/Layer_2_2/Bias/AdaGrad not found in checkpoint
2017-11-23 16:01:59.018893: W tensorflow/core/framework/op_kernel.cc:1194] Not found: Key AdaGrad/Bias_1/Bias_1_2/Bias/AdaGrad not found in checkpoint
2017-11-23 16:01:59.019198: W tensorflow/core/framework/op_kernel.cc:1194] Not found: Key Conv2D_1/Conv2D_1_2/Weights not found in checkpoint
2017-11-23 16:01:59.019969: W tensorflow/core/framework/op_kernel.cc:1194] Not found: Key AdaGrad/OutputLayer/OutputLayer_2/Bias/AdaGrad not found in checkpoint
2017-11-23 16:01:59.020664: W tensorflow/core/framework/op_kernel.cc:1194] Not found: Key AdaGrad/Layer_2/Layer_2_2/Weights/AdaGrad not found in checkpoint
2017-11-23 16:01:59.020666: W tensorflow/core/framework/op_kernel.cc:1194] Not found: Key AdaGrad/OutputLayer/OutputLayer_2/Weights/AdaGrad not found in checkpoint
2017-11-23 16:01:59.020666: W tensorflow/core/framework/op_kernel.cc:1194] Not found: Key Conv2D_0/Conv2D_0_2/Weights not found in checkpoint
2017-11-23 16:01:59.021133: W tensorflow/core/framework/op_kernel.cc:1194] Not found: Key Layer_2/Layer_2_2/Bias not found in checkpoint
2017-11-23 16:01:59.022401: W tensorflow/core/framework/op_kernel.cc:1194] Not found: Key OutputLayer/OutputLayer_2/Bias not found in checkpoint
2017-11-23 16:01:59.022537: W tensorflow/core/framework/op_kernel.cc:1194] Not found: Key AdaGrad/Conv2D_0/Conv2D_0_2/Weights/AdaGrad not found in checkpoint
2017-11-23 16:01:59.024300: W tensorflow/core/framework/op_kernel.cc:1194] Not found: Key AdaGrad/Conv2D_1/Conv2D_1_2/Weights/AdaGrad not found in checkpoint
2017-11-23 16:01:59.028290: W tensorflow/core/framework/op_kernel.cc:1194] Not found: Key Layer_2/Layer_2_2/Weights not found in checkpoint
2017-11-23 16:01:59.028339: W tensorflow/core/framework/op_kernel.cc:1194] Not found: Key OutputLayer/OutputLayer_2/Weights not found in checkpoint
2017-11-23 16:01:59.029101: W tensorflow/core/framework/op_kernel.cc:1194] Not found: Key Bias_1/Bias_1_2/Bias not found in checkpoint
2017-11-23 16:01:59.029148: W tensorflow/core/framework/op_kernel.cc:1194] Not found: Key AdaGrad/Bias_0/Bias_0_2/Bias/AdaGrad not found in checkpoint
org.platanios.tensorflow.jni.NotFoundException: Key Layer_2/Layer_2_5/Bias not found in checkpoint
     [[Node: Saver/Restore_1 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_Saver/Constant_0_0, Saver/Constant_9, Saver/Constant_10)]]
  org.platanios.tensorflow.jni.Session$.run(Native Method)
  org.platanios.tensorflow.api.core.client.Session.runHelper(Session.scala:137)
  org.platanios.tensorflow.api.core.client.Session.run(Session.scala:76)
  org.platanios.tensorflow.api.ops.variables.Saver.restore(Saver.scala:227)
  org.platanios.tensorflow.api.learn.SessionManager.restoreCheckpoint(SessionManager.scala:297)
  org.platanios.tensorflow.api.learn.SessionManager.prepareSession(SessionManager.scala:135)
  org.platanios.tensorflow.api.learn.ChiefSessionCreator.createSession(SessionCreator.scala:92)
  org.platanios.tensorflow.api.learn.HookedSessionCreator.createSession(SessionCreator.scala:166)
  org.platanios.tensorflow.api.learn.RecoverableSession$$anonfun$createSession$1.apply$mcV$sp(SessionWrapper.scala:304)
  org.platanios.tensorflow.api.learn.RecoverableSession$$anonfun$createSession$1.apply(SessionWrapper.scala:304)
  org.platanios.tensorflow.api.learn.RecoverableSession$$anonfun$createSession$1.apply(SessionWrapper.scala:304)
  scala.util.control.Exception$Catch.apply(Exception.scala:103)
  org.platanios.tensorflow.api.learn.RecoverableSession$.createSession(SessionWrapper.scala:303)
  org.platanios.tensorflow.api.learn.RecoverableSession.<init>(SessionWrapper.scala:242)
  org.platanios.tensorflow.api.learn.RecoverableSession$.apply(SessionWrapper.scala:241)
  org.platanios.tensorflow.api.learn.MonitoredSession$.apply(SessionWrapper.scala:417)
  org.platanios.tensorflow.api.learn.estimators.Estimator$.monitoredTrainingSession(Estimator.scala:337)
  org.platanios.tensorflow.api.learn.estimators.InMemoryEstimator$$anonfun$5.apply(InMemoryEstimator.scala:129)
  org.platanios.tensorflow.api.learn.estimators.InMemoryEstimator$$anonfun$5.apply(InMemoryEstimator.scala:122)
  scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
  org.platanios.tensorflow.api.ops.Op$.createWith(Op.scala:844)
  org.platanios.tensorflow.api.learn.estimators.InMemoryEstimator.<init>(InMemoryEstimator.scala:122)
  org.platanios.tensorflow.api.learn.estimators.InMemoryEstimator$.apply(InMemoryEstimator.scala:360)
  ammonite.$sess.cmd1$.<init>(cmd1.sc:34)
  ammonite.$sess.cmd1$.<clinit>(cmd1.sc)
mandar2812 commented 6 years ago

Just for reference, this was working up to the snapshots of Nov 20, 2017.

mandar2812 commented 6 years ago

Also, I am not seeing this issue with the PTB example, only with MNIST and CIFAR.

eaplatanios commented 6 years ago

Is that while trying to use previously created checkpoints? If so, could you try clearing out the checkpoints directory and running again? After it starts saving new checkpoints, could you try stopping and rerunning? It may be that the old checkpoints are not compatible with the current model, because I made some edits to the optimizers.

mandar2812 commented 6 years ago

Yes, because you changed the optimisation class 😊. That's why I cleared the checkpoint cache and ran the CIFAR and MNIST experiments twice; the second time, they fail to load the checkpoint information. I will double-check this to be sure.

mandar2812 commented 6 years ago

Context

I am running this code within the DynaML REPL, which is itself a customised form of the Ammonite REPL. The problem occurs only when I copy-paste the MNIST or CIFAR scripts into the REPL twice in the same session. If I exit and restart the REPL, the models can resume from the checkpoints.

I think this might be some issue with namespacing in the REPL, but now I am not sure if it's a REPL issue or something in tf_scala.

This is not a big problem because there is a workaround, and in production environments I would be running these scripts from the command line rather than in REPL mode. Still, it would make prototyping less painful if restoration worked seamlessly without having to restart the REPL session.

Update

So I tried it once more and observed some strange behaviour:

  1. The CIFAR example was trained up to step 3000, with a checkpoint saved every 100 steps.

  2. The second time I ran it, it restored and kept training the network.

    2017-11-24 11:15:52.499 [main] INFO  Learn / Hooks / TensorBoard - Launching TensorBoard in '127.0.0.1:8080' for log directory '/Users/mandar/tmp/cifar_summaries'.
    2017-11-24 11:15:52.605 [main] INFO  Variables / Saver - Restoring parameters from '/Users/mandar/tmp/cifar_summaries/model.ckpt-3000'.
    2017-11-24 11:16:00.318 [main] INFO  Learn / Hooks / Checkpoint Saver - Saving checkpoint for step 3001.
    2017-11-24 11:16:00.318 [main] INFO  Variables / Saver - Saving parameters to '/Users/mandar/tmp/cifar_summaries/model.ckpt'.
    2017-11-24 11:16:24.280 [main] INFO  Learn / Hooks / Checkpoint Saver - Saving checkpoint for step 3101.
    2017-11-24 11:16:24.280 [main] INFO  Variables / Saver - Saving parameters to '/Users/mandar/tmp/cifar_summaries/model.ckpt'.
    2017-11-24 11:16:49.156 [main] INFO  Learn / Hooks / Checkpoint Saver - Saving checkpoint for step 3201.
    2017-11-24 11:16:49.156 [main] INFO  Variables / Saver - Saving parameters to '/Users/mandar/tmp/cifar_summaries/model.ckpt'.
    2017-11-24 11:17:14.843 [main] INFO  Learn / Hooks / Checkpoint Saver - Saving checkpoint for step 3301.
    2017-11-24 11:17:14.843 [main] INFO  Variables / Saver - Saving parameters to '/Users/mandar/tmp/cifar_summaries/model.ckpt'.
    2017-11-24 11:17:39.113 [main] INFO  Learn / Hooks / Checkpoint Saver - Saving checkpoint for step 3401.
    2017-11-24 11:17:39.113 [main] INFO  Variables / Saver - Saving parameters to '/Users/mandar/tmp/cifar_summaries/model.ckpt'.
    2017-11-24 11:18:05.884 [main] INFO  Learn / Hooks / Checkpoint Saver - Saving checkpoint for step 3501.
    2017-11-24 11:18:05.884 [main] INFO  Variables / Saver - Saving parameters to '/Users/mandar/tmp/cifar_summaries/model.ckpt'.
    2017-11-24 11:18:06.027 [main] INFO  Learn / Hooks / Termination - Stop requested: Exceeded maximum number of steps.
    2017-11-24 11:18:06.550 [main] INFO  Learn / Hooks / Termination - Stop requested: Exceeded maximum number of steps.
    Train accuracy = 0.46974F
    Test accuracy = 0.4113F
    
  3. I ran it again and it errored:

2017-11-24 11:30:54.532 [main] INFO  Learn / Hooks / TensorBoard - Launching TensorBoard in '127.0.0.1:8080' for log directory '/Users/mandar/tmp/cifar_summaries'.
2017-11-24 11:30:54.554 [main] INFO  Variables / Saver - Restoring parameters from '/Users/mandar/tmp/cifar_summaries/model.ckpt-3501'.
2017-11-24 11:30:54.580315: W tensorflow/core/framework/op_kernel.cc:1194] Not found: Key Layer_2/Layer_2_2/Weights not found in checkpoint
2017-11-24 11:30:54.580543: W tensorflow/core/framework/op_kernel.cc:1194] Not found: Key Conv2D_1/Conv2D_1_1/Weights not found in checkpoint
2017-11-24 11:30:54.581232: W tensorflow/core/framework/op_kernel.cc:1194] Not found: Key Bias_0/Bias_0_1/Bias not found in checkpoint
2017-11-24 11:30:54.583619: W tensorflow/core/framework/op_kernel.cc:1194] Not found: Key Conv2D_0/Conv2D_0_1/Weights not found in checkpoint
2017-11-24 11:30:54.583806: W tensorflow/core/framework/op_kernel.cc:1194] Not found: Key AdaGrad/Bias_1/Bias_1_1/Bias/AdaGrad not found in checkpoint
2017-11-24 11:30:54.584598: W tensorflow/core/framework/op_kernel.cc:1194] Not found: Key OutputLayer/OutputLayer_2/Weights not found in checkpoint
2017-11-24 11:30:54.585492: W tensorflow/core/framework/op_kernel.cc:1194] Not found: Key AdaGrad/Layer_2/Layer_2_2/Bias/AdaGrad not found in checkpoint
2017-11-24 11:30:54.586125: W tensorflow/core/framework/op_kernel.cc:1194] Not found: Key OutputLayer/OutputLayer_2/Bias not found in checkpoint
2017-11-24 11:30:54.587363: W tensorflow/core/framework/op_kernel.cc:1194] Not found: Key AdaGrad/Conv2D_1/Conv2D_1_1/Weights/AdaGrad not found in checkpoint
2017-11-24 11:30:54.588257: W tensorflow/core/framework/op_kernel.cc:1194] Not found: Key AdaGrad/Conv2D_0/Conv2D_0_1/Weights/AdaGrad not found in checkpoint
2017-11-24 11:30:54.588653: W tensorflow/core/framework/op_kernel.cc:1194] Not found: Key Layer_2/Layer_2_2/Bias not found in checkpoint
2017-11-24 11:30:54.591538: W tensorflow/core/framework/op_kernel.cc:1194] Not found: Key AdaGrad/OutputLayer/OutputLayer_2/Bias/AdaGrad not found in checkpoint
2017-11-24 11:30:54.591566: W tensorflow/core/framework/op_kernel.cc:1194] Not found: Key AdaGrad/Bias_0/Bias_0_1/Bias/AdaGrad not found in checkpoint
2017-11-24 11:30:54.593039: W tensorflow/core/framework/op_kernel.cc:1194] Not found: Key AdaGrad/Layer_2/Layer_2_2/Weights/AdaGrad not found in checkpoint
2017-11-24 11:30:54.593741: W tensorflow/core/framework/op_kernel.cc:1194] Not found: Key AdaGrad/OutputLayer/OutputLayer_2/Weights/AdaGrad not found in checkpoint
2017-11-24 11:30:54.593754: W tensorflow/core/framework/op_kernel.cc:1194] Not found: Key Bias_1/Bias_1_1/Bias not found in checkpoint
eaplatanios commented 6 years ago

@mandar2812 Oh I see. This is actually expected. By pasting the script twice you are adding ops to the same graph, and when names are reused they are automatically made unique by appending IDs. Therefore, the names no longer match those in the saved checkpoint. You can try wrapping the graph-creation code in a tf.createWith(graph = Graph()) { ... } scope so that you always create it within a new graph. Could you please try this and let me know what happens? :)
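
For illustration, a minimal sketch of that suggestion (it assumes the usual import org.platanios.tensorflow.api._, which brings tf and Graph into scope; the block body is a placeholder for the MNIST/CIFAR script, not the exact example code):

import org.platanios.tensorflow.api._

// Wrap all graph construction in a fresh graph, so that pasting the script
// again in the same REPL session does not append uniquified scope names
// (e.g. "Layer_2_5") that no longer match the keys stored in the checkpoint.
tf.createWith(graph = Graph()) {
  // Build the input pipeline, layers, model and estimator here, then call
  // estimator.train(...). Because this block starts from an empty graph each
  // time it is run, the variable names line up with the saved checkpoint.
}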

eaplatanios commented 6 years ago

@mandar2812 Did this work out?

mandar2812 commented 6 years ago

@eaplatanios I will try this out soon and update this issue.

mandar2812 commented 6 years ago

@eaplatanios After running the script once, if I want to run it again in the REPL, the following works:

tf.createWith(graph = Graph()) {
  val model = tf.learn.Model(input, layer, trainInput, trainingInputLayer, loss, optimizer)

  println("Training the linear regression model.")
  val summariesDir = java.nio.file.Paths.get((tempdir/"cifar_summaries").toString())
  val estimator = tf.learn.InMemoryEstimator(
    model,
    tf.learn.Configuration(Some(summariesDir)),
    tf.learn.StopCriteria(maxSteps = Some(100000)),
    Set(
      tf.learn.StepRateLogger(log = false, summaryDir = summariesDir, trigger = tf.learn.StepHookTrigger(100)),
      tf.learn.SummarySaver(summariesDir, tf.learn.StepHookTrigger(100)),
      tf.learn.CheckpointSaver(summariesDir, tf.learn.StepHookTrigger(100))),
    tensorBoardConfig = tf.learn.TensorBoardConfig(summariesDir, reloadInterval = 100))
  estimator.train(() => trainData, tf.learn.StopCriteria(maxSteps = Some(500)))
}
2017-11-29 18:24:59.058054: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1152] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: TITAN X (Pascal), pci bus id: 0000:03:00.0, compute capability: 6.1)
Training the linear regression model.

Thanks for the tip!

eaplatanios commented 6 years ago

@mandar2812 No problem! I'm glad this helped. :)

Generally, even though I try to make this as transparent as possible in my API, it's good to be aware of which graph the ops you create belong to and what their names are.