diff --git a/deeplab/train.py b/deeplab/train.py
index 1a2576c..15e3c8b 100644
--- a/deeplab/train.py
+++ b/deeplab/train.py
@@ -385,12 +385,7 @@ def main(unused_argv):
is_chief=(FLAGS.task == 0),
session_config=session_config,
startup_delay_steps=startup_delay_steps,
- init_fn=train_utils.get_model_init_fn(
- FLAGS.train_logdir,
- FLAGS.tf_initial_checkpoint,
- FLAGS.initialize_last_layer,
- last_layers,
- ignore_missing_vars=True),
+ init_fn=None,
summary_op=summary_op,
save_summaries_secs=FLAGS.save_summaries_secs,
save_interval_secs=FLAGS.save_interval_secs)
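For context, the `init_fn` being replaced here is normally the checkpoint-restoring callable returned by `train_utils.get_model_init_fn`. The sketch below only approximates that behaviour (the real helper also returns `None` when `train_logdir` already contains a checkpoint); the name `make_init_fn` and its arguments are illustrative, not the actual DeepLab code.

```python
import tensorflow as tf
slim = tf.contrib.slim

def make_init_fn(initial_checkpoint, last_layers, initialize_last_layer):
    # Restore all variables from the initial checkpoint, optionally excluding
    # the last layers (whose shapes change with the number of classes).
    exclude = [] if initialize_last_layer else last_layers
    variables_to_restore = slim.get_variables_to_restore(exclude=exclude)
    # Variables missing from the checkpoint are skipped instead of raising.
    return slim.assign_from_checkpoint_fn(
        initial_checkpoint, variables_to_restore, ignore_missing_vars=True)
```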
With init_fn set to None, the model runs without error:
(new) ➜ deeplab git:(master) ✗ sh test/train.sh
current path is /home/yzbx/git/deeplab
/home/yzbx/bin/miniconda3/envs/new/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
from ._conv import register_converters as _register_converters
INFO:tensorflow:Training on train set
first clone label name is: label:0
WARNING:tensorflow:From /home/yzbx/bin/miniconda3/envs/new/lib/python3.6/site-packages/tensorflow/contrib/slim/python/slim/learning.py:736: Supervisor.__init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
2018-08-21 15:13:17.575832: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2018-08-21 15:13:17.752743: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 0 with properties:
name: TITAN X (Pascal) major: 6 minor: 1 memoryClockRate(GHz): 1.531
pciBusID: 0000:0a:00.0
totalMemory: 11.90GiB freeMemory: 11.75GiB
2018-08-21 15:13:17.752789: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1435] Adding visible gpu devices: 0
2018-08-21 15:13:17.993605: I tensorflow/core/common_runtime/gpu/gpu_device.cc:923] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-08-21 15:13:17.993641: I tensorflow/core/common_runtime/gpu/gpu_device.cc:929] 0
2018-08-21 15:13:17.993651: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 0: N
2018-08-21 15:13:17.993918: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 11376 MB memory) -> physical GPU (device: 0, name: TITAN X (Pascal), pci bus id: 0000:0a:00.0, compute capability: 6.1)
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Starting Session.
INFO:tensorflow:Saving checkpoint to path /home/yzbx/tmp/logs/tensorflow/deeplab/cityscapes/xception_65/2018-08-21__15-13-01/model.ckpt
INFO:tensorflow:Starting Queues.
INFO:tensorflow:global_step/sec: 0
2018-08-21 15:13:35.167063: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.65GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2018-08-21 15:13:35.167138: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.65GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
INFO:tensorflow:Recording summary at step 0.
INFO:tensorflow:global step 10: loss = 7.7412 (0.978 sec/step)
INFO:tensorflow:Stopping Training.
INFO:tensorflow:Finished training! Saving model to disk.
Dump of the model variables:
<tf.Variable 'logits/semantic/weights:0' shape=(1, 1, 256, 19) dtype=float32_ref>
<tf.Variable 'logits/semantic/biases:0' shape=(19,) dtype=float32_ref>
<tf.Variable 'logits/semantic/weights/Momentum:0' shape=(1, 1, 256, 19) dtype=float32_ref>
<tf.Variable 'logits/semantic/biases/Momentum:0' shape=(19,) dtype=float32_ref>
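The `weights/Momentum` and `biases/Momentum` entries in this dump are slot variables created by the momentum optimizer itself, not part of the pretrained checkpoint. A minimal, standalone sketch (not DeepLab code; the variable name is simplified) of how they appear:

```python
import tensorflow as tf

# A stand-in for one of the logits variables above.
w = tf.get_variable('weights', shape=[1, 1, 256, 19])
loss = tf.reduce_sum(tf.square(w))

opt = tf.train.MomentumOptimizer(learning_rate=0.007, momentum=0.9)
train_op = opt.minimize(loss)  # this call creates the '<var>/Momentum' slots

for var in tf.global_variables():
    print(var.name)
# weights:0
# weights/Momentum:0
```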
git diff deeplab/train.py
- # Start the training.
- slim.learning.train(
- train_tensor,
- logdir=FLAGS.train_logdir,
- log_every_n_steps=FLAGS.log_steps,
- master=FLAGS.master,
- number_of_steps=FLAGS.training_number_of_steps,
- is_chief=(FLAGS.task == 0),
- session_config=session_config,
- startup_delay_steps=startup_delay_steps,
- init_fn=train_utils.get_model_init_fn(
- FLAGS.train_logdir,
- FLAGS.tf_initial_checkpoint,
- FLAGS.initialize_last_layer,
- last_layers,
- ignore_missing_vars=True),
- summary_op=summary_op,
- save_summaries_secs=FLAGS.save_summaries_secs,
- save_interval_secs=FLAGS.save_interval_secs)
-
+
+ for var in tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES):
+ if var.name.find('logits')>=0:
+ print(var)
<tf.Variable 'logits/semantic/weights:0' shape=(1, 1, 256, 19) dtype=float32_ref>
<tf.Variable 'logits/semantic/biases:0' shape=(19,) dtype=float32_ref>
<tf.Variable 'logits/semantic/weights/Momentum:0' shape=(1, 1, 256, 19) dtype=float32_ref>
<tf.Variable 'logits/semantic/biases/Momentum:0' shape=(19,) dtype=float32_ref>
The problem is that you call model.optimizer.minimize too late. This method creates additional tensors in your graph, so calling it inside a loop is a bad idea; it behaves like a memory leak. Also, in the case of stateful optimizers (such as AdamOptimizer), minimize creates additional variables. That is why you get the exception you described: your initializer runs before those variables are created. The solution is to move the call to model.optimizer.minimize into the model class itself and store its result in a model attribute. So your problem does not relate to this issue.
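To make the ordering point concrete, here is a generic TF 1.x sketch (not DeepLab-specific): build the minimize op once, then create the initializer, so the optimizer's slot variables are covered.

```python
import tensorflow as tf

x = tf.get_variable('x', shape=[], initializer=tf.zeros_initializer())
loss = tf.square(x - 3.0)

opt = tf.train.MomentumOptimizer(learning_rate=0.1, momentum=0.9)
train_op = opt.minimize(loss)  # creates the 'x/Momentum' slot variable

# Created *after* minimize(), so it also initializes the slot variable.
# Creating it before minimize() leaves 'x/Momentum' uninitialized, and
# running train_op then fails with FailedPreconditionError.
init_op = tf.global_variables_initializer()

with tf.Session() as sess:
    sess.run(init_op)
    for _ in range(20):
        sess.run(train_op)
    print(sess.run(x))  # moves toward 3.0
```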