Training hangs when training a GAN with multiple GPUs

alexlee-gk commented 6 years ago

I use tensorflow 1.11.0, CUDA 9.0.176 and cuDNN 7.3.1 on Ubuntu 16.04. My GPUs are nvidia titan Xp. When I was trying to train a savp model with command

CUDA_VISIBLE_DEVICES=0,1 python scripts/train.py --input_dir data/bair --dataset bair \
  --model savp --model_hparams_dict hparams/bair_action_free/ours_savp/model_hparams.json \
  --output_dir logs/bair_action_free/ours_savp \
  --gpu_mem_frac 0.7

the program seems to stop runing and never move forward like in an endless loop, after outputing

2018-11-08 08:14:24.668709: E tensorflow/core/grappler/optimizers/dependency_optimizer.cc:666] Iteration = 0, topological sort failed with message: The graph couldn't be sorted in topological order.
2018-11-08 08:14:25.028003: E tensorflow/core/grappler/optimizers/dependency_optimizer.cc:666] Iteration = 1, topological sort failed with message: The graph couldn't be sorted in topological order.

What's the possible reason for this? More information is below

2018-11-08 08:13:09.286148: E tensorflow/core/grappler/optimizers/dependency_optimizer.cc:666] Iteration = 0, topological sort failed with message: The graph couldn't be sorted in topological order.
2018-11-08 08:13:09.515356: E tensorflow/core/grappler/optimizers/dependency_optimizer.cc:666] Iteration = 1, topological sort failed with message: The graph couldn't be sorted in topological order.
session.run took 22.4s
recording summary
done
recording image summary
done
progress  global step 0  epoch 0  step 2560
discrim_video_sn_gan_loss (1.0238764, 1.0)
discrim_video_sn_vae_gan_loss (0.895689, 1.0)
gen_l1_loss (0.0804427, 100.0)
gen_video_sn_gan_loss (1.0158763, 1.0)
gen_video_sn_vae_gan_loss (0.8958128, 1.0)
gen_kl_loss (0.045274347, 0.0)
learning_rate 0.0002
saving model to logs/bair_action_free/ours_savp
done
2018-11-08 08:14:24.668709: E tensorflow/core/grappler/optimizers/dependency_optimizer.cc:666] Iteration = 0, topological sort failed with message: The graph couldn't be sorted in topological order.
2018-11-08 08:14:25.028003: E tensorflow/core/grappler/optimizers/dependency_optimizer.cc:666] Iteration = 1, topological sort failed with message: The graph couldn't be sorted in topological order.

_Originally posted by @Bonennult in https://github.com/alexlee-gk/video_prediction/issues/9#issuecomment-436829818_

alexlee-gk commented 6 years ago

This issue happens when training a GAN variant (i.e. the GAN or VAE-GAN) with multiple GPUs.

I'll look into this. As a temporary work-around, you can train with a single GPU.

Glooow1024 commented 6 years ago

I set CUDA_VISIBLE_DEVICES=0 and problem was solved. But I met the Resource Exhausted after that. As you said in #6 , I change batchsize to 4 and run again with the script :

CUDA_VISIBLE_DEVICES=0 python scripts/train.py --input_dir data/bair --dataset bair \ 
  --model savp --model_hparams_dict hparams/bair_action_free/ours_savp/model_hparams.json \
  --output_dir logs/bair_action_free/ours_savp \ 
  --gpu_mem_frac 0.7 \ 
  --model_hparams tv_weight=0.001,transformation=flow

Now it seems to be runing correctly, althought it still outputs the same information about topological sort failed occasionally. Thanks a lot.

alexlee-gk commented 6 years ago

Great! However, be aware that that batch size might end up being too small, and the results won't be as good as in the paper. I'll be making a few changes to improve accuracy and reduce the memory footprint. I'll post an update when I do and also when I fix the multi-GPU training.

crequena commented 5 years ago

Dear Alex,

Thank you for sharing the code of this awesome project and congratulations for your results and the very nice paper!

I believe my training also hangs due to this issue if any GAN loss is used, however it also happens when training on a single GPU (Tesla V100, CUDA 9.0.176, tf 1.9.0 and 1.12.0, cudnn 7.1.3). I find that training runs ''smoothly'' if CPU is used :)

It would be awesome if this issue is solved (I would totally help but I am not fluent enough to dig into this problem).

I am building something for climate science based on your work, it would be awesome to talk to you! Check your inbox :)

EDIT: Actually training failed on CPU too. It does so when training reaches progress_freq or summary_freq with the following error:

/Net/Groups/BGI/people/crequ/Anaconda/install/lib/python3.6/site-packages/tensorflow/python/client/session.py in run(self, fetches, feed_dict, options, run_metadata) 898 try: 899 result = self._run(None, fetches, feed_dict, options_ptr, --> 900 run_metadata_ptr) 901 if run_metadata: 902 proto_data = tf_session.TF_GetBuffer(run_metadata_ptr)

/Net/Groups/BGI/people/crequ/Anaconda/install/lib/python3.6/site-packages/tensorflow/python/client/session.py in _run(self, handle, fetches, feed_dict, options, run_metadata) 1133 if final_fetches or final_targets or (handle and feed_dict_tensor): 1134 results = self._do_run(handle, final_targets, final_fetches, -> 1135 feed_dict_tensor, options, run_metadata) 1136 else: 1137 results = []

/Net/Groups/BGI/people/crequ/Anaconda/install/lib/python3.6/site-packages/tensorflow/python/client/session.py in _do_run(self, handle, target_list, fetch_list, feed_dict, options, run_metadata) 1314 if handle is None: 1315 return self._do_call(_run_fn, feeds, fetches, targets, options, -> 1316 run_metadata) 1317 else: 1318 return self._do_call(_prun_fn, handle, feeds, fetches)

/Net/Groups/BGI/people/crequ/Anaconda/install/lib/python3.6/site-packages/tensorflow/python/client/session.py in _do_call(self, fn, *args) 1333 except KeyError: 1334 pass -> 1335 raise type(e)(node_def, op, message) 1336 1337 def _extend_graph(self):

InvalidArgumentError: Retval[4] does not have value

Though closely related it is to note that with GPU it never leaves step 0, with CPU it reaches progress_freq step. Seems like fetches['d_losses'] & fetches['g_losses'] can only be retrieved at initialization, then these are gone, so probably no real training is in progress on CPU either.

alexlee-gk commented 5 years ago

Hi Chris, I have made several improvements in the experimental branch (will soon be merged into master), including this issue with GANs. Can you try the experimental branch and see if the problem persists?

crequena commented 5 years ago

Hey Alex, thanks a lot! I seem to have some dependency problem in the experimental video_prediction/metrics.py trying to import lpips_tf. Maybe I am missing a new requirement?

alexlee-gk commented 5 years ago

Yes, you can install it with pip install -r requirements.txt. (There is one new dependency at the end of that file).

crequena commented 5 years ago

Hey Alex I get this error at build_graph . It happens with a dataset I created (shaped very similarly to kth) but also with 'bair' just as provided by you.

174 175 # inputs comes from the training dataset by default, unless train_handle is remapped to the val_handles --> 176 model.build_graph(inputs) 177 178 if long_val_dataset is not None:

/video_prediction/models/base_model.py in build_graph(self, inputs) 686 self.accum_eval_metrics = OrderedDict() 687 for name, eval_metric in self.evalmetrics.items(): --> 688 , self.accum_evalmetrics['accum' + name] = tf.metrics.mean_tensor(eval_metric) 689 local_variables = set(tf.local_variables()) - original_local_variables 690 self.accum_eval_metrics_reset_op = tf.group([tf.assign(v, tf.zeros_like(v)) for v in local_variables])

/Anaconda/install/lib/python3.6/site-packages/tensorflow/python/ops/metrics_impl.py in mean_tensor(values, weights, metrics_collections, updates_collections, name) 1294 values = math_ops.to_float(values) 1295 total = metric_variable( -> 1296 values.get_shape(), dtypes.float32, name='total_tensor') 1297 count = metric_variable( 1298 values.get_shape(), dtypes.float32, name='count_tensor')

/Anaconda/install/lib/python3.6/site-packages/tensorflow/python/ops/metrics_impl.py in metric_variable(shape, dtype, validate_shape, name) 49 ], 50 validate_shape=validate_shape, ---> 51 name=name) 52 53

/Anaconda/install/lib/python3.6/site-packages/tensorflow/python/ops/variable_scope.py in variable(initial_value, trainable, collections, validate_shape, caching_device, name, dtype, constraint, use_resource) 2232 name=name, dtype=dtype, 2233 constraint=constraint, -> 2234 use_resource=use_resource) 2235 2236

/Anaconda/install/lib/python3.6/site-packages/tensorflow/python/ops/variable_scope.py in (kwargs) 2222 constraint=None, 2223 use_resource=None): -> 2224 previous_getter = lambda kwargs: default_variable_creator(None, **kwargs) 2225 for getter in ops.get_default_graph()._variable_creator_stack: # pylint: disable=protected-access 2226 previous_getter = _make_getter(getter, previous_getter)

Anaconda/install/lib/python3.6/site-packages/tensorflow/python/ops/variable_scope.py in default_variable_creator(next_creator, **kwargs) 2194 collections=collections, validate_shape=validate_shape, 2195 caching_device=caching_device, name=name, dtype=dtype, -> 2196 constraint=constraint) 2197 elif not use_resource and context.executing_eagerly(): 2198 raise RuntimeError(

Anaconda/install/lib/python3.6/site-packages/tensorflow/python/ops/resource_variable_ops.py in init(self, initial_value, trainable, collections, validate_shape, caching_device, name, dtype, variable_def, import_scope, constraint) 310 name=name, 311 dtype=dtype, --> 312 constraint=constraint) 313 314 # pylint: disable=unused-argument

/Anaconda/install/lib/python3.6/site-packages/tensorflow/python/ops/resource_variable_ops.py in _init_from_args(self, initial_value, trainable, collections, validate_shape, caching_device, name, dtype, constraint) 415 with ops.name_scope("Initializer"), ops.device(None): 416 initial_value = ops.convert_to_tensor( --> 417 initial_value(), name="initial_value", dtype=dtype) 418 self._handle = _eager_safe_variable_handle( 419 shape=initial_value.get_shape(),

/Anaconda/install/lib/python3.6/site-packages/tensorflow/python/ops/metrics_impl.py in () 43 44 return variable_scope.variable( ---> 45 lambda: array_ops.zeros(shape, dtype), 46 trainable=False, 47 collections=[

Anaconda/install/lib/python3.6/site-packages/tensorflow/python/ops/array_ops.py in zeros(shape, dtype, name) 1545 except (TypeError, ValueError): 1546 # Happens when shape is a list with tensor elements -> 1547 shape = ops.convert_to_tensor(shape, dtype=dtypes.int32) 1548 if not shape._shape_tuple(): 1549 shape = reshape(shape, [-1]) # Ensure it's a vector

/Anaconda/install/lib/python3.6/site-packages/tensorflow/python/framework/ops.py in convert_to_tensor(value, dtype, name, preferred_dtype) 1009 name=name, 1010 preferred_dtype=preferred_dtype, -> 1011 as_ref=False) 1012 1013

Anaconda/install/lib/python3.6/site-packages/tensorflow/python/framework/ops.py in internal_convert_to_tensor(value, dtype, name, as_ref, preferred_dtype, ctx) 1105 1106 if ret is None: -> 1107 ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref) 1108 1109 if ret is NotImplemented:

/Anaconda/install/lib/python3.6/site-packages/tensorflow/python/framework/constant_op.py in _tensor_shape_tensor_conversion_function(s, dtype, name, as_ref) 236 if not s.is_fully_defined(): 237 raise ValueError( --> 238 "Cannot convert a partially known TensorShape to a Tensor: %s" % s) 239 s_list = s.as_list() 240 int64_value = 0

ValueError: Cannot convert a partially known TensorShape to a Tensor: (?, ?)

alexlee-gk commented 5 years ago

The problem is that one of the metrics is not returning fully defined shapes, and I suspect that it might be the new LPIPS metric causing this. If that’s the case, you can just comment this metric out: https://github.com/alexlee-gk/video_prediction/blob/experimental/video_prediction/models/base_model.py#L149

Unlike the losses, the metrics don’t affect the training.

crequena commented 5 years ago

Unfortunately commenting out that line (or L126, or every line using LPIPS in base_model.py) still leads to the same error.

crequena commented 5 years ago

Hey Alex,

Commenting out anything involving accum_eval_summary both in train.py and base_model.py allows for the training to proceed. That is commenting out: L301-315 and in L317 or should_eval(step, args.accum_eval_summary_freq) in train.py and L687-688, L711-713, L718-720, L722 in base_model.py.

If no GAN loss is used, training works! However, it still seems to get stuck if I use video_image_sn_gan_weight or image_sn_gan_weight > 0 on a single GPU. I also gave a try, blindly (just in case it matter), to the new argument, aggregate_nccl=1 with same results.

Training does not run this time around on CPU either since Max pooling on CPU seems to not like the data format:

InvalidArgumentError (see above for traceback): Default MaxPoolingOp only supports NHWC on device type CPU [[Node: metrics/import/max_pool = MaxPoolT=DT_FLOAT, data_format="NCHW", ksize=[1, 1, 3, 3], padding="VALID", strides=[1, 1, 2, 2], _device="/job:localhost/replica:0/task:0/device:CPU:0"]]

Thank you for all your work! :)

alexlee-gk commented 5 years ago

Thanks for the detailed reporting. All the mentioned issues should be fixed as of now:

Metrics shapes not fully defined. This occurred only on tf 1.9 because that version seems to have weaker static shape inference compared to tf >= 1.10. If you pull my changes, it should now work with tf 1.9.
LPIPS metric not supported on CPU. I updated the LPIPS models to support both GPU and CPU. Make sure to clear the cache of the old models: rm ~/.lpips/*.
The training doesn't get stuck for me, so I haven't changed anything to the repo in regards to that. The SAVP (i.e VAE-GAN) model trains for me on Titan X, P100, and V100 GPUs, with single and multi GPU training, python 3.5 and 3.6, tf 1.9 and 1.12, and cudnn 7.3.0.29. In my case with tf 1.12, the training script reports that images are processed at about 13.2 and 16.3 images/s when using 1 and 2 V100 GPUs, respectively. Just in case, can you make sure you pull all changes and try again? This is the command I use:
```
CUDA_VISIBLE_DEVICES=0 python scripts/train.py --input_dir data/bair --dataset bair --model savp --model_hparams_dict hparams/bair_action_free/ours_savp/model_hparams.json --output_dir logs/bair_action_free/ours_savp
```

Also, the option aggregate_nccl only matters for multi-GPU training, and it specifies how the gradients should be aggregated. Enabling it has resulted in slower training when I have tried it, so it's better to leave the default.

crequena commented 5 years ago

Hi Alex, I just followed your instructions in the previous post and training in single and multiple gpu is totally working! Thank you so much for your dedication!

alexlee-gk commented 5 years ago

That's great! I'll close this issue then. Feel free to re-open or open another one if another issue arises.

alexlee-gk / video_prediction

Training hangs when training a GAN with multiple GPUs #10