carpedm20 / DCGAN-tensorflow

A tensorflow implementation of "Deep Convolutional Generative Adversarial Networks"
http://carpedm20.github.io/faces/
MIT License

Run on Horovod #330

Open benckx opened 5 years ago

benckx commented 5 years ago

I've tried to adapt the model to run on Horovod: https://github.com/benckx/DCGAN-tensorflow/blob/run_on_horovod/model.py

I'm getting the following error. It doesn't happen when hooks is None or when I comment out the sampling call (sess.run(self.sampler)):

(tensorflow4) benoit@farm:~/DCGAN-tensorflow$ python3 main.py --epoch 50 --dataset images-100x150 --grid_width 5 --grid_height 5 --sample_rate 1 --train
{'batch_size': <absl.flags._flag.Flag object at 0x7f8046d01cc0>,
 'beta1': <absl.flags._flag.Flag object at 0x7f8046d01128>,
 'checkpoint_dir': <absl.flags._flag.Flag object at 0x7f8046c9a198>,
 'crop': <absl.flags._flag.BooleanFlag object at 0x7f8046c9a358>,
 'dataset': <absl.flags._flag.Flag object at 0x7f8046c9a080>,
 'epoch': <absl.flags._flag.Flag object at 0x7f805372a0b8>,
 'generate_test_images': <absl.flags._flag.Flag object at 0x7f8046c9a470>,
 'grid_height': <absl.flags._flag.Flag object at 0x7f8046d01d30>,
 'grid_width': <absl.flags._flag.Flag object at 0x7f8046d01dd8>,
 'h': <tensorflow.python.platform.app._HelpFlag object at 0x7f80425aa518>,
 'help': <tensorflow.python.platform.app._HelpFlag object at 0x7f80425aa518>,
 'helpfull': <tensorflow.python.platform.app._HelpfullFlag object at 0x7f8042539898>,
 'helpshort': <tensorflow.python.platform.app._HelpshortFlag object at 0x7f8042539cf8>,
 'input_fname_pattern': <absl.flags._flag.Flag object at 0x7f8046c9a0f0>,
 'input_height': <absl.flags._flag.Flag object at 0x7f8046d01e80>,
 'input_width': <absl.flags._flag.Flag object at 0x7f8046d01ef0>,
 'learning_rate': <absl.flags._flag.Flag object at 0x7f804c2ae358>,
 'nbr_of_layers_d': <absl.flags._flag.Flag object at 0x7f8046c9a518>,
 'nbr_of_layers_g': <absl.flags._flag.Flag object at 0x7f8046c9a5c0>,
 'output_height': <absl.flags._flag.Flag object at 0x7f8046d01f60>,
 'output_width': <absl.flags._flag.Flag object at 0x7f8046d01fd0>,
 'sample_dir': <absl.flags._flag.Flag object at 0x7f8046c9a208>,
 'sample_rate': <absl.flags._flag.Flag object at 0x7f8046c9a278>,
 'train': <absl.flags._flag.BooleanFlag object at 0x7f8046c9a2b0>,
 'train_size': <absl.flags._flag.Flag object at 0x7f8046d01b70>,
 'use_checkpoints': <absl.flags._flag.BooleanFlag object at 0x7f8046c9a5f8>,
 'visualize': <absl.flags._flag.BooleanFlag object at 0x7f8046c9a3c8>}
init generator with 5 layers ...
WARNING:tensorflow:From /home/benoit/tensorflow4/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
init discriminator with 5 layers ...
init discriminator with 5 layers ...
---------
Variables: name (type shape) [size]
---------
generator/g_h0_lin/Matrix:0 (float32_ref 100x35840) [3584000, bytes: 14336000]
generator/g_h0_lin/bias:0 (float32_ref 35840) [35840, bytes: 143360]
generator/g_bn0/beta:0 (float32_ref 512) [512, bytes: 2048]
generator/g_bn0/gamma:0 (float32_ref 512) [512, bytes: 2048]
generator/g_h1/w:0 (float32_ref 5x5x256x512) [3276800, bytes: 13107200]
generator/g_h1/biases:0 (float32_ref 256) [256, bytes: 1024]
generator/g_bn1/beta:0 (float32_ref 256) [256, bytes: 1024]
generator/g_bn1/gamma:0 (float32_ref 256) [256, bytes: 1024]
generator/g_h2/w:0 (float32_ref 5x5x128x256) [819200, bytes: 3276800]
generator/g_h2/biases:0 (float32_ref 128) [128, bytes: 512]
generator/g_bn2/beta:0 (float32_ref 128) [128, bytes: 512]
generator/g_bn2/gamma:0 (float32_ref 128) [128, bytes: 512]
generator/g_h3/w:0 (float32_ref 5x5x64x128) [204800, bytes: 819200]
generator/g_h3/biases:0 (float32_ref 64) [64, bytes: 256]
generator/g_bn3/beta:0 (float32_ref 64) [64, bytes: 256]
generator/g_bn3/gamma:0 (float32_ref 64) [64, bytes: 256]
generator/g_h4/w:0 (float32_ref 5x5x3x64) [4800, bytes: 19200]
generator/g_h4/biases:0 (float32_ref 3) [3, bytes: 12]
discriminator/d_h0_conv/w:0 (float32_ref 5x5x3x64) [4800, bytes: 19200]
discriminator/d_h0_conv/biases:0 (float32_ref 64) [64, bytes: 256]
discriminator/d_h1_conv/w:0 (float32_ref 5x5x64x128) [204800, bytes: 819200]
discriminator/d_h1_conv/biases:0 (float32_ref 128) [128, bytes: 512]
discriminator/d_bn1/beta:0 (float32_ref 128) [128, bytes: 512]
discriminator/d_bn1/gamma:0 (float32_ref 128) [128, bytes: 512]
discriminator/d_h2_conv/w:0 (float32_ref 5x5x128x256) [819200, bytes: 3276800]
discriminator/d_h2_conv/biases:0 (float32_ref 256) [256, bytes: 1024]
discriminator/d_bn2/beta:0 (float32_ref 256) [256, bytes: 1024]
discriminator/d_bn2/gamma:0 (float32_ref 256) [256, bytes: 1024]
discriminator/d_h3_conv/w:0 (float32_ref 5x5x256x512) [3276800, bytes: 13107200]
discriminator/d_h3_conv/biases:0 (float32_ref 512) [512, bytes: 2048]
discriminator/d_bn3/beta:0 (float32_ref 512) [512, bytes: 2048]
discriminator/d_bn3/gamma:0 (float32_ref 512) [512, bytes: 2048]
discriminator/d_h4_lin/Matrix:0 (float32_ref 35840x1) [35840, bytes: 143360]
discriminator/d_h4_lin/bias:0 (float32_ref 1) [1, bytes: 4]
Total size of variables: 12272004
Total bytes of variables: 49088016
2019-04-08 17:26:13.688741: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-04-08 17:26:14.307055: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-04-08 17:26:14.342296: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-04-08 17:26:14.354569: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-04-08 17:26:14.365357: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-04-08 17:26:14.379888: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-04-08 17:26:14.381380: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x44d7f50 executing computations on platform CUDA. Devices:
2019-04-08 17:26:14.381393: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): GeForce GTX 1070, Compute Capability 6.1
2019-04-08 17:26:14.381399: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (1): GeForce GTX 1070, Compute Capability 6.1
2019-04-08 17:26:14.381404: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (2): GeForce GTX 1070, Compute Capability 6.1
2019-04-08 17:26:14.381408: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (3): GeForce GTX 1070, Compute Capability 6.1
2019-04-08 17:26:14.381413: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (4): GeForce GTX 1070, Compute Capability 6.1
2019-04-08 17:26:14.400020: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3600000000 Hz
2019-04-08 17:26:14.400368: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x2a05fb0 executing computations on platform Host. Devices:
2019-04-08 17:26:14.400393: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): <undefined>, <undefined>
2019-04-08 17:26:14.400630: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties: 
name: GeForce GTX 1070 major: 6 minor: 1 memoryClockRate(GHz): 1.7465
pciBusID: 0000:01:00.0
totalMemory: 7.93GiB freeMemory: 7.80GiB
2019-04-08 17:26:14.400648: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2019-04-08 17:26:14.406968: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-04-08 17:26:14.406980: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]      0 
2019-04-08 17:26:14.406987: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0:   N 
2019-04-08 17:26:14.407172: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7587 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1070, pci bus id: 0000:01:00.0, compute capability: 6.1)
2019-04-08 17:26:18.194280: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally
Epoch: [ 0] [   0/ 351] time: 7.3941
[Sample] d_loss: 0.53202271, g_loss: 2.22091031
Epoch: [ 0] [   1/ 351] time: 9.3274
[Sample] d_loss: 1.45547330, g_loss: 1.22654331
Traceback (most recent call last):
  File "/home/benoit/tensorflow4/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1334, in _do_call
    return fn(*args)
  File "/home/benoit/tensorflow4/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/home/benoit/tensorflow4/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: You must feed a value for placeholder tensor 'real_images' with dtype float and shape [25,150,100,3]
     [[{{node real_images}}]]
     [[{{node global_step}}]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "main.py", line 145, in <module>
    tf.app.run()
  File "/home/benoit/tensorflow4/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "main.py", line 128, in main
    dcgan.train(FLAGS)
  File "/home/benoit/DCGAN-tensorflow/model.py", line 304, in train
    feed_dict={ self.z: batch_z })
  File "/home/benoit/tensorflow4/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 676, in run
    run_metadata=run_metadata)
  File "/home/benoit/tensorflow4/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1171, in run
    run_metadata=run_metadata)
  File "/home/benoit/tensorflow4/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1270, in run
    raise six.reraise(*original_exc_info)
  File "/home/benoit/tensorflow4/lib/python3.6/site-packages/six.py", line 693, in reraise
    raise value
  File "/home/benoit/tensorflow4/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1255, in run
    return self._sess.run(*args, **kwargs)
  File "/home/benoit/tensorflow4/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1327, in run
    run_metadata=run_metadata)
  File "/home/benoit/tensorflow4/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1091, in run
    return self._sess.run(*args, **kwargs)
  File "/home/benoit/tensorflow4/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 929, in run
    run_metadata_ptr)
  File "/home/benoit/tensorflow4/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1152, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/benoit/tensorflow4/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1328, in _do_run
    run_metadata)
  File "/home/benoit/tensorflow4/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1348, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: You must feed a value for placeholder tensor 'real_images' with dtype float and shape [25,150,100,3]
     [[node real_images (defined at /home/benoit/DCGAN-tensorflow/model.py:111) ]]
     [[node global_step (defined at /home/benoit/DCGAN-tensorflow/model.py:159) ]]

Caused by op 'real_images', defined at:
  File "main.py", line 145, in <module>
    tf.app.run()
  File "/home/benoit/tensorflow4/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "main.py", line 123, in main
    use_checkpoints=FLAGS.use_checkpoints)
  File "/home/benoit/DCGAN-tensorflow/model.py", line 97, in __init__
    self.build_model()
  File "/home/benoit/DCGAN-tensorflow/model.py", line 111, in build_model
    tf.float32, [self.batch_size] + image_dims, name='real_images')
  File "/home/benoit/tensorflow4/lib/python3.6/site-packages/tensorflow/python/ops/array_ops.py", line 2077, in placeholder
    return gen_array_ops.placeholder(dtype=dtype, shape=shape, name=name)
  File "/home/benoit/tensorflow4/lib/python3.6/site-packages/tensorflow/python/ops/gen_array_ops.py", line 5791, in placeholder
    "Placeholder", dtype=dtype, shape=shape, name=name)
  File "/home/benoit/tensorflow4/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
    op_def=op_def)
  File "/home/benoit/tensorflow4/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/home/benoit/tensorflow4/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3300, in create_op
    op_def=op_def)
  File "/home/benoit/tensorflow4/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1801, in __init__
    self._traceback = tf_stack.extract_stack()

InvalidArgumentError (see above for traceback): You must feed a value for placeholder tensor 'real_images' with dtype float and shape [25,150,100,3]
     [[node real_images (defined at /home/benoit/DCGAN-tensorflow/model.py:111) ]]
     [[node global_step (defined at /home/benoit/DCGAN-tensorflow/model.py:159) ]]

I guess it's related to the fact that d_optim and g_optim are configured to be distributed:

  d_optim = hvd.DistributedOptimizer(tf.train.AdamOptimizer(config.learning_rate, beta1=config.beta1)).minimize(
    self.d_loss, var_list=self.d_vars, global_step=global_step)
  g_optim = hvd.DistributedOptimizer(tf.train.AdamOptimizer(config.learning_rate, beta1=config.beta1)).minimize(
    self.g_loss, var_list=self.g_vars, global_step=global_step)

I can do this:

  sess.run([d_optim, self.d_sum], feed_dict={self.inputs: batch_images, self.z: batch_z, self.y: batch_labels})
  sess.run([g_optim, self.g_sum], feed_dict={self.z: batch_z, self.y: batch_labels})

But self.sampler is not "in sync" with Horovod, so running it triggers the error above:

  sess.run([self.sampler, self.d_loss, self.g_loss], feed_dict={self.z: sample_z, self.inputs: sample_inputs, self.y: sample_labels})

So I suppose it's not a bug but something in how the model is set up. Has anyone tried something like this before?
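
For anyone trying to picture the failure mode: in TensorFlow 1.x, a tf.train.SessionRunHook can return extra fetches from before_run, and a monitored session adds them to every run call. That would be consistent with the traceback, where the failing call feeds only self.z yet TensorFlow asks for 'real_images'. Below is a toy sketch of that mechanism in plain Python, with no TensorFlow involved. Every class and function name in it is hypothetical, and the hook fetching d_loss is an assumption made for illustration; the log does not show which hook or tensor is actually responsible.

```python
# Toy model only: none of these classes exist in TensorFlow or Horovod.

class ToyGraph:
    """Maps each tensor name to the set of placeholders it depends on."""

    def __init__(self, deps):
        self.deps = deps

    def run(self, fetches, feed_dict):
        # Refuse to evaluate a fetch whose placeholders are not all fed.
        for name in fetches:
            missing = sorted(self.deps[name] - set(feed_dict))
            if missing:
                raise ValueError(
                    "You must feed a value for placeholder tensor %r "
                    "(needed by %r)" % (missing[0], name))
        return list(fetches)


class ToyHook:
    """Stands in for a hook that asks for extra fetches on every run call."""

    def __init__(self, fetches):
        self._fetches = list(fetches)

    def extra_fetches(self):
        return self._fetches


class ToyMonitoredSession:
    """Merges each hook's extra fetches into the caller's fetches."""

    def __init__(self, graph, hooks=None):
        self.graph = graph
        self.hooks = hooks or []

    def run(self, fetches, feed_dict):
        extra = [f for hook in self.hooks for f in hook.extra_fetches()]
        results = self.graph.run(list(fetches) + extra, feed_dict)
        # The caller only sees the results for what it asked for.
        return results[:len(fetches)]


# Assumed dependencies: the sampler needs only z, the loss also needs images.
graph = ToyGraph({
    "sampler": {"z"},
    "d_loss": {"z", "real_images"},
})


def try_sampling(hooks):
    """Run the sampler while feeding only z, as in the failing call."""
    sess = ToyMonitoredSession(graph, hooks)
    try:
        return sess.run(["sampler"], feed_dict={"z": [0.0]})
    except ValueError as err:
        return str(err)


print(try_sampling(hooks=None))                   # sampler runs on its own
print(try_sampling(hooks=[ToyHook(["d_loss"])]))  # hook's fetch needs real_images
```

In this toy, the sampler call succeeds without hooks and fails once a hook adds a fetch that depends on an unfed placeholder, which mirrors the "works when hooks is None" observation above.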

Environment:

  1. Framework: TensorFlow
  2. Framework version: 1.13.1
  3. Horovod version: 0.16.1
  4. MPI version: 4.0.1
  5. CUDA version: 10.0
  6. NCCL version:
  7. Python version: 3.6.7
  8. OS and version: Ubuntu 18.04 / Linux Mint 19.1
  9. Driver version: 410.104

benckx commented 5 years ago

The guys from Horovod showed me how to fix the issue. If you're interested: https://github.com/horovod/horovod/issues/997