avolkov1 / keras_experiments

Experimental Keras libraries and examples.
The Unlicense

Multiple GPUs slower than single GPU #13

Open ErlendFax opened 6 years ago

ErlendFax commented 6 years ago

I'm having issues using multiple GPUs. When I use CUDA_VISIBLE_DEVICES=0 python mnist_tfrecord_mgpu.py everything works fine. When I use CUDA_VISIBLE_DEVICES=0,1,2,3 python mnist_tfrecord_mgpu.py everything seems normal and no errors are printed, but training time increases dramatically, and loss and accuracy barely change (if they change at all).

Anyone else experiencing similar behavior? Looks like it's not even using all four GPUs.

Console output when training with four GPUs:

(...)

45/107 [===========>..................] - ETA: 119s - loss: 1.7269 - acc: 0.099
46/107 [===========>..................] - ETA: 118s - loss: 1.7269 - acc: 0.099
47/107 [============>.................] - ETA: 116s - loss: 1.7269 - acc: 0.099
48/107 [============>.................] - ETA: 114s - loss: 1.7269 - acc: 0.098
49/107 [============>.................] - ETA: 112s - loss: 1.7269 - acc: 0.098
50/107 [=============>................] - ETA: 110s - loss: 1.7269 - acc: 0.098
51/107 [=============>................] - ETA: 108s - loss: 1.7269 - acc: 0.098
52/107 [=============>................] - ETA: 106s - loss: 1.7269 - acc: 0.098
53/107 [=============>................] - ETA: 104s - loss: 1.7269 - acc: 0.098
54/107 [==============>...............] - ETA: 102s - loss: 1.7269 - acc: 0.098
55/107 [==============>...............] - ETA: 100s - loss: 1.7269 - acc: 0.098
56/107 [==============>...............] - ETA: 98s - loss: 1.7269 - acc: 0.0987
107/107 [==============================] - 206s - loss: 1.7269 - acc: 0.0999
[TRAINING] finished in 1036780 ms

nvidia-smi output: attached as a screenshot (smi_done)

avolkov1 commented 6 years ago

@ErlendFax My hunch is that something about the multi-GPU hardware configuration on your system is not set up right. Could you run the following command to show your GPU topology and paste the output: nvidia-smi topo -m. Mine looks like this:

$ nvidia-smi topo -m
        GPU0    GPU1    GPU2    GPU3    mlx5_0  CPU Affinity
GPU0     X      SOC     SOC     SOC     SOC     0-15
GPU1    SOC      X      PHB     PHB     PHB     16-31
GPU2    SOC     PHB      X      PIX     PHB     16-31
GPU3    SOC     PHB     PIX      X      PHB     16-31
mlx5_0  SOC     PHB     PHB     PHB      X

Legend:

  X   = Self
  SOC  = Connection traversing PCIe as well as the SMP link between CPU sockets(e.g. QPI)
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe switches (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing a single PCIe switch
  NV#  = Connection traversing a bonded set of # NVLinks

Try running on each GPU individually and verify that the example also works fine on the other GPUs (a quick TensorFlow device-visibility check is sketched after the commands below):

CUDA_VISIBLE_DEVICES=1 python mnist_tfrecord_mgpu.py
CUDA_VISIBLE_DEVICES=2 python mnist_tfrecord_mgpu.py
CUDA_VISIBLE_DEVICES=3 python mnist_tfrecord_mgpu.py
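
As an extra sanity check (not part of the example script), you can also list the GPUs TensorFlow itself sees under a given CUDA_VISIBLE_DEVICES setting; a minimal sketch:

# Sketch: print the GPU devices TensorFlow can see. Note that listing the
# devices initializes them and grabs GPU memory in this process.
from tensorflow.python.client import device_lib

gpus = [d.name for d in device_lib.list_local_devices()
        if d.device_type == 'GPU']
print(gpus)  # expect one entry per GPU exposed via CUDA_VISIBLE_DEVICES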

Try using 2 GPUs and vary which pairs you're using. You should observe some speedup:

CUDA_VISIBLE_DEVICES=0,1 python mnist_tfrecord_mgpu.py
CUDA_VISIBLE_DEVICES=0,2 python mnist_tfrecord_mgpu.py
CUDA_VISIBLE_DEVICES=1,2 python mnist_tfrecord_mgpu.py

I'm running on Tesla GPUs and don't have access to a workstation with multiple GeForce GPUs, so it's hard for me to test. But refer to this link for an example and known issues with GeForce setups:

https://github.com/rossumai/keras-multi-gpu/blob/master/blog/docs/hardware.md

Maybe @bzamecnik could offer some advice because he also ran these on GeForce multi-GPU workstations. Also, it would help if you could post results from the CUDA SDK utilities:

bandwidthTest - pure unidirectional bandwidth on single GPU
p2pBandwidthLatencyTest - combination of uni/bi-direction communication between pairs

I'm using the docker container tensorflow/tensorflow:1.3.0-devel-gpu via nvidia-docker. My results on a 4-GPU P40 machine:

# CUDA_VISIBLE_DEVICES=0 python ./examples/mnist/mnist_tfrecord_mgpu.py
Epoch 1/5
429/429 [==============================] - 5s - loss: 0.2050 - acc: 0.9378
Epoch 2/5
429/429 [==============================] - 3s - loss: 0.0768 - acc: 0.9773
Epoch 3/5
429/429 [==============================] - 3s - loss: 0.0629 - acc: 0.9820
Epoch 4/5
429/429 [==============================] - 3s - loss: 0.0528 - acc: 0.9845
Epoch 5/5
429/429 [==============================] - 3s - loss: 0.0477 - acc: 0.9863
[TRAINING] finished in 19975 ms
# CUDA_VISIBLE_DEVICES=0,1 python ./examples/mnist/mnist_tfrecord_mgpu.py
Epoch 1/5
214/214 [==============================] - 3s - loss: 0.2845 - acc: 0.9157
Epoch 2/5
214/214 [==============================] - 2s - loss: 0.0795 - acc: 0.9761
Epoch 3/5
214/214 [==============================] - 2s - loss: 0.0601 - acc: 0.9819
Epoch 4/5
214/214 [==============================] - 2s - loss: 0.0523 - acc: 0.9846
Epoch 5/5
214/214 [==============================] - 2s - loss: 0.0475 - acc: 0.9863
[TRAINING] finished in 12697 ms
# CUDA_VISIBLE_DEVICES=0,1,2,3 python ./examples/mnist/mnist_tfrecord_mgpu.py
Epoch 1/5
107/107 [==============================] - 4s - loss: 0.5218 - acc: 0.8619
Epoch 2/5
107/107 [==============================] - 1s - loss: 0.1063 - acc: 0.9684
Epoch 3/5
107/107 [==============================] - 1s - loss: 0.0822 - acc: 0.9758
Epoch 4/5
107/107 [==============================] - 1s - loss: 0.0697 - acc: 0.9795
Epoch 5/5
107/107 [==============================] - 1s - loss: 0.0577 - acc: 0.9829
[TRAINING] finished in 10457 ms
bzamecnik commented 6 years ago

Some GPU bandwidth measurement results are at: https://github.com/rossumai/keras-multi-gpu/tree/master/experiments/gpu_bandwidth/cuda_samples.

Btw: we released a summary of the GitHub articles, Towards Efficient Multi-GPU Training in Keras with TensorFlow, on Medium.

ErlendFax commented 6 years ago

Thank you for your response @avolkov1 and @bzamecnik! Highly appreciated.

nvidia-smi topo -m gives:

        GPU0    GPU1    GPU2    GPU3    CPU Affinity
GPU0     X      PHB     SOC     SOC     0-15
GPU1    PHB      X      SOC     SOC     0-15
GPU2    SOC     SOC      X      PHB     0-15
GPU3    SOC     SOC     PHB      X      0-15

Legend:

  X   = Self
  SOC  = Connection traversing PCIe as well as the SMP link between CPU sockets(e.g. QPI)
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe switches (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing a single PCIe switch
  NV#  = Connection traversing a bonded set of # NVLinks

@bzamecnik, I tried to run your measure_gpu_bandwidth.sh script, but it froze at p2p.

So instead I ran ./p2pBandwidthLatencyTest directly, which gave me the following (it froze at "Unidirectional P2P=Enabled Bandwidth Matrix (GB/s)"):

[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, GeForce GTX 1080 Ti, pciBusID: 8, pciDeviceID: 0, pciDomainID:0
Device: 1, GeForce GTX 1080 Ti, pciBusID: 9, pciDeviceID: 0, pciDomainID:0
Device: 2, GeForce GTX 1080 Ti, pciBusID: 42, pciDeviceID: 0, pciDomainID:0
Device: 3, GeForce GTX 1080 Ti, pciBusID: 43, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=0 CAN Access Peer Device=2
Device=0 CAN Access Peer Device=3
Device=1 CAN Access Peer Device=0
Device=1 CAN Access Peer Device=2
Device=1 CAN Access Peer Device=3
Device=2 CAN Access Peer Device=0
Device=2 CAN Access Peer Device=1
Device=2 CAN Access Peer Device=3
Device=3 CAN Access Peer Device=0
Device=3 CAN Access Peer Device=1
Device=3 CAN Access Peer Device=2

***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) in those cases.

P2P Connectivity Matrix
     D\D     0     1     2     3
     0       1     1     1     1
     1       1     1     1     1
     2       1     1     1     1
     3       1     1     1     1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3 
     0 351.91   5.65   5.65   5.65 
     1   5.16 351.91   5.72   5.73 
     2   5.66   5.66 352.55   5.64 
     3   5.73   5.73   5.15 350.97 
Unidirectional P2P=Enabled Bandwidth Matrix (GB/s)

When I run ./bandwidthTest 0 1 2 3 (only one of the devices appears, is that a problem?):

[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: GeForce GTX 1080 Ti
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)    Bandwidth(MB/s)
   33554432         6515.8

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)    Bandwidth(MB/s)
   33554432         5801.7

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)    Bandwidth(MB/s)
   33554432         345280.5

Result = PASS

I'm able to run mnist_tfrecord_mgpu.py separately on each of the GPUs.

For example: CUDA_VISIBLE_DEVICES=2 python mnist_tfrecord_mgpu.py returns:

Epoch 1/5
  1/429 [..............................] - ETA: 9:15 - loss: 2.3242 - acc: 0.085
  8/429 [..............................] - ETA: 1:10 - loss: 1.7712 - acc: 0.408
 16/429 [>.............................] - ETA: 36s - loss: 1.2759 - acc: 0.5869
429/429 [==============================] - ETA: 0s - loss: 0.2036 - acc: 0.9387
Epoch 2/5
429/429 [==============================] - ETA: 0s - loss: 0.0767 - acc: 0.9775
Epoch 3/5
429/429 [==============================] - ETA: 0s - loss: 0.0618 - acc: 0.9818
Epoch 4/5
429/429 [==============================] - ETA: 0s - loss: 0.0531 - acc: 0.9850
Epoch 5/5
429/429 [==============================] - ETA: 0s - loss: 0.0469 - acc: 0.9866
[TRAINING] finished in 15443 ms

Also, I tried running on two GPUs at a time; it was a bit faster than four, but the accuracy was still bad:

26/214 [==>...........................] - ETA: 3:15 - loss: 2.3124 - acc: 0.095

Do you think it's a hardware configuration problem?

avolkov1 commented 6 years ago

@ErlendFax Not sure. Are you running the latest version of mnist_tfrecord_mgpu.py? The name is a misnomer, too, because it's not actually using TFRecords to load data; instead it loads from numpy arrays. Snippet from that code:

import numpy as np
import tensorflow as tf
from tensorflow.contrib.learn.python.learn.datasets import mnist

    data = mnist.load_mnist()  # loads the MNIST numpy arrays

    # batch_size, capacity, min_after_dequeue and enqueue_many are defined
    # elsewhere in the script; shuffle_batch builds the input queue from the
    # in-memory numpy arrays rather than from TFRecord files.
    x_train_batch, y_train_batch = tf.train.shuffle_batch(
        tensors=[data.train.images, data.train.labels.astype(np.int32)],
        batch_size=batch_size,
        capacity=capacity,
        min_after_dequeue=min_after_dequeue,
        enqueue_many=enqueue_many,
        num_threads=8)

Could you run the cifar10 example? Does that work? Is it any faster with multiple GPUs, and does the accuracy make sense?

CUDA_VISIBLE_DEVICES=0 python ./examples/cifar/cifar10_cnn_mgpu.py --mgpu --epochs=5  # 1 GPU
CUDA_VISIBLE_DEVICES=0,1 python ./examples/cifar/cifar10_cnn_mgpu.py --mgpu --epochs=5  # 2 GPUs

I have a corresponding tfqueue-based example with Cifar10: cifar10_cnn_mgpu_tfqueue.py. Maybe try that one also. Let me know if there are issues with the Cifar10 examples.

bzamecnik commented 6 years ago

@avolkov1 Since data.train.images is a numpy array and tf.train.shuffle_batch() takes a tensor as input, it's possible it will bake the whole numpy array into the graph as a constant. For input from a Python generator there seems to be a tf.contrib.training.python_input() function. However, since TF 1.4 it's superseded by Dataset.from_generator() in the Dataset API (not available in TF 1.3 yet).
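
For what it's worth, a minimal sketch of the Dataset.from_generator() route (assuming TF >= 1.4; the dummy generator, shapes, and batch size are made up for illustration):

import numpy as np
import tensorflow as tf

def gen():
    # Hypothetical generator; a real script would yield (image, label) pairs.
    while True:
        yield np.random.rand(784).astype(np.float32), np.int32(0)

dataset = tf.data.Dataset.from_generator(
    gen, (tf.float32, tf.int32),
    (tf.TensorShape([784]), tf.TensorShape([])))
dataset = dataset.batch(128)
x_batch, y_batch = dataset.make_one_shot_iterator().get_next()
# x_batch / y_batch can then be wired into the model the same way the
# shuffle_batch tensors are used in mnist_tfrecord_mgpu.py.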

bzamecnik commented 6 years ago

Another proof-of-concept step: https://gist.github.com/bzamecnik/3c2b5279a5949d694421d7cfbe813557

Inputs to the StagingArea are provided via Variable assignment, which can be decoupled from Keras' Session.run(). The pipelining logic is encapsulated in a callback, and it's able to slice an input numpy array into batches.

It still relies on feed_dict, but we don't have to hack Keras inputs. It also still needs to handle the validation set, prediction, etc., and it assumes one feature input and one label output.
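
For anyone skimming, here is a minimal sketch of the underlying StagingArea put/get pattern (this is not the gist itself; the placeholder shapes, batch size, and toy compute op are assumptions for illustration):

import numpy as np
import tensorflow as tf

x_in = tf.placeholder(tf.float32, shape=(None, 784))
y_in = tf.placeholder(tf.int32, shape=(None,))

with tf.device('/gpu:0'):
    area = tf.contrib.staging.StagingArea(dtypes=[tf.float32, tf.int32])
    put_op = area.put([x_in, y_in])   # stages the *next* batch on the GPU
    x_gpu, y_gpu = area.get()         # pops the previously staged batch
    # Stand-in for the real model / train op.
    output = tf.reduce_sum(x_gpu) + tf.reduce_sum(tf.cast(y_gpu, tf.float32))

with tf.Session(config=tf.ConfigProto(allow_soft_placement=True)) as sess:
    batches = [(np.random.rand(32, 784).astype(np.float32),
                np.zeros(32, dtype=np.int32)) for _ in range(4)]
    sess.run(put_op, feed_dict={x_in: batches[0][0], y_in: batches[0][1]})
    for x_np, y_np in batches[1:]:
        # One Session.run overlaps staging of the next batch with compute
        # on the previously staged one.
        sess.run([output, put_op], feed_dict={x_in: x_np, y_in: y_np})
    sess.run(output)  # drain the last staged batch

The key point is that a single Session.run() both stages the next batch and computes on the previously staged one, so the host-to-device copy overlaps with GPU work.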

avolkov1 commented 6 years ago

@bzamecnik Nice! You probably meant to post the comment to issue #2. Please re-post there if you don't mind. Thanks. I've been playing with Horovod, and it's good if you require synchronization of gradients, but asynchronous training is faster. Getting StagingArea working with multigpu support would be awesome.
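
For reference, the basic Horovod + TensorFlow pattern I've been experimenting with looks roughly like this (a sketch assuming a recent TF 1.x and horovod.tensorflow; the toy model and step count are made up):

import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()

# Pin each MPI process to a single GPU.
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# Toy model standing in for the real network.
x = tf.random_normal([32, 10])
loss = tf.reduce_mean(tf.square(tf.layers.dense(x, 1)))

opt = tf.train.AdamOptimizer(1e-3 * hvd.size())  # scale LR by worker count
opt = hvd.DistributedOptimizer(opt)              # allreduce-averaged gradients
train_op = opt.minimize(loss)

with tf.Session(config=config) as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(hvd.broadcast_global_variables(0))  # sync initial weights from rank 0
    for _ in range(100):
        sess.run(train_op)

Launched with something like mpirun -np 4 python train_hvd.py, one process per GPU.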

bzamecnik commented 6 years ago

Aah, yeah. It's after midnight and my head is a bit tired. Comment re-posted. :) Good to hear about trying Horovod. I'd be glad to hear about your experience.

ErlendFax commented 6 years ago

@avolkov1 I did as you said and checked that mnist_tfrecord_mgpu.py was up to date; it was. When running both cifar10 programs with CUDA_VISIBLE_DEVICES=0,1 python ./examples/cifar/cifar10_cnn_mgpu.py --mgpu --epochs=5 (and likewise the tfqueue variant) I get the same issue, and single GPU still works perfectly.

I tried a few other Keras / TF multi GPU APIs and the same issue shows up (make_parallel(), training_utils.py).

Annoying...

After attempting a multi-GPU run, I have to restart the terminal window before I can run anything else on the GPUs. If I don't, it returns the following. Is this normal?

2017-10-27 10:31:55.600499: W tensorflow/core/common_runtime/bfc_allocator.cc:277] *_*_************************************************************************************xxxxxxxxxxxx
2017-10-27 10:31:55.600524: W tensorflow/core/framework/op_kernel.cc:1192] Resource exhausted: OOM when allocating tensor with shape[2304,512]
Traceback (most recent call last):
  File "cifar10_cnn_mgpu_tfqueue.py", line 378, in <module>
    main()
  File "cifar10_cnn_mgpu_tfqueue.py", line 314, in main
    target_tensors=[y_train_batch])
  File "/home/kyb/Documents/keras_experiments/keras_exp/multigpu/_multigpu.py", line 390, in compile
    self._run_initsync()
  File "/home/kyb/Documents/keras_experiments/keras_exp/multigpu/_multigpu.py", line 413, in _run_initsync
    sess.run(init_op)
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 895, in run
    run_metadata_ptr)
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1124, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1321, in _do_run
    options, run_metadata)
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1340, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[512]
     [[Node: dense_1/bias/Assign = Assign[T=DT_FLOAT, _class=["loc:@dense_1/bias"], use_locking=true, validate_shape=true, _device="/job:localhost/replica:0/task:0/gpu:0"](dense_1/bias, dense_1/Const)]]

Caused by op u'dense_1/bias/Assign', defined at:
  File "cifar10_cnn_mgpu_tfqueue.py", line 378, in <module>
    main()
  File "cifar10_cnn_mgpu_tfqueue.py", line 297, in main
    filepath if checkpt_flag else None)
  File "cifar10_cnn_mgpu_tfqueue.py", line 99, in make_model
    model.add(KL.Dense(512))
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/keras/models.py", line 475, in add
    output_tensor = layer(self.outputs[0])
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/keras/engine/topology.py", line 576, in __call__
    self.build(input_shapes[0])
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/keras/layers/core.py", line 836, in build
    constraint=self.bias_constraint)
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/keras/legacy/interfaces.py", line 87, in wrapper
    return func(*args, **kwargs)
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/keras/engine/topology.py", line 400, in add_weight
    constraint=constraint)
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/keras/backend/tensorflow_backend.py", line 379, in variable
    v = tf.Variable(value, dtype=tf.as_dtype(dtype), name=name)
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/ops/variables.py", line 199, in __init__
    expected_shape=expected_shape)
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/ops/variables.py", line 320, in _init_from_args
    validate_shape=validate_shape).op
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/ops/state_ops.py", line 274, in assign
    validate_shape=validate_shape)
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/ops/gen_state_ops.py", line 43, in assign
    use_locking=use_locking, name=name)
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 767, in apply_op
    op_def=op_def)
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2630, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1204, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[512]
     [[Node: dense_1/bias/Assign = Assign[T=DT_FLOAT, _class=["loc:@dense_1/bias"], use_locking=true, validate_shape=true, _device="/job:localhost/replica:0/task:0/gpu:0"](dense_1/bias, dense_1/Const)]]

And just to be sure: I'm not using an SLI bridge. I shouldn't need one, right?

When I run nvidia-smi topo -m I see that our CPU affinity is a bit different. Does it matter?

There also seems to be a problem with p2pBandwidthLatencyTest; isn't that a concern?

Do you have any other suggestions? I have a big Keras program that I really want to run on multiple GPUs...

avolkov1 commented 6 years ago

@ErlendFax No, you don't need SLI. I think the CPU affinity is fine. Yeah, maybe the p2pBandwidthLatencyTest is revealing some issue. One more suggestion: try running cifar10_multi_gpu_train.py. I have it in my repo; it's a copy of Google's cifar10_multi_gpu_train.py example. Run it like so:

python cifar10_multi_gpu_train.py --num_gpus=2

And try it with different numbers of GPUs. If you have issues with that example too, then maybe it's something related to your hardware/software. I hate to suggest this, but you could try a newer driver, 387.12: http://www.nvidia.com/download/driverResults.aspx/125399/en-us

This is a CUDA driver as well: http://us.download.nvidia.com/XFree86/Linux-x86_64/387.12/README/installedcomponents.html

Installing Nvidia drivers on Linux is a pain, but if cifar10_multi_gpu_train.py doesn't run for you on multiple GPUs, maybe try a newer driver. Let me know if cifar10_multi_gpu_train.py works with multiple GPUs.

ErlendFax commented 6 years ago

@avolkov1 I ran cifar10_multi_gpu_train.py and I think it worked. Examples/sec increased with the number of GPUs. Sec/batch increased from one GPU to multiple, but did not increase further from two to three or four. Loss is decreasing.

I didn't wait for the sessions to finish, but I think it worked.

I took a look at this script and compared it to the others that didn't work. Any suggestions as to why this one works?

Anyway, I will have a deeper look into cifar10_multi_gpu_train.py and use it as a reference when building my own.

I might also update the drivers.

Thank you once again! :+1:

majiali1995 commented 6 years ago

I also experienced this problem yesterday. When I increased batch_size, multi-GPU became faster than a single GPU. Maybe it's because increasing batch_size makes the GPU computation per step larger, while the communication cost between CPU and GPU doesn't change.
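
A rough back-of-the-envelope model of that argument (all numbers below are hypothetical, just to illustrate the amortization effect):

# Toy cost model: per-step time ~ compute(batch)/n_gpus + fixed transfer cost.
def step_time(batch_size, n_gpus, compute_per_sample=1e-4, comm_overhead=0.05):
    return batch_size * compute_per_sample / n_gpus + comm_overhead

for batch in (64, 1024):
    t1, t4 = step_time(batch, 1), step_time(batch, 4)
    print('batch=%d  1 GPU: %.3fs/step  4 GPUs: %.3fs/step  speedup: %.2fx'
          % (batch, t1, t4, t1 / t4))
# With the small batch the fixed cost dominates (speedup ~1.1x); with the
# large batch compute dominates and 4 GPUs get closer to a real speedup (~2x).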