Open ErlendFax opened 6 years ago
@ErlendFax My hunch is that something about the multi-GPU hardware configuration on your system is not set up right. Could you run the following command to share your GPU layout: nvidia-smi topo -m
and paste the output. Mine looks like this:
$ nvidia-smi topo -m
        GPU0  GPU1  GPU2  GPU3  mlx5_0  CPU Affinity
GPU0     X    SOC   SOC   SOC   SOC     0-15
GPU1    SOC    X    PHB   PHB   PHB     16-31
GPU2    SOC   PHB    X    PIX   PHB     16-31
GPU3    SOC   PHB   PIX    X    PHB     16-31
mlx5_0  SOC   PHB   PHB   PHB    X
Legend:
X = Self
SOC = Connection traversing PCIe as well as the SMP link between CPU sockets(e.g. QPI)
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe switches (without traversing the PCIe Host Bridge)
PIX = Connection traversing a single PCIe switch
NV# = Connection traversing a bonded set of # NVLinks
Try running on each GPU individually and verify that it works fine on the other GPUs as well:
CUDA_VISIBLE_DEVICES=1 python mnist_tfrecord_mgpu.py
CUDA_VISIBLE_DEVICES=2 python mnist_tfrecord_mgpu.py
CUDA_VISIBLE_DEVICES=3 python mnist_tfrecord_mgpu.py
Try using 2 GPUs and vary which pairs you're using. You should observe some speedup:
CUDA_VISIBLE_DEVICES=0,1 python mnist_tfrecord_mgpu.py
CUDA_VISIBLE_DEVICES=0,2 python mnist_tfrecord_mgpu.py
CUDA_VISIBLE_DEVICES=1,2 python mnist_tfrecord_mgpu.py
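The single-GPU and pairwise runs above can also be scripted. This is only a minimal sketch, assuming four GPUs indexed 0 to 3 and mnist_tfrecord_mgpu.py in the current directory; the helper names device_sets and run_on are made up for illustration and are not part of the repo.

```python
import itertools
import os
import subprocess
import sys
import time


def device_sets(num_gpus, group_size):
    """Return CUDA_VISIBLE_DEVICES strings for every group of `group_size` GPUs."""
    return [",".join(str(i) for i in combo)
            for combo in itertools.combinations(range(num_gpus), group_size)]


def run_on(devices, script="mnist_tfrecord_mgpu.py"):
    """Run `script` with only `devices` visible; return wall-clock seconds."""
    # Copy the current environment and restrict the GPUs the child can see.
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=devices)
    start = time.time()
    subprocess.check_call([sys.executable, script], env=env)
    return time.time() - start


# Example use (assumption: 4 GPUs), first each GPU alone, then every pair:
#   for devices in device_sets(4, 1) + device_sets(4, 2):
#       print("%s: %.1f s" % (devices, run_on(devices)))
```

Comparing the timings per pair against the topology matrix should show whether only certain pairs are slow.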
I'm running on Tesla GPUs and I don't have access to a workstation with multiple GeForce GPUs, so it's hard for me to test. But see this link for an example of a GeForce setup and its issues:
https://github.com/rossumai/keras-multi-gpu/blob/master/blog/docs/hardware.md
Maybe @bzamecnik could offer some advice because he also ran these on GeForce multi-GPU workstations. Also, it would help if you could post the results of these CUDA SDK utilities:
bandwidthTest - pure unidirectional bandwidth on single GPU
p2pBandwidthLatencyTest - combination of uni/bi-direction communication between pairs
I'm using the docker container tensorflow/tensorflow:1.3.0-devel-gpu via nvidia-docker. My results on a 4-GPU P40 machine:
# CUDA_VISIBLE_DEVICES=0 python ./examples/mnist/mnist_tfrecord_mgpu.py
Epoch 1/5
429/429 [==============================] - 5s - loss: 0.2050 - acc: 0.9378
Epoch 2/5
429/429 [==============================] - 3s - loss: 0.0768 - acc: 0.9773
Epoch 3/5
429/429 [==============================] - 3s - loss: 0.0629 - acc: 0.9820
Epoch 4/5
429/429 [==============================] - 3s - loss: 0.0528 - acc: 0.9845
Epoch 5/5
429/429 [==============================] - 3s - loss: 0.0477 - acc: 0.9863
[TRAINING] finished in 19975 ms
# CUDA_VISIBLE_DEVICES=0,1 python ./examples/mnist/mnist_tfrecord_mgpu.py
Epoch 1/5
214/214 [==============================] - 3s - loss: 0.2845 - acc: 0.9157
Epoch 2/5
214/214 [==============================] - 2s - loss: 0.0795 - acc: 0.9761
Epoch 3/5
214/214 [==============================] - 2s - loss: 0.0601 - acc: 0.9819
Epoch 4/5
214/214 [==============================] - 2s - loss: 0.0523 - acc: 0.9846
Epoch 5/5
214/214 [==============================] - 2s - loss: 0.0475 - acc: 0.9863
[TRAINING] finished in 12697 ms
# CUDA_VISIBLE_DEVICES=0,1,2,3 python ./examples/mnist/mnist_tfrecord_mgpu.py
Epoch 1/5
107/107 [==============================] - 4s - loss: 0.5218 - acc: 0.8619
Epoch 2/5
107/107 [==============================] - 1s - loss: 0.1063 - acc: 0.9684
Epoch 3/5
107/107 [==============================] - 1s - loss: 0.0822 - acc: 0.9758
Epoch 4/5
107/107 [==============================] - 1s - loss: 0.0697 - acc: 0.9795
Epoch 5/5
107/107 [==============================] - 1s - loss: 0.0577 - acc: 0.9829
[TRAINING] finished in 10457 ms
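To put numbers on the scaling in the three runs above, here is a small sketch that computes speedup and parallel efficiency from the reported [TRAINING] times. Only the three times come from the logs; the function names are mine. Note that these totals include the slower first epoch, so they understate the steady-state scaling.

```python
# Wall-clock training times reported above, in milliseconds, keyed by GPU count.
times_ms = {1: 19975, 2: 12697, 4: 10457}


def speedup(times, n):
    """Speedup of the n-GPU run relative to the single-GPU run."""
    return times[1] / float(times[n])


def efficiency(times, n):
    """Fraction of ideal linear scaling achieved with n GPUs."""
    return speedup(times, n) / n


for n in sorted(times_ms):
    print("%d GPU(s): speedup %.2fx, efficiency %.0f%%"
          % (n, speedup(times_ms, n), 100 * efficiency(times_ms, n)))
```

That works out to roughly 1.57x on two GPUs and 1.91x on four, which is far from linear, but MNIST is a tiny workload.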
Some GPU bandwidth measurement results are at: https://github.com/rossumai/keras-multi-gpu/tree/master/experiments/gpu_bandwidth/cuda_samples.
Btw: we released a summary of the github articles Towards Efficient Multi-GPU Training in Keras with TensorFlow at Medium.
Thank you for your response @avolkov1 and @bzamecnik! Highly appreciated.
nvidia-smi topo -m
gives:
        GPU0  GPU1  GPU2  GPU3  CPU Affinity
GPU0     X    PHB   SOC   SOC   0-15
GPU1    PHB    X    SOC   SOC   0-15
GPU2    SOC   SOC    X    PHB   0-15
GPU3    SOC   SOC   PHB    X    0-15
Legend:
X = Self
SOC = Connection traversing PCIe as well as the SMP link between CPU sockets(e.g. QPI)
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe switches (without traversing the PCIe Host Bridge)
PIX = Connection traversing a single PCIe switch
NV# = Connection traversing a bonded set of # NVLinks
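For reading the matrix above programmatically, here is a rough sketch that parses the GPU rows of the nvidia-smi topo -m output and lists the GPU pairs whose traffic crosses the CPU socket link (SOC). The parser is my own illustration and assumes the whitespace-separated layout shown in this thread; other driver versions may print different labels.

```python
TOPO = """
GPU0 GPU1 GPU2 GPU3 CPU Affinity
GPU0 X PHB SOC SOC 0-15
GPU1 PHB X SOC SOC 0-15
GPU2 SOC SOC X PHB 0-15
GPU3 SOC SOC PHB X 0-15
"""


def parse_topo(text):
    """Parse the GPU rows of `nvidia-smi topo -m` into {(gpu_a, gpu_b): link}."""
    lines = [ln.split() for ln in text.strip().splitlines()]
    # Keep only the GPU columns of the header (drops "CPU Affinity" etc.).
    header = [name for name in lines[0] if name.startswith("GPU")]
    links = {}
    for row in lines[1:]:
        if not row or not row[0].startswith("GPU"):
            continue  # skip NICs, legend lines, blanks
        for col, link in zip(header, row[1:]):
            links[(row[0], col)] = link
    return links


def cross_socket_pairs(links):
    """Return GPU pairs connected via the inter-socket (SOC) path."""
    return sorted((a, b) for (a, b), link in links.items()
                  if a < b and link == "SOC")


print(cross_socket_pairs(parse_topo(TOPO)))
```

For this machine only GPU0/GPU1 and GPU2/GPU3 share a host bridge; every other pair goes across sockets.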
@bzamecnik, I tried to run your measure_gpu_bandwidth.sh script, but it froze at p2p.
So instead I ran ./p2pBandwidthLatencyTest directly, which gave me the following (it froze at "Unidirectional P2P=Enabled Bandwidth Matrix (GB/s)"):
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, GeForce GTX 1080 Ti, pciBusID: 8, pciDeviceID: 0, pciDomainID:0
Device: 1, GeForce GTX 1080 Ti, pciBusID: 9, pciDeviceID: 0, pciDomainID:0
Device: 2, GeForce GTX 1080 Ti, pciBusID: 42, pciDeviceID: 0, pciDomainID:0
Device: 3, GeForce GTX 1080 Ti, pciBusID: 43, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=0 CAN Access Peer Device=2
Device=0 CAN Access Peer Device=3
Device=1 CAN Access Peer Device=0
Device=1 CAN Access Peer Device=2
Device=1 CAN Access Peer Device=3
Device=2 CAN Access Peer Device=0
Device=2 CAN Access Peer Device=1
Device=2 CAN Access Peer Device=3
Device=3 CAN Access Peer Device=0
Device=3 CAN Access Peer Device=1
Device=3 CAN Access Peer Device=2
***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) in those cases.
P2P Connectivity Matrix
   D\D       0       1       2       3
     0       1       1       1       1
     1       1       1       1       1
     2       1       1       1       1
     3       1       1       1       1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D       0       1       2       3
     0  351.91    5.65    5.65    5.65
     1    5.16  351.91    5.72    5.73
     2    5.66    5.66  352.55    5.64
     3    5.73    5.73    5.15  350.97
Unidirectional P2P=Enabled Bandwidth Matrix (GB/s)
When I run ./bandwidthTest 0 1 2 3 (only one of the devices appears, is that a problem?):
[CUDA Bandwidth Test] - Starting...
Running on...
Device 0: GeForce GTX 1080 Ti
Quick Mode
Host to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 6515.8
Device to Host Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 5801.7
Device to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 345280.5
Result = PASS
I'm able to run mnist_tfrecord_mgpu.py separately on each of the GPUs.
For example, CUDA_VISIBLE_DEVICES=2 python mnist_tfrecord_mgpu.py returns:
Epoch 1/5
1/429 [..............................] - ETA: 9:15 - loss: 2.3242 - acc: 0.085
8/429 [..............................] - ETA: 1:10 - loss: 1.7712 - acc: 0.408
16/429 [>.............................] - ETA: 36s - loss: 1.2759 - acc: 0.5869
429/429 [==============================] - ETA: 0s - loss: 0.2036 - acc: 0.9387
Epoch 2/5
429/429 [==============================] - ETA: 0s - loss: 0.0767 - acc: 0.9775
Epoch 3/5
429/429 [==============================] - ETA: 0s - loss: 0.0618 - acc: 0.9818
Epoch 4/5
429/429 [==============================] - ETA: 0s - loss: 0.0531 - acc: 0.9850
Epoch 5/5
429/429 [==============================] - ETA: 0s - loss: 0.0469 - acc: 0.9866
[TRAINING] finished in 15443 ms
Also, I tried running the GPUs in pairs. It was a bit faster than with four, but the accuracy was still bad:
26/214 [==>...........................] - ETA: 3:15 - loss: 2.3124 - acc: 0.095
Do you think it's a hardware configuration problem?
@ErlendFax Not sure. Are you running the latest version of mnist_tfrecord_mgpu.py? The name is a misnomer, by the way, because it's not actually using TFRecords to load data. Instead it's loading from numpy arrays. Snippet from that code:
from tensorflow.contrib.learn.python.learn.datasets import mnist
data = mnist.load_mnist()  # load numpy arrays
x_train_batch, y_train_batch = tf.train.shuffle_batch(
    tensors=[data.train.images, data.train.labels.astype(np.int32)],
    batch_size=batch_size,
    capacity=capacity,
    min_after_dequeue=min_after_dequeue,
    enqueue_many=enqueue_many,
    num_threads=8)
Could you run the cifar10 example? Does that work? Is it any faster with multiple GPUs, and does the accuracy make sense?
CUDA_VISIBLE_DEVICES=0 python ./examples/cifar/cifar10_cnn_mgpu.py --mgpu --epochs=5 # 1 GPU
CUDA_VISIBLE_DEVICES=0,1 python ./examples/cifar/cifar10_cnn_mgpu.py --mgpu --epochs=5 # 2 GPUs
I have a corresponding tfqueue-based example with Cifar10: cifar10_cnn_mgpu_tfqueue.py. Maybe try that one too. Let me know if there are issues with the Cifar10 examples.
@avolkov1 Since data.train.images is a numpy array and tf.train.shuffle_batch() takes tensors as input, it's possible that it bakes the whole numpy array into the graph as a constant. For input from a Python generator there seems to be the tf.contrib.training.python_input() function. However, since TF 1.4 it's superseded by Dataset.from_generator() in the Dataset API (not available in TF 1.3 yet).
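To illustrate the generator-based alternative, here is a plain-Python sketch of the kind of generator that could feed batches one at a time instead of embedding the full array in the graph. It deliberately uses no TensorFlow calls; the name batch_generator and the toy data are hypothetical, and wiring it into Dataset.from_generator() on TF 1.4+ is not shown.

```python
import random


def batch_generator(images, labels, batch_size, seed=None):
    """Yield shuffled (images, labels) mini-batches forever, one at a time."""
    rng = random.Random(seed)
    indices = list(range(len(images)))
    while True:
        rng.shuffle(indices)  # new order on every pass over the data
        # Drop the last incomplete batch so every batch has the same size.
        for start in range(0, len(indices) - batch_size + 1, batch_size):
            batch = indices[start:start + batch_size]
            yield [images[i] for i in batch], [labels[i] for i in batch]


# Toy stand-in data (hypothetical): 10 "images" whose pixels equal their label.
images = [[float(i)] * 4 for i in range(10)]
labels = list(range(10))

gen = batch_generator(images, labels, batch_size=4, seed=0)
x, y = next(gen)
print(len(x), y)
```

Only one batch is materialized at a time, so the graph itself stays small regardless of the dataset size.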
Another proof-of-concept step: https://gist.github.com/bzamecnik/3c2b5279a5949d694421d7cfbe813557
Inputs to the StagingArea are provided via Variable assignment, which can be decoupled from Keras' Session.run(). The pipelining logic is encapsulated in a callback. It's able to slice an input numpy array into batches.
It still relies on feed_dict, but we don't have to hack the Keras inputs. It still needs to take care of the validation set, prediction, etc., and it also assumes one feature input and one label output.
@bzamecnik Nice! You probably meant to post the comment to issue #2. Please re-post there if you don't mind. Thanks. I've been playing with Horovod, and it's good if you require synchronization of gradients, but asynchronous training is faster. Getting StagingArea working with multigpu support would be awesome.
Aah, yeah. It's after midnight and my head is a bit tired. Comment re-posted. :) Good to hear about trying Horovod. I'd be glad to hear about your experience.
@avolkov1 I did as you said and checked whether mnist_tfrecord_mgpu.py was up to date, and it was. When I run either of the two Cifar10 programs (e.g. CUDA_VISIBLE_DEVICES=0,1 python ./examples/cifar/cifar10_cnn_mgpu.py --mgpu --epochs=5) I get the same issue, while a single GPU still works perfectly.
I tried a few other Keras / TF multi GPU APIs and the same issue shows up (make_parallel(), training_utils.py).
Annoying...
After multi-GPU program attempts I have to restart the terminal window before I can run anything else on the GPUs. If I don't, I get the error below. Is this normal?
2017-10-27 10:31:55.600499: W tensorflow/core/common_runtime/bfc_allocator.cc:277] *_*_************************************************************************************xxxxxxxxxxxx
2017-10-27 10:31:55.600524: W tensorflow/core/framework/op_kernel.cc:1192] Resource exhausted: OOM when allocating tensor with shape[2304,512]
Traceback (most recent call last):
File "cifar10_cnn_mgpu_tfqueue.py", line 378, in <module>
main()
File "cifar10_cnn_mgpu_tfqueue.py", line 314, in main
target_tensors=[y_train_batch])
File "/home/kyb/Documents/keras_experiments/keras_exp/multigpu/_multigpu.py", line 390, in compile
self._run_initsync()
File "/home/kyb/Documents/keras_experiments/keras_exp/multigpu/_multigpu.py", line 413, in _run_initsync
sess.run(init_op)
File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 895, in run
run_metadata_ptr)
File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1124, in _run
feed_dict_tensor, options, run_metadata)
File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1321, in _do_run
options, run_metadata)
File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1340, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[512]
[[Node: dense_1/bias/Assign = Assign[T=DT_FLOAT, _class=["loc:@dense_1/bias"], use_locking=true, validate_shape=true, _device="/job:localhost/replica:0/task:0/gpu:0"](dense_1/bias, dense_1/Const)]]
Caused by op u'dense_1/bias/Assign', defined at:
File "cifar10_cnn_mgpu_tfqueue.py", line 378, in <module>
main()
File "cifar10_cnn_mgpu_tfqueue.py", line 297, in main
filepath if checkpt_flag else None)
File "cifar10_cnn_mgpu_tfqueue.py", line 99, in make_model
model.add(KL.Dense(512))
File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/keras/models.py", line 475, in add
output_tensor = layer(self.outputs[0])
File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/keras/engine/topology.py", line 576, in __call__
self.build(input_shapes[0])
File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/keras/layers/core.py", line 836, in build
constraint=self.bias_constraint)
File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/keras/legacy/interfaces.py", line 87, in wrapper
return func(*args, **kwargs)
File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/keras/engine/topology.py", line 400, in add_weight
constraint=constraint)
File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/keras/backend/tensorflow_backend.py", line 379, in variable
v = tf.Variable(value, dtype=tf.as_dtype(dtype), name=name)
File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/ops/variables.py", line 199, in __init__
expected_shape=expected_shape)
File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/ops/variables.py", line 320, in _init_from_args
validate_shape=validate_shape).op
File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/ops/state_ops.py", line 274, in assign
validate_shape=validate_shape)
File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/ops/gen_state_ops.py", line 43, in assign
use_locking=use_locking, name=name)
File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 767, in apply_op
op_def=op_def)
File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2630, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1204, in __init__
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access
ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[512]
[[Node: dense_1/bias/Assign = Assign[T=DT_FLOAT, _class=["loc:@dense_1/bias"], use_locking=true, validate_shape=true, _device="/job:localhost/replica:0/task:0/gpu:0"](dense_1/bias, dense_1/Const)]]
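One possible explanation for the OOM above (an assumption, not something confirmed in this thread) is that a process from the earlier multi-GPU run is still alive and holding GPU memory. A minimal sketch for checking that, using nvidia-smi's --query-compute-apps option; the helper names are made up for illustration and the sample row in the comment is invented.

```python
import subprocess

QUERY = ["nvidia-smi",
         "--query-compute-apps=pid,process_name,used_memory",
         "--format=csv,noheader"]


def parse_compute_apps(csv_text):
    """Parse rows like '12345, python, 10873 MiB' into a list of dicts."""
    apps = []
    for line in csv_text.strip().splitlines():
        if not line.strip():
            continue
        # pid is the first field and memory the last; the name sits between.
        pid, rest = line.split(",", 1)
        name, mem = rest.rsplit(",", 1)
        apps.append({"pid": int(pid), "name": name.strip(),
                     "used_memory": mem.strip()})
    return apps


def list_compute_apps():
    """Ask nvidia-smi which processes currently hold GPU memory."""
    out = subprocess.check_output(QUERY)
    return parse_compute_apps(out.decode("utf-8"))


# Example use on a machine with the NVIDIA driver installed:
#   for app in list_compute_apps():
#       print(app["pid"], app["name"], app["used_memory"])
```

If a stale python process shows up in that list, stopping it should free the memory without restarting the terminal.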
And just to be sure, I'm not using an SLI bridge. I shouldn't need one, right?
When I run nvidia-smi topo -m I see that our CPU Affinity is a bit different. Does it matter?
It seems like there is a problem with p2pBandwidthLatencyTest. Isn't that a problem?
Do you have any other suggestions? I have a big Keras program I really want to run on multiple GPUs...
@ErlendFax No, you don't need SLI. I think the CPU Affinity is fine. Yeah, maybe p2pBandwidthLatencyTest is revealing some issue.
One more suggestion is to try running cifar10_multi_gpu_train.py. I have it in my repo. It's a copy of Google's cifar10_multi_gpu_train.py example. Run it like so:
python cifar10_multi_gpu_train.py --num_gpus=2
Try it with different numbers of GPUs. If you have issues with that example, then it may be something related to your hardware or software. I hate to suggest this, but you could try a newer driver, 387.12: http://www.nvidia.com/download/driverResults.aspx/125399/en-us
This is a CUDA driver as well: http://us.download.nvidia.com/XFree86/Linux-x86_64/387.12/README/installedcomponents.html
Installing NVIDIA drivers on Linux is a pain, but if cifar10_multi_gpu_train.py doesn't run for you on multiple GPUs, maybe try a newer driver. Let me know if cifar10_multi_gpu_train.py works with multiple GPUs.
@avolkov1 I ran cifar10_multi_gpu_train.py and I think it worked. Examples/Sec increased with the number of GPUs. Sec/Batch increased from 1 GPU to multiple, but did not increase from two to three or four. Loss is decreasing.
I didn't wait for the sessions to finish, but I think it worked.
I tried to take a look at this script and compare it to the others that didn't work. Any suggestions why this one worked?
Anyway, I will have a deeper look into cifar10_multi_gpu_train.py and use it as a reference when building my own.
I might also update the drivers.
Thank you once again! :+1:
I also experienced this problem yesterday. When I increased batch_size, multi-GPU became faster than single GPU. Maybe that's because increasing batch_size makes the GPU computational cost larger, while the communication cost between CPU and GPU doesn't change.
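A toy model makes that argument concrete. This is only a sketch under stated assumptions: compute time per step scales with batch size and divides evenly across GPUs, while multi-GPU runs pay a fixed per-step overhead. The constants (0.05 ms per sample, 10 ms overhead) are invented for illustration, not measured.

```python
def step_time(batch_size, n_gpus, per_sample_ms, overhead_ms):
    """Toy model: compute splits across GPUs, per-step overhead does not."""
    compute = per_sample_ms * batch_size / float(n_gpus)
    overhead = overhead_ms if n_gpus > 1 else 0.0
    return compute + overhead


def speedup(batch_size, n_gpus, per_sample_ms=0.05, overhead_ms=10.0):
    """Single-GPU step time divided by the n-GPU step time."""
    single = step_time(batch_size, 1, per_sample_ms, overhead_ms)
    multi = step_time(batch_size, n_gpus, per_sample_ms, overhead_ms)
    return single / multi


for bs in (128, 1024, 8192):
    print("batch %5d: 4-GPU speedup %.2fx" % (bs, speedup(bs, 4)))
```

With these made-up constants, four GPUs are slower than one at batch size 128 and only approach the ideal 4x as the batch grows, which matches the behavior described above.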
I'm having issues using multiple GPUs. When I use
CUDA_VISIBLE_DEVICES=0 python mnist_tfrecord_mgpu.py
everything works fine. When I use
CUDA_VISIBLE_DEVICES=0,1,2,3 python mnist_tfrecord_mgpu.py
everything seems normal, no printed errors, but training time has increased dramatically, and loss and accuracy barely change (if at all).
TensorFlow 1.3.0
Keras 2.0.8
Cuda 8.0.61
cuDNN 6.0.21
Anyone else experiencing similar behavior? Looks like it's not even using all four GPUs.
Console output when training with four GPUs:
(...)
nvidia-smi
: