anishathalye / neural-style

Neural style in TensorFlow! 🎨
https://anishathalye.com/an-ai-that-can-mimic-any-artist/
GNU General Public License v3.0

Not running effectively on Azure VM NC6 (56GiB RAM + 24 GiB GPU) #91

Closed taurenshaman closed 7 years ago

taurenshaman commented 7 years ago

I'm new to deep learning. I just want to test some ideas, so I ran the code on an Azure VM NC6 successfully (the NC6 is like an Instamatic to me ^_^). But I got some odd log output.
Before the log, here are the specs of the NC6 (GPU part: https://azure.microsoft.com/en-us/pricing/details/virtual-machines/linux/):
NC series: NVIDIA K80 GPU. Dual GPU, 4992 CUDA cores, 24 GB memory, double precision: 2.91 TFLOPS, single precision: 8.73 TFLOPS.
NC6: 6 cores + 56 GiB memory + 340 GiB disk + 1x K80. $0.90/hour.

I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcublas.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcudnn.so.5 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcufft.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcurand.so.8.0 locally
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE3 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties:
name: Tesla K80
major: 3 minor: 7 memoryClockRate (GHz) 0.8235
pciBusID 9909:00:00.0
Total memory: 11.17GiB
Free memory: 11.11GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0: Y

  1. My testing:
    Test image 1: 300x369, less than 1 second per iteration.
    Test image 2: 2960x5258, OOM in iteration 1.
    Then I scaled it down to 1480x2629: OOM in iteration 1.
    Again I scaled it down to 740x1315: it worked, less than 3 seconds per iteration.
    All of the above show the same log line: Total memory: 11.17GiB.
  2. The log shows that total memory is only 11 GiB. But on the NC6, RAM is 56 GiB and GPU memory is 24 GiB; neither of those is close to 11 GiB. I used the top command, and it showed that available memory is greater than about 54 GiB. So how can I use the NC6 VM more effectively? Is there some configuration?

Thank you very much!

anishathalye commented 7 years ago

RAM is not useful here; you need lots of GPU memory for this. Those numbers look about right for 12GB of GPU memory.

The K80 is actually 2 GPUs, with 12GB of GPU memory per GPU.
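
If you want to double-check what TensorFlow actually sees on the VM, something like this should work (just a quick sketch, not part of this repo); on an NC6 it should report a single Tesla K80 device with roughly 11-12 GiB:

```python
from tensorflow.python.client import device_lib

# List every device TensorFlow can see and its usable memory.
for device in device_lib.list_local_devices():
    if device.device_type == 'GPU':
        # memory_limit is reported in bytes
        print(device.name, '%.2f GiB' % (device.memory_limit / 2.0 ** 30))
```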

taurenshaman commented 7 years ago

Thanks for your reply. I searched and found https://www.tensorflow.org/tutorials/using_gpu. There is some code in the "Using multiple GPUs" section:

```python
import tensorflow as tf

# Creates a graph.
c = []
for d in ['/gpu:2', '/gpu:3']:
  with tf.device(d):
    a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3])
    b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2])
    c.append(tf.matmul(a, b))
with tf.device('/cpu:0'):
  total = tf.add_n(c)
# Creates a session with log_device_placement set to True.
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
# Runs the op.
print(sess.run(total))
```

Is there an automatic way to use all available GPUs?
Maybe there are two possible performance scenarios:

  1. gpus1 = ['/gpu:0', '/gpu:1', '/gpu:2'] # all GPUs have the same performance
  2. gpus2 = ['/gpu:0', '/gpu:1', '/gpu:2'] # GPU performance: gpu0 > gpu1 > gpu2

Then we could use gpus1 or gpus2 automatically, instead of hard-coding the names ('/gpu:0', '/gpu:1', '/gpu:2'). Something like the sketch below is what I have in mind.
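
This is just my own rough sketch (not code from this repo or the tutorial), reusing the toy computation above but enumerating whatever GPUs actually exist:

```python
import tensorflow as tf
from tensorflow.python.client import device_lib

def available_gpus():
    # Enumerate the GPU devices TensorFlow can see (names may look like
    # '/gpu:0' or '/device:GPU:0' depending on the TF version; both work
    # with tf.device).
    return [d.name for d in device_lib.list_local_devices()
            if d.device_type == 'GPU']

# Same toy computation as the tutorial above, but spread over all GPUs found.
c = []
for d in available_gpus():
    with tf.device(d):
        a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3])
        b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2])
        c.append(tf.matmul(a, b))
with tf.device('/cpu:0'):
    total = tf.add_n(c)

sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
print(sess.run(total))
```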

I also found an answer that refers to the code starting at line 170 of https://github.com/tensorflow/models/blob/master/tutorials/image/cifar10/cifar10_multi_gpu_train.py. Is that the most automatic way to use all GPUs?


My scenario is processing pictures taken on mobile phones at their original resolution, so I have to be prepared for 4K pictures. That means I need to work out the best performance I can get from the NC series:

  1. NC6: 6 cores + 56 GiB RAM + 340 GiB disk + 1x K80. $0.90/hour.
  2. NC12: 12 cores + 112 GiB RAM + 680 GiB disk + 2x K80. $1.80/hour.
  3. NC24: 24 cores + 224 GiB RAM + 1440 GiB disk + 4x K80. $3.60/hour.

If we ignore image resolution, just using gpu_id = iteration % gpu_num to assign each job to a GPU would be the simplest way (rough sketch below). But as I mentioned in the issue (testing log), if the image has a large resolution, OOM occurs in iteration 1.
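
A minimal sketch of that round-robin idea; `input_images` and `stylize_one_image` are just placeholders I made up for the real per-image work, not functions from this repo:

```python
import tensorflow as tf
from tensorflow.python.client import device_lib

gpus = [d.name for d in device_lib.list_local_devices()
        if d.device_type == 'GPU']

def device_for(job_index):
    # gpu_id = job_index % gpu_num
    return gpus[job_index % len(gpus)]

def stylize_one_image(image_path):
    # placeholder for building/running the stylization graph for one image
    pass

input_images = ['photo1.jpg', 'photo2.jpg', 'photo3.jpg']  # hypothetical inputs

for i, image_path in enumerate(input_images):
    # each image's graph is pinned to a GPU in round-robin order
    with tf.device(device_for(i)):
        stylize_one_image(image_path)
```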
Can you give me some advice?

Thank you very much!

anishathalye commented 7 years ago

It should be possible to split the model between two GPUs, though performance is probably not going to be great because you'll need data transfer between the parts on each forward and backward pass.

It would be cool if you're interested in implementing and benchmarking multi-GPU support.
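
Roughly what I mean is something like this (just a sketch, not anything implemented in neural-style; the conv blocks here are stand-ins for the VGG network the real code loads):

```python
import tensorflow as tf

def conv_block(x, filters):
    # Stand-in for a chunk of the VGG network; fixed random filters are used
    # only so this sketch is self-contained and runnable.
    w = tf.random_normal([3, 3, int(x.shape[-1]), filters], stddev=0.1)
    return tf.nn.relu(tf.nn.conv2d(x, w, strides=[1, 1, 1, 1], padding='SAME'))

# In neural-style the image itself is the variable being optimized.
image = tf.Variable(tf.random_normal([1, 256, 256, 3]))

with tf.device('/gpu:0'):
    lower = conv_block(conv_block(image, 64), 128)    # first half of the network
with tf.device('/gpu:1'):
    upper = conv_block(conv_block(lower, 256), 512)   # second half; activations cross GPUs here

loss = tf.reduce_mean(tf.square(upper))
train_op = tf.train.AdamOptimizer(1.0).minimize(loss)

sess = tf.Session(config=tf.ConfigProto(allow_soft_placement=True))
sess.run(tf.global_variables_initializer())
sess.run(train_op)  # each step ships `lower` from gpu:0 to gpu:1 and gradients back
```

The cross-device transfer of `lower` (and its gradient) every iteration is exactly the overhead I mentioned above.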

taurenshaman commented 7 years ago

So without code changes, there will be no improvement going from 1 K80 to 2 K80s. -_-||| Thank you very much!

anishathalye commented 7 years ago

Yup - I don't have the bandwidth to work on this right now, unfortunately.