aws / sagemaker-python-sdk

A library for training and deploying machine learning models on Amazon SageMaker
https://sagemaker.readthedocs.io/
Apache License 2.0

SageMaker fails when using Multi-GPU with keras.utils.multi_gpu_model #512

Closed loretoparisi closed 5 years ago

loretoparisi commented 5 years ago

Please fill out the form below.

System Information

Describe the problem

When running a custom model on AWS SageMaker, the TrainingJob fails with an Algorithm Error when Keras with a TensorFlow backend is used in a multi-GPU configuration.

Minimal repro / logs

def setup_multi_gpu(model):
    import tensorflow as tf
    from keras.utils.training_utils import multi_gpu_model
    from tensorflow.python.client import device_lib

    # IMPORTANT: tell TensorFlow not to pre-allocate all GPU memory
    from keras.backend.tensorflow_backend import set_session

    config = tf.ConfigProto()
    config.gpu_options.allow_growth = True  # dynamically grow the memory used on the GPU
    sess = tf.Session(config=config)
    set_session(sess)  # set this TensorFlow session as the default session for Keras

    print('reading gpus available..')
    local_device_protos = device_lib.list_local_devices()
    avail_gpus = [x.name for x in local_device_protos if x.device_type == 'GPU']
    num_gpu = len(avail_gpus)
    print('Number of GPUs available: %s' % num_gpu)

    # replicate the model on all detected GPUs
    multi_model = multi_gpu_model(model, gpus=num_gpu)

    return multi_model

_model = create_model()
model = setup_multi_gpu(_model)
model.compile(params)
model.fit(params)

Loading the model in parallel this way fails, and there is no further error or exception in the CloudWatch logs. The same configuration works properly on a local machine with 2x NVIDIA GTX 1080 and the same Keras/TensorFlow backend.

According to the SageMaker documentation and tutorials, the multi_gpu_model utility works when the Keras backend is MXNet, but I could not find any mention of the same multi-GPU setup with a TensorFlow backend.

Below is some of the logging emitted before the TrainingJob hangs. This logging repeats twice:

2018-11-27 10:02:49.878414: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0, 1, 2, 3
2018-11-27 10:02:49.878462: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-11-27 10:02:49.878471: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0 1 2 3
2018-11-27 10:02:49.878477: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N Y Y Y
2018-11-27 10:02:49.878481: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 1: Y N Y Y
2018-11-27 10:02:49.878486: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 2: Y Y N Y
2018-11-27 10:02:49.878492: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 3: Y Y Y N
2018-11-27 10:02:49.879340: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/device:GPU:0 with 14874 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:1b.0, compute capability: 7.0)
2018-11-27 10:02:49.879486: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/device:GPU:1 with 14874 MB memory) -> physical GPU (device: 1, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:1c.0, compute capability: 7.0)
2018-11-27 10:02:49.879694: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/device:GPU:2 with 14874 MB memory) -> physical GPU (device: 2, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:1d.0, compute capability: 7.0)
2018-11-27 10:02:49.879872: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/device:GPU:3 with 14874 MB memory) -> physical GPU (device: 3, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:1e.0, compute capability: 7.0)

Before that, there is some logging info about each GPU, repeated 4 times (once per device):

2018-11-27 10:02:46.447639: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 3 with properties:
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:00:1e.0
totalMemory: 15.78GiB freeMemory: 15.37GiB

According to the logging, all 4 GPUs are visible and loaded by the TensorFlow backend of Keras. After that, no application logging follows; the TrainingJob status stays InProgress for a while and then becomes Failed with the same Algorithm Error.

(screenshot of the TrainingJob status, 2018-11-27 11:15)

Looking at the CloudWatch metrics I can see some activity. Specifically, GPU Memory Utilization and CPU Utilization look fine, while GPU Utilization is 0%.


I have posted the question to the forum here.

yangaws commented 5 years ago

Hi @loretoparisi ,

Thanks for reporting this with good details.

First I need to confirm one thing: how do you use SageMaker with the code above? In particular, how do you use TensorFlow 1.12.0 with SageMaker? Currently the latest SageMaker TensorFlow containers are built with TensorFlow 1.11.0. See the README here: https://github.com/aws/sagemaker-python-sdk/blob/master/README.rst#tensorflow-sagemaker-estimators
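For reference, here is a minimal sketch (not from the thread) of pinning the prebuilt SageMaker TensorFlow container to a supported framework version, assuming the v1 Python SDK estimator API; the entry point script, role ARN and S3 path are placeholders:

from sagemaker.tensorflow import TensorFlow

estimator = TensorFlow(
    entry_point='train.py',               # hypothetical training script
    role='arn:aws:iam::111122223333:role/SageMakerRole',  # placeholder role
    train_instance_count=1,
    train_instance_type='ml.p3.8xlarge',  # 4x Tesla V100, matching the logs above
    framework_version='1.11.0',           # latest prebuilt TF container at the time
    py_version='py3',
)
estimator.fit('s3://my-bucket/training-data')  # placeholder S3 input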

loretoparisi commented 5 years ago

@yangaws hello, our code is in a Docker container that runs as a SageMaker TrainingJob. The Docker image is based on tensorflow/tensorflow:latest-gpu, so the TensorFlow version was 1.12.0 (GPU build), while Keras is 2.1.6 (not 2.2.0 as mentioned above, my fault).
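For context, a bring-your-own-container job like this is typically launched with the generic Estimator rather than a framework estimator; a minimal sketch assuming the v1 SDK API, with the ECR image URI, role ARN and S3 path as placeholders:

from sagemaker.estimator import Estimator

estimator = Estimator(
    image_name='111122223333.dkr.ecr.us-east-1.amazonaws.com/my-tf-gpu:latest',  # placeholder ECR image
    role='arn:aws:iam::111122223333:role/SageMakerRole',  # placeholder role
    train_instance_count=1,
    train_instance_type='ml.p3.8xlarge',  # 4x Tesla V100, as in the logs above
)
estimator.fit('s3://my-bucket/training-data')  # placeholder S3 input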

The main problem here is that we do not get any logging from the TrainingJob before it hangs. We have tried different approaches; the first was adapted from this Stack Overflow question: SageMaker fails when using Multi-GPU with keras.utils.multi_gpu_model

The latter approach overrides the multi_gpu_model method, since the problem seems to be related to the current implementation within Keras, which causes an issue when slicing the data across the GPU devices, specifically due to this import: https://github.com/keras-team/keras/commit/d059890d0342955e968fdf97b5a90d19c9d68b4e

See for more details https://github.com/keras-team/keras/issues/8123#issuecomment-354857044
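One workaround direction along these lines (a sketch, not necessarily the exact override from the linked comment) is to bypass keras.utils entirely and use the copy of multi_gpu_model bundled with TensorFlow, which avoids the problematic import; this assumes the model is built with tf.keras layers rather than standalone Keras:

import tensorflow as tf

def setup_multi_gpu_tf_keras(model, num_gpu):
    # Replicates the model on each visible GPU and slices every input batch
    # across the replicas; by default the replica outputs are merged on the CPU.
    return tf.keras.utils.multi_gpu_model(model, gpus=num_gpu)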

loretoparisi commented 5 years ago

[UPDATE] To be sure we were not missing any exception or error, we overrode sys.excepthook so that we could capture every uncaught runtime error:

import sys
import warnings

warnings.filterwarnings("ignore", category=FutureWarning)

def trapUncaughtException(exctype, value, tb):
    # print any uncaught exception so it shows up in the CloudWatch logs
    print('My Error Information')
    print('Type:', exctype)
    print('Value:', value)
    print('Traceback:', tb)

def installUncaughtException(handler):
    sys.excepthook = handler

installUncaughtException(trapUncaughtException)

Despite this, it seems there is no logging in CloudWatch that could point us toward a possible solution.

cavdard commented 5 years ago

Could you provide some information about the model? What kind of model is this?

loretoparisi commented 5 years ago

@cavdard it's basically a variation of this CNN https://github.com/keunwoochoi/music-auto_tagging-keras/tree/master/compact_cnn

loretoparisi commented 5 years ago

@cavdard this may help. I have slightly modified the tf.Session code, adding some initializers:

import tensorflow as tf
from keras import backend as K

with tf.Session() as session:
    K.set_session(session)
    session.run(tf.global_variables_initializer())
    session.run(tf.tables_initializer())

and now at least I can see from the instance metrics that one GPU (I assume gpu:0) is being used 👍 I will investigate whether this also helps make the multi-GPU setup work (although I can see the GPUs loaded in TensorFlow, I cannot be sure that more than one GPU is in use at this time without detailed logging...). Hope this helps other devs.
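One way to confirm whether more than one GPU is actually busy is to log per-GPU utilization from inside the training script; a minimal sketch (not from the thread), assuming nvidia-smi is available in the container as it is in the tensorflow/tensorflow:latest-gpu image:

import subprocess

def log_gpu_utilization():
    # One line per GPU, e.g. "0, 87 %, 9536 MiB"; printing sends it to CloudWatch.
    out = subprocess.check_output(
        ['nvidia-smi',
         '--query-gpu=index,utilization.gpu,memory.used',
         '--format=csv,noheader']
    ).decode('utf-8')
    print('GPU utilization:\n%s' % out)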

ChoiByungWook commented 5 years ago

I apologize for the frustrating experience and delayed response.

ChoiByungWook commented 5 years ago

For reference to others with the same issue:

https://forums.aws.amazon.com/thread.jspa?messageID=881541
https://forums.aws.amazon.com/thread.jspa?messageID=881540
https://stackoverflow.com/questions/53488870/sagemaker-fails-when-using-multi-gpu-with-keras-utils-multi-gpu-model/53754450#53754450

The discussion will continue in the Stack Overflow post.

loretoparisi commented 5 years ago

@ChoiByungWook Okay, I'm closing the issue here then; let's continue on Stack Overflow.