Hi @loretoparisi ,
Thanks for reporting this with good details.
First I need to confirm one thing: how do you use SageMaker with the code above? In particular, how do you use TensorFlow 1.12.0 with SageMaker? Currently the latest SageMaker TensorFlow containers are built with TensorFlow 1.11.0. See the README here: https://github.com/aws/sagemaker-python-sdk/blob/master/README.rst#tensorflow-sagemaker-estimators
@yangaws hello, so our code is in a Docker container that runs as a SageMaker TrainingJob. The Docker image is built from tensorflow/tensorflow:latest-gpu, so the TensorFlow version was 1.12.0 (GPU build), while Keras is 2.1.6 (not 2.2.0 as mentioned above, my fault).
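For reference, a minimal way to make the exact framework versions visible in the TrainingJob's CloudWatch log stream is to print them at the top of the training entry point; this is a sketch only, not part of our actual training code:

```python
import sys
import keras
import tensorflow as tf

# Print the framework versions at the very start of the training script so they
# show up in the TrainingJob's CloudWatch log stream.
print('Python     : %s' % sys.version.split()[0])
print('TensorFlow : %s' % tf.__version__)
print('Keras      : %s' % keras.__version__)
```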
The main problem here is that we don't get any logging from the TrainingJob before it hangs. We have tried different approaches; the first was adapted from this Stack Overflow question: SageMaker fails when using Multi-GPU with keras.utils.multi_gpu_model.
The latter approach overrides the multi_gpu_model method, since the problem seems to be related to the current implementation in Keras, which causes an issue when slicing the data across the GPU devices, specifically due to this import: https://github.com/keras-team/keras/commit/d059890d0342955e968fdf97b5a90d19c9d68b4e
See for more details https://github.com/keras-team/keras/issues/8123#issuecomment-354857044
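Roughly, the override follows the pattern discussed in that issue: copy the multi_gpu_model implementation into your own module and keep `import tensorflow as tf` at module scope instead of inside the function. Below is a sketch adapted from the Keras 2.1.x implementation; the exact code we run may differ slightly:

```python
import tensorflow as tf  # kept at module scope, which is the point of the workaround
from keras.layers import Lambda, concatenate
from keras.models import Model


def multi_gpu_model(model, gpus):
    """Replicate `model` on `gpus` GPUs, splitting each batch between them."""
    if isinstance(gpus, (list, tuple)):
        num_gpus = len(gpus)
        target_gpu_ids = gpus
    else:
        num_gpus = gpus
        target_gpu_ids = range(gpus)

    def get_slice(data, i, parts):
        # Give replica i its share of the batch dimension.
        shape = tf.shape(data)
        batch_size = shape[:1]
        input_shape = shape[1:]
        step = batch_size // parts
        size = batch_size - step * i if i == num_gpus - 1 else step
        size = tf.concat([size, input_shape], axis=0)
        stride = tf.concat([step, input_shape * 0], axis=0)
        return tf.slice(data, stride * i, size)

    all_outputs = [[] for _ in model.outputs]

    # Place one replica of the model on each GPU, fed with a slice of the batch.
    for i, gpu_id in enumerate(target_gpu_ids):
        with tf.device('/gpu:%d' % gpu_id):
            with tf.name_scope('replica_%d' % gpu_id):
                inputs = []
                for x in model.inputs:
                    input_shape = tuple(x.get_shape().as_list())[1:]
                    slice_i = Lambda(get_slice,
                                     output_shape=input_shape,
                                     arguments={'i': i, 'parts': num_gpus})(x)
                    inputs.append(slice_i)
                outputs = model(inputs)
                if not isinstance(outputs, list):
                    outputs = [outputs]
                for o, out in enumerate(outputs):
                    all_outputs[o].append(out)

    # Merge the per-replica outputs back together on the CPU.
    with tf.device('/cpu:0'):
        merged = [concatenate(outs, axis=0, name=name)
                  for name, outs in zip(model.output_names, all_outputs)]
        return Model(model.inputs, merged)
```

Calling this local multi_gpu_model(model, gpus=4) instead of keras.utils.multi_gpu_model is essentially what the Stack Overflow answer linked above suggests.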
[UPDATE]
To be sure about the exceptions and errors, we overrode sys.excepthook so that we could capture every runtime error:
import sys
import warnings

warnings.filterwarnings("ignore", category=FutureWarning)

def trapUncaughtException(exctype, value, tb):
    # Print whatever we can about the uncaught exception so it reaches CloudWatch.
    print('My Error Information')
    print('Type: %s' % exctype)
    print('Value: %s' % value)
    print('Traceback: %s' % tb)

def installUncaughtException(handler):
    sys.excepthook = handler

# Install the handler before any training code runs.
installUncaughtException(trapUncaughtException)
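A slightly richer variant (a sketch, not exactly what we ran) formats the full stack trace with the standard traceback module, which reads better in the log stream than the bare traceback object:

```python
import sys
import traceback

def trapUncaughtException(exctype, value, tb):
    # format_exception renders the whole stack trace as text, which is far more
    # readable in CloudWatch than printing the traceback object itself.
    print(''.join(traceback.format_exception(exctype, value, tb)))
    sys.stdout.flush()

sys.excepthook = trapUncaughtException
```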
Despite this, it seems there is no logging in CloudWatch that could lead us to a possible solution.
Could you provide some information about the model? What kind of model is this?
@cavdard it's basically a variation of this CNN https://github.com/keunwoochoi/music-auto_tagging-keras/tree/master/compact_cnn
@cavdard this may help. I have slightly modified the tf.Session code, adding some initializers:
import tensorflow as tf
from keras import backend as K

with tf.Session() as session:
    K.set_session(session)  # register the session with Keras
    session.run(tf.global_variables_initializer())
    session.run(tf.tables_initializer())
and now at least I can see that one GPU (I assume gpu:0) is being used, according to the instance metrics 👍 I will investigate whether this now helps to make the multi-GPU setup work (since I can see a GPU loaded in TensorFlow, but without detailed logging I cannot be sure that more than one GPU is in use at this time...).
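For anyone who wants to double-check how many GPUs TensorFlow actually sees inside the container, here is a minimal sketch (not part of our training code) that writes the device list into the job log:

```python
from tensorflow.python.client import device_lib

def available_gpus():
    # Every GPU TensorFlow can see is reported with device_type == 'GPU'
    # (names like '/device:GPU:0'); printing this puts it in the CloudWatch stream.
    return [d.name for d in device_lib.list_local_devices()
            if d.device_type == 'GPU']

print(available_gpus())
```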
Hope this helps other devs.
I apologize for the frustrating experience and delayed response.
For reference to others with the same issue:
https://forums.aws.amazon.com/thread.jspa?messageID=881541
https://forums.aws.amazon.com/thread.jspa?messageID=881540
https://stackoverflow.com/questions/53488870/sagemaker-fails-when-using-multi-gpu-with-keras-utils-multi-gpu-model/53754450#53754450
The discussion will continue in the stackoverflow post.
@ChoiByungWook Okay, I'm closing the issue here then; let's continue on Stack Overflow.
System Information
Describe the problem
Running AWS SageMaker with a custom model, the TrainingJob fails with an Algorithm Error when using Keras with a TensorFlow backend in a multi-GPU configuration.
Minimal repro / logs
The parallel model loading via multi_gpu_model fails. There is no further error or exception in the CloudWatch logs. The same configuration works properly on a local machine with 2x NVIDIA GTX 1080 and the same Keras/TensorFlow backend.
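As an illustration only (the placeholder Dense model below stands in for the real CNN), the failing pattern is the standard keras.utils.multi_gpu_model call:

```python
from keras.models import Sequential
from keras.layers import Dense
from keras.utils import multi_gpu_model

# Placeholder network standing in for the real compact CNN.
model = Sequential([
    Dense(64, activation='relu', input_shape=(100,)),
    Dense(10, activation='softmax'),
])

# Replicating the model across the instance's 4 GPUs is the step that hangs.
parallel_model = multi_gpu_model(model, gpus=4)
parallel_model.compile(loss='categorical_crossentropy', optimizer='adam')
```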
According to the SageMaker documentation and tutorials, the multi_gpu_model utility works when the Keras backend is MXNet, but I did not find any mention of the same multi-GPU configuration when the backend is TensorFlow. Below is some of the logging emitted before the TrainingJob hangs; this logging repeats twice.
Before that there is some logging info about each GPU, which repeats 4 times.
According to the logging, all 4 GPUs are visible and loaded in the TensorFlow Keras backend. After that no application logging follows; the TrainingJob status stays InProgress for a while, and then it becomes Failed with the same Algorithm Error.
Looking at the CloudWatch metrics, I can see some of them at work. Specifically, GPU Memory Utilization and CPU Utilization are ok, while GPU Utilization is 0%. I have posted the question to the AWS forum here.