NVIDIA / DeepLearningExamples

State-of-the-Art Deep Learning scripts organized by models - easy to train and deploy with reproducible accuracy and performance on enterprise-grade infrastructure.

[EfficientNetV2/Tensorflow2] oom during training #1175

Open ZJLi2013 opened 2 years ago

ZJLi2013 commented 2 years ago

I'm using the training script from https://github.com/NVIDIA/DeepLearningExamples/tree/master/TensorFlow2/Classification/ConvNets/efficientnet_v2/S/training/AMP/convergence_8xA100.sh on my A100-80G node, with no changes to the parameters.

I am getting a lot of errors like:

```
7: [1,5]<stderr>:  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/eager/execute.py", line 59, in quick_execute
7: [1,5]<stderr>:    tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
7: [1,5]<stderr>:tensorflow.python.framework.errors_impl.ResourceExhaustedError:  Out of memory while trying to allocate 22542867840 bytes.
7: [1,5]<stderr>:        [[{{node cluster_0_1/xla_run}}]]
7: [1,5]<stderr>:Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.
```

It looks like the default batch size (460) can't fit in GPU memory.
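For reference, here is a minimal sketch of how on-demand GPU memory allocation can be enabled in TF2, in case up-front pre-allocation is contributing to the OOM. This uses stock TensorFlow APIs and is not part of the repo's scripts:

```python
# Minimal sketch, stock TF2 API (not from the repo's scripts):
# make TensorFlow grow GPU memory on demand instead of
# pre-allocating nearly all of it at startup. Must run before
# the first GPU op initializes the device context.
import tensorflow as tf

gpus = tf.config.list_physical_devices("GPU")
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)
print(f"visible GPUs: {len(gpus)}")
```

The same behavior can also be requested without code changes via the TF_FORCE_GPU_ALLOW_GROWTH=true environment variable.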

Expected behavior

Without any changes to the official AMP/convergence_8xA100.sh script, training should run successfully on an A100-80G node.

Environment: I am using the Dockerfile from the repo; the image looks all right.

Thanks ZJ

ntajbakhsh commented 2 years ago

I'm an engineer from NVIDIA, and I'm not able to reproduce this issue on my end. I built the image with bash scripts/docker/build.sh and was then able to start training within the container. Does this issue happen in the first stage of training, where 171x171 images are used, or in the last stage, where full-resolution images are used?

ntajbakhsh commented 2 years ago

I tried further to reproduce the issue you had: if I run the script with sbatch, it works; if I run it on an interactive node, it runs into OOM; if I run it on a lana80h machine, it still works. Do you perhaps run the code on an interactive node?
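In case it helps narrow this down, a quick diagnostic that can be run inside the container before training starts (a plain nvidia-smi query from Python, nothing repo-specific): on a shared interactive node, other processes may already be holding part of each GPU's 80 GB, which would explain why the same script OOMs there but not under sbatch.

```python
# Diagnostic sketch (not from the repo): print how much memory is
# already in use on each GPU before training starts, so runs under
# sbatch and on an interactive node can be compared.
import subprocess

out = subprocess.run(
    ["nvidia-smi",
     "--query-gpu=index,memory.used,memory.total",
     "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
)
print(out.stdout)
```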