Open ZJLi2013 opened 2 years ago
I'm an engineer from NVIDIA. I'm not able to reproduce this issue on my end. Basically, I built the image with bash scripts/docker/build.sh and was then able to start training inside the container (roughly the flow sketched below). Does this issue happen for you in the first stage of training, where 171x171 images are used, or in the last stage, where full-resolution images are in use?
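For reference, the flow I used was roughly the following; treat the image tag and dataset mount as placeholders, since I'm not spelling out here exactly what build.sh tags the image as or where your ImageNet TFRecords live:

```bash
# Rough sketch of the flow (image tag and dataset path are placeholders --
# use the tag produced by scripts/docker/build.sh and your actual data path).
bash scripts/docker/build.sh
docker run --gpus all -it --rm \
  -v /path/to/imagenet-tfrecords:/data \
  -v "$PWD":/workspace -w /workspace \
  <image-tag> bash
# inside the container:
bash efficientnet_v2/S/training/AMP/convergence_8xA100.sh
```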
I tried further to reproduce the issue you had: if I run the script with sbatch => it works; if I run it on an interactive node => it runs into OOM; if I run it on a lana80h machine => it still works. Are you perhaps running the code on an interactive node?
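One thing worth checking on an interactive node is whether the GPUs' memory is actually free before the script starts; on a shared interactive node, other processes already holding memory could explain the OOM. For example:

```bash
# Print per-GPU memory before launching; on an idle A100-80G node each GPU
# should report roughly 81920 MiB total and close to 0 MiB used.
nvidia-smi --query-gpu=index,name,memory.total,memory.used --format=csv
```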
I'm using the training script from https://github.com/NVIDIA/DeepLearningExamples/tree/master/TensorFlow2/Classification/ConvNets/efficientnet_v2/S/training/AMP/convergence_8xA100.sh on my A100-80G node, with no changes to the parameters.
I am getting a lot of out-of-memory errors; it looks like the default batch size (460) cannot fit on the GPU.
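If it really is just the default batch size that does not fit, one sanity check would be to rerun with a smaller per-GPU batch. The flag name below is an assumption, not verified; check convergence_8xA100.sh and main.py for the actual argument that sets the training batch size:

```bash
# Hypothetical workaround sketch: run a copy of the script with a smaller
# per-GPU batch size. The --train_batch_size flag name is assumed here.
cp efficientnet_v2/S/training/AMP/convergence_8xA100.sh lower_bs.sh
sed -i 's/--train_batch_size 460/--train_batch_size 256/' lower_bs.sh
bash lower_bs.sh
```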
Expected behavior
The official AMP/convergence_8xA100.sh script should run successfully on an A100-80G node without any changes.
Environment
I am using the Dockerfile from the repo; the image looks all right.
Thanks ZJ