Hi @anti-machinee, thanks for posting your issue. Our first recommendation is to reduce the batch size for your run: start small and work your way back up. For us to help you further, can you please share the model you are running, the execution command (including the batch size), and the AMI you are using on DL1 to run your experiment?
@greg-serochi I reduced the batch size and it works. I am using the ArcFace network (insightface). Here is my execution command:
I reduced the batch size from 256 to 128, so the model input is only (128, 3, 112, 112). But each Gaudi (the server has 8 Gaudi devices) has ~32 GB of memory. Is it normal for the training process to crash at batch size 256? I think it is abnormal, because the input size is not that large.
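As a rough sanity check, the input tensor itself is indeed tiny compared to 32 GB; in ArcFace training the memory usually goes to activations across all layers, gradients, optimizer state, and the classification head, all of which scale with batch size and (for the head) the number of identities. A minimal back-of-the-envelope sketch of the arithmetic, where the ~85k class count is an assumption based on MS1M-scale datasets and not the poster's actual data:

```python
# Illustrative float32 memory arithmetic; the class count of ~85k is an
# assumed MS1M-scale figure, not taken from this thread.
def tensor_mb(*shape, bytes_per_elem=4):
    """Size of a float32 tensor with the given shape, in megabytes."""
    n = 1
    for d in shape:
        n *= d
    return n * bytes_per_elem / (1024 ** 2)

batch, emb_dim, num_classes = 256, 512, 85_000

print(f"input  ({batch}, 3, 112, 112):   {tensor_mb(batch, 3, 112, 112):8.1f} MB")
print(f"logits ({batch}, {num_classes}): {tensor_mb(batch, num_classes):8.1f} MB")
print(f"fc weight ({emb_dim}, {num_classes}): {tensor_mb(emb_dim, num_classes):8.1f} MB")
```

The input batch works out to under 40 MB, so by itself it cannot explain an out-of-memory crash; the per-batch activations and the per-batch logits of a large ArcFace head are what grow when the batch size doubles.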
Hi @anti-machinee, I'm not sure the full execution command line was posted; can you please repost it? We'd like to see whether you are running the model on all 8 Gaudis or just a single Gaudi.
@greg-serochi Sorry, I missed the command. Here is my command; it works fine. Thank you.
mpirun -n 8 --bind-to core --map-by slot:PE=6 --rank-by core --report-bindings --allow-run-as-root /usr/bin/python3.8 train.py --script train.py --device-n hpu --dl-worker-type MP --run-lazy-mode True --dist True --nproc-per-node 8 --world-size 8 --dist-url env:// --batch-size 128 --workers 12
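For anyone else adapting this launch: it runs 8 processes, one per Gaudi, with a per-device batch of 128 (global batch 1024), and `--run-lazy-mode True` means each iteration should flush the accumulated graph with `mark_step()`. A minimal sketch of a single-device lazy-mode training step under those assumptions; the model, loss, and optimizer here are placeholders, not the actual insightface script:

```python
import torch
import habana_frameworks.torch.core as htcore  # Habana PyTorch bridge

device = torch.device("hpu")

# Placeholder model/criterion/optimizer; the real script uses an ArcFace
# backbone and margin-based loss instead of this toy linear classifier.
model = torch.nn.Linear(3 * 112 * 112, 512).to(device)
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

def train_step(images, labels):
    # images: (batch, 3, 112, 112); flattened for this toy model.
    optimizer.zero_grad()
    logits = model(images.flatten(1).to(device))
    loss = criterion(logits, labels.to(device))
    loss.backward()
    htcore.mark_step()   # flush the lazy-mode graph after backward
    optimizer.step()
    htcore.mark_step()   # flush again after the optimizer update
    return loss.item()
```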
Hi, I am using HabanaAI to train a model on Gaudi hardware (DL1 - AWS). It works fine until loss.backward(), and here is the error:
Can anyone help me? Thank you