HabanaAI / Model-References

Reference models for Intel(R) Gaudi(R) AI Accelerator

Error in loss.backward() #10

Closed: anti-machinee closed this issue 2 years ago

anti-machinee commented 2 years ago

Hi, I am using HabanaAI to train a model on Gaudi hardware (DL1 on AWS). It works fine until loss.backward(), and here is the error:

File "/FaceRecognition/trainer/trainer.py", line 70, in _train_epoch
    loss.backward()
File "/usr/local/lib/python3.8/dist-packages/torch/_tensor.py", line 307, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/usr/local/lib/python3.8/dist-packages/torch/autograd/__init__.py", line 154, in backward
    Variable._execution_engine.run_backward(
RuntimeError: FATAL ERROR :: MODULE:SYNHELPER workspace Allocation of size ::14493510400 failed!

Can anyone help me? Thank you.

greg-serochi commented 2 years ago

Hi @anti-machinee, thanks for posting your issue. Our first recommendation is to reduce the batch size for your run: start small and work your way back up. For us to help further, can you please share the model you are running, the execution command (including the batch size), and the AMI you are using on DL1 for this experiment?
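
As a starting point for that, here is a minimal sketch of such a batch-size sweep on the HPU device. It assumes the Habana PyTorch bridge is installed (so that importing habana_frameworks.torch.core registers the "hpu" device and provides lazy-mode mark_step()), and it uses a small placeholder model rather than the poster's actual ArcFace setup; it is an illustration, not the repository's training script.

    import torch
    import torch.nn as nn

    # Assumption: importing the Habana bridge registers the "hpu" device and
    # provides mark_step() for lazy-mode execution.
    import habana_frameworks.torch.core as htcore

    device = torch.device("hpu")

    # Placeholder model, not the ArcFace network from the thread.
    model = nn.Sequential(
        nn.Conv2d(3, 64, 3, padding=1),
        nn.ReLU(),
        nn.AdaptiveAvgPool2d(1),
        nn.Flatten(),
        nn.Linear(64, 1000),
    ).to(device)
    criterion = nn.CrossEntropyLoss()

    # Start small and work back up, as suggested above.
    for batch_size in (32, 64, 128, 256):
        try:
            x = torch.randn(batch_size, 3, 112, 112, device=device)
            y = torch.randint(0, 1000, (batch_size,), device=device)
            loss = criterion(model(x), y)
            loss.backward()
            htcore.mark_step()  # flush the lazy-mode graph so the step actually executes
            model.zero_grad(set_to_none=True)
            print(f"batch size {batch_size}: OK")
        except RuntimeError as err:
            # The workspace allocation failure surfaces as a RuntimeError in the
            # traceback above; depending on the driver it may not be recoverable.
            print(f"batch size {batch_size}: failed ({err})")
            break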

anti-machinee commented 2 years ago

@greg-serochi I reduced the batch size and it works. I am using the ArcFace network (insightface). Here is my execution command:

I reduced the batch size from 256 to 128, so the model input is just (128, 3, 112, 112). But each Gaudi (the server has 8 Gaudi devices) has about 32 GB of memory. Is it normal for training to crash at batch size 256? It seems abnormal to me, because the input tensor itself is not that large.
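
For scale, a bit of back-of-envelope arithmetic (rough numbers only, not a memory model of the Gaudi graph compiler): the fp32 input tensor is tens of MiB even at batch size 256, while the workspace allocation that failed in the traceback above is roughly 13.5 GiB, which suggests the pressure comes from activations, gradients, and compiler workspace rather than from the input batch itself.

    # Rough arithmetic only; the figures are taken from the thread above.
    fp32_bytes = 4

    input_128 = 128 * 3 * 112 * 112 * fp32_bytes   # input at batch size 128
    input_256 = 256 * 3 * 112 * 112 * fp32_bytes   # input at batch size 256
    failed_alloc = 14493510400                     # from the RuntimeError above

    print(f"input @ batch 128: {input_128 / 2**20:6.1f} MiB")               # ~18.4 MiB
    print(f"input @ batch 256: {input_256 / 2**20:6.1f} MiB")               # ~36.8 MiB
    print(f"failed workspace allocation: {failed_alloc / 2**30:.1f} GiB")   # ~13.5 GiB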

greg-serochi commented 2 years ago

Hi @anti-machinee, I'm not sure the full execution command line was posted, can you please repost it? We'd like to see whether you are running the model on all 8 Gaudi devices or just a single Gaudi.

anti-machinee commented 2 years ago

@greg-serochi Sorry, I missed the command. Here is my command; it works fine now. Thank you.

    mpirun -n 8 --bind-to core --map-by slot:PE=6 --rank-by core --report-bindings --allow-run-as-root /usr/bin/python3.8 train.py --script train.py --device-n hpu --dl-worker-type MP --run-lazy-mode True --dist True --nproc-per-node 8 --world-size 8 --dist-url env:// --batch-size 128 --workers 12
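
For reference, this launch starts 8 mpirun ranks, one per Gaudi, each with --batch-size 128. Assuming train.py treats --batch-size as a per-device value (the thread does not say either way), the effective global batch per step works out as in this small sketch:

    # Hypothetical bookkeeping for the launch above; assumes --batch-size is per device.
    world_size = 8          # mpirun -n 8 / --nproc-per-node 8 / --world-size 8
    per_device_batch = 128  # --batch-size 128
    print(world_size * per_device_batch)  # 1024 samples per global step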