HabanaAI / Model-References

Reference models for Intel(R) Gaudi(R) AI Accelerator

What is the actual usable memory of a single Gaudi? #13

Closed anti-machinee closed 2 years ago

anti-machinee commented 2 years ago

I run a model with 50M parameters and tried two batch sizes, 96 and 128. The server crashes with bs = 128 but works fine with bs = 96. My 1080 with 12 GB of memory can handle batch size 96 with the same parameters, yet a single Gaudi has 32 GB. Please support me, thank you @greg-serochi

greg-serochi commented 2 years ago

hi @anti-machinee; there may not be a simple answer; it can depend on the model and where it's running (CPU vs. Gaudi).

For us to provide a good answer here, we need your help with some additional info:

  1. What is the model you are using?
  2. If you are running on an AWS DL1 instance, what is the AMI and/or Docker setup you are using?
  3. Can you provide log files of the crash when you set the batch size to 128?
anti-machinee commented 2 years ago

@greg-serochi Here is my information

  1. I use iresnet https://github.com/deepinsight/insightface/blob/master/recognition/arcface_torch/backbones/iresnet.py
  2. For the AMI and Docker setup I follow https://docs.habana.ai/en/latest/Installation_Guide/DLAMI.html
  3. I did not save the whole log file, but it contains RuntimeError: FATAL ERROR :: MODULE:SYNHELPER workspace Allocation of size ::14493741952 failed
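For scale, the failed workspace allocation in that error message is roughly 13.5 GiB, which is a single allocation request on top of whatever the model weights, gradients, and activations already occupy:

```python
# Convert the failed allocation size from the error message into GiB.
# 14493741952 is taken verbatim from the log line above.
failed_alloc_bytes = 14493741952
failed_alloc_gib = failed_alloc_bytes / 2**30  # 1 GiB = 2**30 bytes

print(f"{failed_alloc_gib:.1f} GiB")  # prints "13.5 GiB"
```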
greg-serochi commented 2 years ago

hi @anti-machinee, is your final objective to maximize the batch size for your model? This iresnet model still works with BS=96, correct? In that case, note that batch sizes may not be directly comparable across different hardware architectures.

Note: in a future release we'll be providing additional APIs for better visibility into on-card memory usage. At this time, it's best to slowly increase the batch size until you find the threshold where it transitions from passing to failing with OOM.
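The trial-and-error search described above can be automated with a binary search instead of increasing the batch size one step at a time. This is a minimal sketch, not Habana-specific API: `try_step` is a hypothetical callback you would implement to run one training step at a given batch size, raising `RuntimeError` on an allocation failure (as the SYNHELPER error above does).

```python
def find_max_batch_size(try_step, low=1, high=1024):
    """Binary-search the largest batch size for which try_step succeeds.

    try_step(bs) should run one training step at batch size `bs` and
    raise RuntimeError on an out-of-memory / allocation failure.
    Returns the largest passing batch size in [low, high], or None if
    even `low` fails.
    """
    best = None
    while low <= high:
        mid = (low + high) // 2
        try:
            try_step(mid)      # passed: remember it and search higher
            best = mid
            low = mid + 1
        except RuntimeError:   # failed: search lower
            high = mid - 1
    return best

# Toy stand-in for a real training step: pretend anything above
# batch size 96 fails to allocate (the threshold reported in this issue).
def fake_step(bs):
    if bs > 96:
        raise RuntimeError("workspace Allocation failed")

print(find_max_batch_size(fake_step))  # prints 96
```

In practice each probe should build and run the model at that batch size (including the backward pass, which usually peaks memory), and you may want to back off a step or two from the found threshold to leave headroom for fragmentation.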