HabanaAI / Model-References

Reference models for Intel(R) Gaudi(R) AI Accelerator

Habana Gaudi HPUs Training time improvement #26

Closed — purvang3 closed this issue 1 year ago

purvang3 commented 1 year ago

Hi, I am using the Habana® Deep Learning Base AMI (TensorFlow 2.9.1) on an AWS EC2 instance to train an image segmentation model. Training on the same dataset with a comparable number of HPUs/GPUs and similar per-device memory (40 GB vs. 32 GB) takes longer than on NVIDIA A100 GPUs. The key modifications relative to the NVIDIA GPU training script are the use of Horovod, the Habana modules, and MPI.
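For context, when scaling out with Horovod as described above, the effective global batch size is the per-device batch times the number of workers, and the learning rate is typically scaled linearly to match (the standard Horovod recipe, where `hvd.size()` supplies the worker count at runtime). A framework-agnostic sketch of that arithmetic, with hypothetical names:

```python
def scaled_hyperparams(per_device_batch, base_lr, num_workers):
    """Global batch size and linearly scaled LR for data-parallel training.

    Mirrors the common Horovod recipe lr = base_lr * hvd.size().
    Illustrative only; at runtime num_workers would come from hvd.size().
    """
    return per_device_batch * num_workers, base_lr * num_workers

# Example: per-device batch of 32 across 8 HPUs.
global_batch, lr = scaled_hyperparams(32, 1e-3, 8)
assert global_batch == 256          # effective batch seen per optimizer step
assert abs(lr - 8e-3) < 1e-12      # linearly scaled learning rate
```

When comparing throughput against a GPU baseline, it is the global batch (and the matching scaled learning rate) that should be held comparable, not just the per-device batch.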

Below is a link to the script, which also contains the command to run it:

https://drive.google.com/drive/folders/1Vq5f9lk_jiRlbkcwpqN_ecS3GC4gKO3W?usp=sharing

Please let me know if there are any modifications I need to make to the code to make it run faster, so that I can compare results with other vendors.

Thank you

purvang3 commented 1 year ago

Also, when the above program runs, HPU utilization is very low, as shown below. But if I increase the batch size, I get an OOM error.

[screenshot: HPU utilization]
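One general workaround for hitting OOM at larger batch sizes (not from this thread, just a common technique) is gradient accumulation: keep the per-step micro-batch small, but average gradients over several steps before applying an update, so the effective batch grows without the per-step memory. A minimal framework-agnostic sketch with a toy "gradient" function:

```python
def accumulate_gradients(micro_batches, grad_fn, accum_steps):
    """Average grad_fn over accum_steps micro-batches before applying.

    Illustrative pure-Python sketch; in a real training loop grad_fn would
    compute per-micro-batch gradients and the average would feed the optimizer.
    """
    total = 0.0
    for batch in micro_batches[:accum_steps]:
        total += grad_fn(batch)
    return total / accum_steps

# Toy gradient: the mean of the batch. Accumulating over [1, 2] and [3, 4]
# gives the same result as one step over the combined batch [1, 2, 3, 4].
grads = accumulate_gradients([[1, 2], [3, 4]],
                             lambda b: sum(b) / len(b), 2)
assert grads == 2.5
```

The trade-off is more steps per effective batch (lower throughput per update), but it avoids the OOM while keeping optimization behavior close to the larger batch.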
ssarkar2 commented 1 year ago

Hi @purvang3, please refer to the response here.