Azure / MS-AMP

Microsoft Automatic Mixed Precision Library
https://azure.github.io/MS-AMP/
MIT License

Optimizer datatype #170

Closed: brianchmiel closed this issue 2 months ago

brianchmiel commented 7 months ago

Hi,

I have some questions related to the paper:

1. Which FP8 format (E4M3 / E5M2) do you use for the first Adam moment? Do you use delayed scaling or just-in-time scaling?
2. What about the weight gradient: do you use E4M3 with delayed scaling?

wkcn commented 7 months ago

Thanks for your attention to our work!

  1. The datatype of the first moment is FP8-E4M3, and that of the second moment is FP16. Both are scaling tensors with scaling factors, which are computed just in time (see the sketch below).
  2. The weight gradient is an FP8-E4M3 scaling tensor with a just-in-time scaling factor.
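
For context on the distinction the question raises: a just-in-time scaling factor is derived from the tensor's current statistics at the moment of quantization, whereas delayed scaling reuses an amax history recorded in earlier steps. Below is a minimal illustrative sketch of just-in-time scaling, not MS-AMP's actual implementation; the helper name `jit_scaling_factor` is made up, and `E4M3_MAX = 448` is the largest finite magnitude representable in FP8-E4M3:

```python
import torch

E4M3_MAX = 448.0  # largest finite magnitude representable in FP8-E4M3

def jit_scaling_factor(t: torch.Tensor) -> torch.Tensor:
    """Just-in-time scaling: the factor comes from the tensor's current
    absolute maximum, so its value range is mapped onto the E4M3 range
    at the moment of quantization."""
    amax = t.abs().max().clamp(min=1e-12)  # guard against division by zero
    return E4M3_MAX / amax

# Delayed scaling, by contrast, would reuse an amax recorded from
# earlier iterations, e.g. scale = E4M3_MAX / max(amax_history).
grad = torch.randn(1024)
scale = jit_scaling_factor(grad)
scaled = (grad * scale).clamp(-E4M3_MAX, E4M3_MAX)  # ready for the FP8 cast
```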
brianchmiel commented 6 months ago

Thank you for your answer. So what is the reason you define the first moment as a uint8 datatype?

https://github.com/Azure/MS-AMP/blob/0a2cd721fa68ee725e3b2fb132df02ddb8069d62/msamp/__init__.py#L81C9-L81C23

wkcn commented 6 months ago

> Thank you for your answer. So what is the reason you define the first moment as a uint8 datatype?
>
> https://github.com/Azure/MS-AMP/blob/0a2cd721fa68ee725e3b2fb132df02ddb8069d62/msamp/__init__.py#L81C9-L81C23

There is no native FP8 datatype in PyTorch yet, so we use uint8 to store the FP8-E4M3 values.
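
As an aside, newer PyTorch versions (2.1 and later) do expose `torch.float8_e4m3fn`; it did not exist when this thread was written, which is why the raw 8-bit patterns are kept in a uint8 tensor. A small sketch of the uint8-storage idea, assuming PyTorch >= 2.1 (illustrative only, not MS-AMP's code):

```python
import torch

x = torch.randn(8)

fp8 = x.to(torch.float8_e4m3fn)   # quantize to E4M3 (PyTorch >= 2.1)
raw = fp8.view(torch.uint8)       # same bytes, reinterpreted as uint8

# An optimizer state can carry `raw` (plus a scaling factor) and
# reinterpret the bytes back to FP8 before dequantizing:
restored = raw.view(torch.float8_e4m3fn).to(torch.float32)
```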

tocean commented 2 months ago

Closing the issue since there has been no activity for a long time.