Closed brianchmiel closed 2 months ago
Thanks for your attention to our work!
Thank you for your answer. So, what is the reason you define the first moment with the uint8 datatype:
There is no native FP8 datatype in PyTorch yet, so we use uint8 as the storage type for the FP8-E4M3 values.
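To illustrate the idea of keeping FP8-E4M3 bit patterns in a uint8 buffer, here is a minimal NumPy sketch. It is not the code from the repository: the function names (`decode_e4m3`, `quantize_to_uint8`, `dequantize_from_uint8`) and the per-tensor `scale` argument are hypothetical, and it assumes the common E4M3 layout (1 sign bit, 4 exponent bits, 3 mantissa bits, bias 7, no infinities, maximum finite value 448). Rounding is done by a nearest-value lookup, which may break ties differently from hardware round-to-nearest-even.

```python
import numpy as np

def decode_e4m3(byte: int) -> float:
    """Decode one E4M3 bit pattern (0..255) to a Python float.

    Assumed layout: 1 sign bit, 4 exponent bits (bias 7), 3 mantissa bits,
    no infinities, and S.1111.111 reserved for NaN.
    """
    sign = -1.0 if (byte >> 7) & 1 else 1.0
    exp = (byte >> 3) & 0xF
    man = byte & 0x7
    if exp == 0xF and man == 0x7:
        return float("nan")
    if exp == 0:
        # Subnormal: no implicit leading 1, fixed exponent of -6.
        return sign * (man / 8.0) * 2.0 ** (-6)
    return sign * (1.0 + man / 8.0) * 2.0 ** (exp - 7)

# Lookup table: value represented by each of the 256 possible bytes.
_TABLE = np.array([decode_e4m3(b) for b in range(256)], dtype=np.float32)
# Bit patterns that decode to a finite value (everything except the two NaNs).
_FINITE_CODES = np.flatnonzero(~np.isnan(_TABLE)).astype(np.uint8)
_FINITE_VALUES = _TABLE[_FINITE_CODES]

def quantize_to_uint8(x, scale: float = 1.0) -> np.ndarray:
    """Scale finite inputs, saturate to +-448, and return E4M3 codes as uint8."""
    scaled = np.clip(np.asarray(x, dtype=np.float32) * scale, -448.0, 448.0)
    # Nearest representable value; ties resolve to the lower code index,
    # which is an approximation of hardware rounding.
    idx = np.abs(scaled[..., None] - _FINITE_VALUES).argmin(axis=-1)
    return _FINITE_CODES[idx]

def dequantize_from_uint8(codes: np.ndarray, scale: float = 1.0) -> np.ndarray:
    """Map uint8 E4M3 codes back to float32 and undo the scaling."""
    return _TABLE[codes] / np.float32(scale)

if __name__ == "__main__":
    moment = np.array([0.0, 1.0, -2.5, 0.3], dtype=np.float32)
    codes = quantize_to_uint8(moment)
    print(codes.dtype, codes)
    print(dequantize_from_uint8(codes))
```

The stored array has dtype uint8, so it occupies one byte per element; the bytes only become meaningful as FP8 values once they are decoded again.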
Closing the issue since there has been no activity for a long time.
Hi,
I have some questions related to the paper:
1) Which FP8 format (E4M3 / E5M2) do you use for the first Adam moment? Do you use delayed scaling or just-in-time scaling?
2) What about the weight gradient: do you use E4M3 with delayed scaling?