**Marks101** opened this issue 1 month ago (Open)
I ran into the same problem. In one of my training configurations, the training losses for the first two iterations with te1.4 and te1.7 are printed below. The first-iteration loss with te1.7 is above 90, which is clearly abnormal.
```
te1.4 1.003214E+00 1.010536E+00
te1.7 9.757925E+01 1.013275E+00
```
I then printed the loss of each micro-batch: the loss of the first micro-batch was normal, but the losses of the second and all later micro-batches were abnormally large. After some troubleshooting, I found that the `scale_inv` updates in the two versions of `LayerNormLinear` were not aligned, which is probably the cause of these weird results.
By the way, after upgrading to te1.9, this problem no longer occurs.
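To illustrate the diagnosis above, here is a deliberately simplified, hypothetical model of FP8 scaling (this is not Transformer Engine code): values are stored as `q = round(x * scale)` and recovered as `x ≈ q * scale_inv`. It shows why a quantized tensor that is cached on the first micro-batch must be paired with the `scale_inv` from the same quantization step, not with a later one.

```python
# Toy sketch of FP8 delayed scaling (NOT Transformer Engine internals):
# store q = round(x * scale), recover x ~= q * scale_inv with scale_inv = 1/scale.

def quantize(x, scale):
    return round(x * scale)

def dequantize(q, scale_inv):
    return q * scale_inv

weight = 0.5

# Micro-batch 1: quantize the weight and cache the quantized value.
scale_mb1 = 100.0
q_cached = quantize(weight, scale_mb1)           # 50
correct = dequantize(q_cached, 1.0 / scale_mb1)  # 0.5 -- consistent pair

# Micro-batch 2: the scaling buffers are updated, but the cached quantized
# weight is reused. Pairing it with the NEW scale_inv gives a wrong value.
scale_mb2 = 4.0
wrong = dequantize(q_cached, 1.0 / scale_mb2)    # 12.5 -- grossly inflated

assert correct == 0.5
assert wrong == 12.5
```

A mismatch of this kind from the second micro-batch onward would produce exactly the pattern reported: a normal first micro-batch followed by abnormally large losses.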
Hello team,
we have been debugging large-scale training instabilities with FP8 and noticed that they started when we updated Transformer Engine from v1.2.1 to v1.7. Taking a closer look at the trainings, I noticed that the first iteration shows a loss that is larger than in trainings with the old version or in trainings with BF16. I was able to reproduce this with this minimal example:
Executing this piece of code with version v1.2.1 gives me:
In comparison to version 1.7:
After the first micro-batch, the `Linear` layer produces a wrong result in version 1.7. Could you please try to reproduce this? This is specifically connected to the case where `is_first_microbatch` is used. The same bug applies to `LayerNormLinear` and thus to `MultiheadAttention`, ... We bisected this and came to the conclusion that it started with #575 and got (by coincidence?) fixed with the refactorings in #820; it seems to be connected to the update of the FP8 buffers. I am not 100% sure, but it looks to me like information from old iterations is reused during training, and this can cause instabilities. Overall, the current version v1.8 is not affected. Still, if you are able to reproduce this, and considering that this is a "silent" bug that caused heavy instabilities on our side, it might be worth adding it to the "known issues" section. Please let me know if you need more information from our side.
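For context on why the FP8 buffer update matters here, below is a toy sketch (again, not TE internals, and the window length and constants are made up for illustration) of how a delayed-scaling recipe derives the scaling factor from a rolling amax history. Because `scale_inv` can change on every update, any tensor cached in FP8 via `is_first_microbatch` must keep the `scale_inv` that was current when it was quantized, not the latest one.

```python
# Toy model of a delayed-scaling update (NOT Transformer Engine code).
FP8_E4M3_MAX = 448.0  # largest representable magnitude in the e4m3 format

def update_scale(amax_history):
    """Derive scale/scale_inv from the rolling amax history."""
    amax = max(amax_history)
    scale = FP8_E4M3_MAX / amax
    return scale, 1.0 / scale

history = []
scale_invs = []
for amax in [2.0, 8.0, 8.0, 4.0]:       # amax observed per micro-batch
    history = (history + [amax])[-2:]   # rolling window of length 2
    _, scale_inv = update_scale(history)
    scale_invs.append(scale_inv)

# scale_inv drifts as the amax history evolves, so a stale cached FP8 tensor
# paired with the current scale_inv is dequantized incorrectly.
```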
More details on our setup: DGX H100, CUDA 12.2, Torch 2.3.0