Open szhengac opened 5 years ago
@zixuanweeei any same issue on CPU side?
CPU (both w/ and w/o MKL-DNN) does have this issue. I will take a look.
@zachgk assign @szha @eric-haibin-lin any ideas? similar to your divergence issues?
@samskalicky this issue is found when we debug the BERT divergence issue.
Description
This issue was first discovered when I trained the transformer model in GluonNLP. When I doubled the number of gradient accumulation steps from 16 to 32 without increasing the step size, the model diverged at around 15 epochs. I tried several runs and the model diverged in all of them. This is strange, as the step size was not increased.

To make a comparison, I disabled grad_req='add' and created a separate dict for storing the accumulated gradients, implemented in the training script as
acc_grad[:] += parameter.grad()
acc_grad is then written to the gradient buffer of the corresponding parameter before trainer.step(). With such "manual" gradient accumulation, the model did not diverge.

Then, to see how the two accumulated results differ, I disabled dropout, loaded the same initial parameters, and processed the same data for several iterations. The output below shows the maximum differences, as relative differences (%), for the aggregated gradients. The results have been filtered so that only large values are shown. The relative differences (%) of beta and gamma in LayerNorm are both zero. I also checked what the difference looks like on a single GPU.

As can be seen, most of the significant differences come from the weight and embedding matrix parameters. Also, although their atol is small (1e-10), the gradients of the transformer are typically in the range (1e-7, 1e-11), which is also quite small, so such a small atol can lead to a large difference in optimization behavior with an adaptive gradient optimizer such as Adam.

As reproducing the above result with the transformer is computationally expensive, I wrote some small cases that produce similar results. I tested the code on a Mac and on a G4 instance. On the Mac, I also tried disabling multi-threaded execution by using the Naive Engine, and I obtained the same output.
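For concreteness, here is a minimal sketch of the "manual" accumulation path described above, using a toy Gluon model (the network, loss, and random data are illustrative stand-ins, not the actual transformer script):

```python
import mxnet as mx
from mxnet import autograd, gluon

# Toy setup; only the accumulation pattern matters here.
net = gluon.nn.Dense(10, in_units=1000)
net.initialize()
trainer = gluon.Trainer(net.collect_params(), 'adam', {'learning_rate': 1e-3})
loss_fn = gluon.loss.L2Loss()
micro_batches = [(mx.nd.random.normal(shape=(8, 1000)),
                  mx.nd.random.normal(shape=(8, 10))) for _ in range(4)]

params = [p for p in net.collect_params().values() if p.grad_req != 'null']
acc_grads = [mx.nd.zeros_like(p.data()) for p in params]

for data, label in micro_batches:
    with autograd.record():
        loss = loss_fn(net(data), label)
    loss.backward()                       # grad_req stays 'write', so each backward overwrites p.grad()
    for acc, p in zip(acc_grads, params):
        acc[:] += p.grad()                # accumulate in a separate buffer

for acc, p in zip(acc_grads, params):
    p.grad()[:] = acc                     # copy the accumulated gradient back before the update
trainer.step(8 * len(micro_batches))
```

Since grad_req stays at the default 'write', the accumulation happens entirely in the separate acc_grad buffers; this is the variant that did not diverge.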
Error Message
Mac:
[[1.9729288e-14 4.5819014e-14 0.0000000e+00 ... 0.0000000e+00 0.0000000e+00 1.0834442e-13]
 [1.6176574e-14 4.5819014e-14 0.0000000e+00 ... 0.0000000e+00 0.0000000e+00 9.4133563e-14]]
<NDArray 2x1000 @cpu(0)> ('dense1_weight', 'rtol:11.1050821841%, atol:1.31473879093e-14')

[[1.45799305e-15 5.47375190e-17 1.03722522e-14 ... 1.08555370e-14 2.21442788e-14 4.62161266e-19]
 [1.56577407e-15 7.24517572e-17 1.39517583e-14 ... 1.20759946e-14 2.59407703e-14 4.74481708e-19]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 ... 0.00000000e+00 0.00000000e+00 0.00000000e+00]
 ...
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 ... 0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 ... 0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [2.78533020e-15 1.28712294e-16 2.14917466e-14 ... 2.03898567e-14 3.85824117e-14 9.28534639e-19]]
<NDArray 1000x10 @cpu(0)> ('dense0_weight', 'rtol:100.0%, atol:1.42072226379e-14')

G4 instance:
[[0.00683938 0.00235045 0.00851564 ... 0.00857326 0.01197887 0.00889575]
 [0.00080452 0.01059473 0.0015173 ... 0.01655126 0.00477865 0.00016829]
 [0.00691246 0.00388029 0.00031795 ... 0.01003232 0.00479008 0.00864812]
 ...
 [0.01424259 0.00398458 0.01044655 ... 0.02097399 0.01090044 0.00375169]
 [0.00423017 0.0020052 0.00448378 ... 0.00450475 0.00027684 0.00431689]
 [0.01112881 0.01310032 0.02486911 ... 0.00068935 0.00403444 0.00529187]]
<NDArray 13x512 @gpu(0)> embedding0_weight rtol:0.6463079713284969%, atol:7.525086402893066e-07
To Reproduce
https://github.com/szhengac/Grad_Accumulation
Steps to reproduce
python train.py 0 1000, python train.py 1 1000, and python eval.py
python train.py 0 128 0, python train.py 1 128 0, and python eval.py
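For comparison, the grad_req='add' path, which I assume is what these scripts toggle against, looks roughly like this with the same toy setup; note that the gradient buffers must be cleared manually after the update:

```python
import mxnet as mx
from mxnet import autograd, gluon

# Same toy setup as above; only the grad_req='add' pattern matters here.
net = gluon.nn.Dense(10, in_units=1000)
net.initialize()
net.collect_params().setattr('grad_req', 'add')   # accumulate directly in the gradient buffers
trainer = gluon.Trainer(net.collect_params(), 'adam', {'learning_rate': 1e-3})
loss_fn = gluon.loss.L2Loss()

for step in range(4):                             # 4 accumulation steps
    data = mx.nd.random.normal(shape=(8, 1000))
    label = mx.nd.random.normal(shape=(8, 10))
    with autograd.record():
        loss = loss_fn(net(data), label)
    loss.backward()                               # each backward adds onto the existing gradients

trainer.step(8 * 4)
net.collect_params().zero_grad()                  # not automatic when grad_req='add'
```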
What have you tried to solve it?
Environment
We recommend using our script for collecting the diagnostic information. Run the following command and paste the outputs below: