apache / mxnet

Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Scala, Go, Javascript and more
https://mxnet.apache.org
Apache License 2.0

Grad Accumulation by using grad_req='add' has numerical issue #16686

Open szhengac opened 5 years ago

szhengac commented 5 years ago

Description

This issue was first discovered when I trained the transformer model in GluonNLP. When I doubled the number of gradient accumulation steps from 16 to 32 without increasing the step size, the model diverged at around 15 epochs. I tried several runs and the model diverged in all of them. This is strange because the step size was not increased. For comparison, I disabled grad_req='add' and created another dict to store the accumulated gradient, implemented as acc_grad[:] += parameter.grad() in the training script. acc_grad is then written to the gradient buffer of the corresponding parameter before trainer.step(). With this "manual" gradient accumulation, the model did not diverge.
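To make the comparison concrete, here is a minimal, framework-free sketch of the two accumulation schemes described above. It uses plain Python lists instead of MXNet arrays, and the function names `accumulate_in_place` and `accumulate_manually` are hypothetical. It does not reproduce the bug; it only shows what each scheme is expected to compute.

```python
# Toy model of the two accumulation schemes (no MXNet involved).
# Function names are hypothetical and exist only for illustration.

def accumulate_in_place(micro_batch_grads):
    """Mimic grad_req='add': each backward pass adds into the gradient buffer."""
    grad_buffer = [0.0] * len(micro_batch_grads[0])  # zeroed once per update
    for grads in micro_batch_grads:
        for i, g in enumerate(grads):
            grad_buffer[i] += g
    return grad_buffer


def accumulate_manually(micro_batch_grads):
    """Mimic overwriting the buffer on each backward pass and summing separately."""
    grad_buffer = [0.0] * len(micro_batch_grads[0])
    acc_grad = [0.0] * len(micro_batch_grads[0])
    for grads in micro_batch_grads:
        grad_buffer[:] = grads                 # backward overwrites the buffer
        for i, g in enumerate(grad_buffer):
            acc_grad[i] += g                   # acc_grad[:] += parameter.grad()
    grad_buffer[:] = acc_grad                  # written back before trainer.step()
    return grad_buffer


micro_batches = [[1.0e-9, -2.0e-10], [3.0e-9, 4.0e-10], [-5.0e-10, 6.0e-11]]
in_place = accumulate_in_place(micro_batches)
manual = accumulate_manually(micro_batches)
```

In this toy version both schemes perform the same additions in the same order, so `in_place` and `manual` are identical. That is the behavior one would expect from the framework as well, which is why the divergence seen only with grad_req='add' is surprising.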

Then, to see how the two accumulated results differ, I disabled dropout, loaded the same initial parameters, and processed the same data for several iterations. The following shows the maximum relative difference (%) of the aggregated gradients. The results have been filtered so that only large values are shown. The relative differences (%) of beta and gamma in LayerNorm are both zero.

image
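For readers who want to compute a comparable metric themselves, the sketch below shows one plausible way to obtain a maximum relative difference (%) and a maximum absolute difference between two gradient vectors. The exact formula used in the evaluation script is not stated here, so the definition, the function name `max_differences`, and the choice to skip entries whose reference value is zero are all assumptions.

```python
def max_differences(grad_a, grad_b):
    """Return (max relative difference in %, max absolute difference).

    Assumed definition: |a - b| / |b| per element, with grad_b as the
    reference. Entries where the reference is exactly zero are skipped
    for the relative measure (an assumption, not the issue's script).
    """
    max_rel = 0.0
    max_abs = 0.0
    for a, b in zip(grad_a, grad_b):
        abs_diff = abs(a - b)
        max_abs = max(max_abs, abs_diff)
        if b != 0.0:
            max_rel = max(max_rel, 100.0 * abs_diff / abs(b))
    return max_rel, max_abs


# Made-up values in the magnitude range discussed in this issue.
rel_pct, abs_diff = max_differences([1.0e-9, 0.0, 2.0e-10],
                                    [1.1e-9, 0.0, 2.0e-10])
```

With this definition, an element that is zero in one result and nonzero in the reference gives a relative difference of exactly 100%, regardless of how small the absolute difference is.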

I also checked what the difference looks like on a single GPU:

image

As can be seen, most of the significant differences come from the weight and embedding matrices. Although the atol is small (1e-10), the gradients of the transformer are typically in the range of 1e-11 to 1e-7, which is also quite small. So such a small atol can lead to a large difference in optimization behavior with an adaptive gradient optimizer such as Adam.
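The point about Adam can be illustrated with a small, self-contained calculation. The sketch below computes the very first Adam update for a single scalar gradient and compares it with a plain SGD update when the gradient is perturbed by 1e-10. The hyperparameters are the commonly used Adam defaults, not the values used in the transformer training, and the helper names are hypothetical.

```python
import math


def adam_first_update(grad, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Size of the first Adam step for one scalar gradient (bias-corrected)."""
    m = (1.0 - beta1) * grad              # first moment after one step
    v = (1.0 - beta2) * grad * grad       # second moment after one step
    m_hat = m / (1.0 - beta1)             # bias correction at t = 1
    v_hat = v / (1.0 - beta2)
    return lr * m_hat / (math.sqrt(v_hat) + eps)


def sgd_update(grad, lr=1e-3):
    """Size of a plain SGD step for the same gradient."""
    return lr * grad


g_ref = 1.0e-9            # gradient magnitude typical for this model
g_alt = g_ref + 1.0e-10   # same gradient with an atol-sized perturbation

adam_gap = abs(adam_first_update(g_alt) - adam_first_update(g_ref))
sgd_gap = abs(sgd_update(g_alt) - sgd_update(g_ref))
```

Because Adam divides by the square root of the second moment, the update does not shrink with the gradient, so `adam_gap` comes out many orders of magnitude larger than `sgd_gap` for the same perturbation. This is only a single-step, single-parameter illustration; it does not model the full training run.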

Because reproducing the above result with the transformer is computationally expensive, I wrote some small test cases that give similar results. I tested the code on a Mac and on a G4 instance. On my Mac, I also tried disabling multi-threaded processing by using the Naive Engine, and I obtained the same output.

Error Message

Mac:

[[1.9729288e-14 4.5819014e-14 0.0000000e+00 ... 0.0000000e+00 0.0000000e+00 1.0834442e-13]
 [1.6176574e-14 4.5819014e-14 0.0000000e+00 ... 0.0000000e+00 0.0000000e+00 9.4133563e-14]]
<NDArray 2x1000 @cpu(0)>
('dense1_weight', 'rtol:11.1050821841%, atol:1.31473879093e-14')

[[1.45799305e-15 5.47375190e-17 1.03722522e-14 ... 1.08555370e-14 2.21442788e-14 4.62161266e-19]
 [1.56577407e-15 7.24517572e-17 1.39517583e-14 ... 1.20759946e-14 2.59407703e-14 4.74481708e-19]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 ... 0.00000000e+00 0.00000000e+00 0.00000000e+00]
 ...
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 ... 0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 ... 0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [2.78533020e-15 1.28712294e-16 2.14917466e-14 ... 2.03898567e-14 3.85824117e-14 9.28534639e-19]]
<NDArray 1000x10 @cpu(0)>
('dense0_weight', 'rtol:100.0%, atol:1.42072226379e-14')

G4 instance:

[[0.00683938 0.00235045 0.00851564 ... 0.00857326 0.01197887 0.00889575]
 [0.00080452 0.01059473 0.0015173 ... 0.01655126 0.00477865 0.00016829]
 [0.00691246 0.00388029 0.00031795 ... 0.01003232 0.00479008 0.00864812]
 ...
 [0.01424259 0.00398458 0.01044655 ... 0.02097399 0.01090044 0.00375169]
 [0.00423017 0.0020052 0.00448378 ... 0.00450475 0.00027684 0.00431689]
 [0.01112881 0.01310032 0.02486911 ... 0.00068935 0.00403444 0.00529187]]
<NDArray 13x512 @gpu(0)>
embedding0_weight rtol:0.6463079713284969%, atol:7.525086402893066e-07

To Reproduce

https://github.com/szhengac/Grad_Accumulation

Steps to reproduce

  1. In mac folder, run python train.py 0 1000, python train.py 1 1000, and python eval.py
  2. In gpu folder, run python train.py 0 128 0, python train.py 1 128 0, and python eval.py

Environment

We recommend using our script for collecting the diagnostic information. Run the following command and paste the outputs below:

curl --retry 10 -s https://raw.githubusercontent.com/dmlc/gluon-nlp/master/tools/diagnose.py | python

**Mac:**
----------Python Info----------
('Version      :', '2.7.16')
('Compiler     :', 'GCC 4.2.1 Compatible Apple LLVM 10.0.1 (clang-1001.0.37.14)')
('Build        :', ('default', 'Oct 16 2019 00:34:56'))
('Arch         :', ('64bit', ''))
------------Pip Info-----------
('Version      :', '19.2.3')
('Directory    :', '/Library/Python/2.7/site-packages/pip-19.2.3-py2.7.egg/pip')
----------MXNet Info-----------
dyld: warning, LC_RPATH ${ORIGIN} in /Library/Python/2.7/site-packages/mxnet/libmxnet.so being ignored in restricted program because it is a relative path
('Version      :', '1.6.0')
('Directory    :', '/Library/Python/2.7/site-packages/mxnet')
('Num GPUs     :', 0)
('Commit Hash   :', '72b4d9b8261e0783989447ad78a09e8573aee853')
----------System Info----------
('Platform     :', 'Darwin-18.7.0-x86_64-i386-64bit')
('system       :', 'Darwin')
('node         :', 'a483e789dd93.ant.amazon.com')
('release      :', '18.7.0')
('version      :', 'Darwin Kernel Version 18.7.0: Sat Oct 12 00:02:19 PDT 2019; root:xnu-4903.278.12~1/RELEASE_X86_64')
----------Hardware Info----------
('machine      :', 'x86_64')
('processor    :', 'i386')
machdep.cpu.brand_string: Intel(R) Core(TM) i7-8569U CPU @ 2.80GHz
machdep.cpu.features: FPU VME DE PSE TSC MSR PAE MCE CX8 APIC SEP MTRR PGE MCA CMOV PAT PSE36 CLFSH DS ACPI MMX FXSR SSE SSE2 SS HTT TM PBE SSE3 PCLMULQDQ DTES64 MON DSCPL VMX EST TM2 SSSE3 FMA CX16 TPR PDCM SSE4.1 SSE4.2 x2APIC MOVBE POPCNT AES PCID XSAVE OSXSAVE SEGLIM64 TSCTMR AVX1.0 RDRAND F16C
machdep.cpu.leaf7_features: RDWRFSGS TSC_THREAD_OFFSET SGX BMI1 AVX2 SMEP BMI2 ERMS INVPCID FPU_CSDS MPX RDSEED ADX SMAP CLFSOPT IPT MDCLEAR TSXFA IBRS STIBP L1DF SSBD
machdep.cpu.extfeatures: SYSCALL XD 1GBPAGE EM64T LAHF LZCNT PREFETCHW RDTSCP TSCI
----------Network Test----------
Setting timeout: 10
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0349 sec, LOAD: 0.4839 sec.
Timing for D2L: http://d2l.ai, DNS: 0.0353 sec, LOAD: 0.0811 sec.
Timing for FashionMNIST: https://repo.mxnet.io/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.0721 sec, LOAD: 0.1677 sec.
Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.0409 sec, LOAD: 0.1224 sec.
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0286 sec, LOAD: 0.6482 sec.
Timing for GluonNLP: http://gluon-nlp.mxnet.io, DNS: 0.0449 sec, LOAD: 0.2207 sec.
Timing for D2L (zh-cn): http://zh.d2l.ai, DNS: 0.0417 sec, LOAD: 0.1004 sec.
Timing for GluonNLP GitHub: https://github.com/dmlc/gluon-nlp, DNS: 0.0009 sec, LOAD: 0.6126 sec.

**G4 instance:**
----------Python Info----------
('Version      :', '2.7.12')
('Compiler     :', 'GCC 5.4.0 20160609')
('Build        :', ('default', 'Oct  8 2019 14:14:10'))
('Arch         :', ('64bit', 'ELF'))
------------Pip Info-----------
('Version      :', '19.3.1')
('Directory    :', '/usr/local/lib/python2.7/dist-packages/pip')
----------MXNet Info-----------
No MXNet installed.
----------System Info----------
('Platform     :', 'Linux-4.4.0-1096-aws-x86_64-with-Ubuntu-16.04-xenial')
('system       :', 'Linux')
('node         :', 'ip-172-31-0-43')
('release      :', '4.4.0-1096-aws')
('version      :', '#107-Ubuntu SMP Thu Oct 3 01:51:58 UTC 2019')
----------Hardware Info----------
('machine      :', 'x86_64')
('processor    :', 'x86_64')
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                4
On-line CPU(s) list:   0-3
Thread(s) per core:    2
Core(s) per socket:    2
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 85
Model name:            Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
Stepping:              7
CPU MHz:               2499.998
BogoMIPS:              4999.99
Hypervisor vendor:     KVM
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              1024K
L3 cache:              36608K
NUMA node0 CPU(s):     0-3
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single kaiser fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx avx512f rdseed adx smap clflushopt clwb avx512cd xsaveopt xsavec xgetbv1 ida arat pku
----------Network Test----------
Setting timeout: 10
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0012 sec, LOAD: 0.3014 sec.
Timing for D2L: http://d2l.ai, DNS: 0.0281 sec, LOAD: 0.2798 sec.
Timing for FashionMNIST: https://repo.mxnet.io/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.0896 sec, LOAD: 0.2237 sec.
Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.0133 sec, LOAD: 0.0610 sec.
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0008 sec, LOAD: 0.5152 sec.
Timing for GluonNLP: http://gluon-nlp.mxnet.io, DNS: 0.0292 sec, LOAD: 0.2377 sec.
Timing for D2L (zh-cn): http://zh.d2l.ai, DNS: 0.0300 sec, LOAD: 0.0639 sec.
Timing for GluonNLP GitHub: https://github.com/dmlc/gluon-nlp, DNS: 0.0002 sec, LOAD: 0.3814 sec.
pengzhao-intel commented 5 years ago

@zixuanweeei any same issue on CPU side?

zixuanweeei commented 5 years ago

> @zixuanweeei any same issue on CPU side?

CPU (both w/ and w/o MKL-DNN) does have this issue. I will take a look.

samskalicky commented 5 years ago

@zachgk assign @szha @eric-haibin-lin any ideas? similar to your divergence issues?

eric-haibin-lin commented 5 years ago

@samskalicky this issue was found while we were debugging the BERT divergence issue.