dmlc / gluon-nlp

NLP made easy
https://nlp.gluon.ai/
Apache License 2.0

Multi GPU context not updated #982

Open kristjanArumae opened 5 years ago

kristjanArumae commented 5 years ago

Description

I've implemented multi-GPU support (which works fine). But when I add the following:

trainer.allreduce_grads()
nlp.utils.clip_grad_global_norm(params, 1)
trainer.update(args.accumulate if args.accumulate else 1)

I get a warning which terminates the program.

The BERT-specific training loop code I added is largely taken from here: https://github.com/dmlc/gluon-nlp/blob/master/scripts/bert/finetune_classifier.py

I am not excluding any params from training, and when I switch back to a single GPU the problem goes away.
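For context, a rough sketch of the multi-GPU step around those three lines, following the general pattern in the fine-tuning script. Names like model, loss_fn, ctx_list, train_dataloader are placeholders, and each batch is assumed to be a (data, label) pair; this is not the exact script code.

import mxnet as mx
import gluonnlp as nlp

for batch_id, batch in enumerate(train_dataloader):
    # slice the batch across the available GPUs
    data_list = mx.gluon.utils.split_and_load(batch[0], ctx_list, even_split=False)
    label_list = mx.gluon.utils.split_and_load(batch[1], ctx_list, even_split=False)
    losses = []
    with mx.autograd.record():
        for data, label in zip(data_list, label_list):
            out = model(data)
            losses.append(loss_fn(out, label).mean())
    for l in losses:
        l.backward()
    # the three lines that trigger the warning on multi-GPU
    trainer.allreduce_grads()
    nlp.utils.clip_grad_global_norm(params, 1)
    trainer.update(args.accumulate if args.accumulate else 1)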

Error Message

File "/home/code/src/train/train.py", line 266, in <module>
    train_model(args)
  File "/home/code/src/train/train.py", line 194, in train_model
    trainer.update(args.accumulate if args.accumulate else 1)
  File "/home/anaconda3/lib/python3.7/site-packages/mxnet/gluon/trainer.py", line 397, in update
    self._update(ignore_stale_grad)
  File "/home/anaconda3/lib/python3.7/site-packages/mxnet/gluon/trainer.py", line 416, in _update
    %(param.name, str(data.context)))
UserWarning: Gradient of Parameter `bertencoder0_position_weight` on context gpu(5) has not been updated by backward since last `step`. This could mean a bug in your model that made it only use a subset of the Parameters (Blocks) for this iteration. If you are intentionally only using a subset, call step with ignore_stale_grad=True to suppress this warning and skip updating of Parameters with stale gradient

What have you tried to solve it?

  1. Searched for the warning elsewhere and found threads that did not resolve the issue, such as this one: https://github.com/zackchase/mxnet-the-straight-dope/issues/348

Environment

----------Python Info----------
Version : 3.7.3
Compiler : GCC 7.3.0
Build : ('default', 'Mar 27 2019 22:11:17')
Arch : ('64bit', '')
------------Pip Info-----------
Version : 19.1.1
Directory : /home/anaconda3/lib/python3.7/site-packages/pip
----------MXNet Info-----------
Version : 1.5.1
Directory : /home/anaconda3/lib/python3.7/site-packages/mxnet
Num GPUs : 8
Commit Hash : c9818480680f84daa6e281a974ab263691302ba8
----------System Info----------
Platform : Linux-4.4.0-1092-aws-x86_64-with-debian-stretch-sid
system : Linux
node : ip-172-31-30-6
release : 4.4.0-1092-aws
version : #103-Ubuntu SMP Tue Aug 27 10:21:48 UTC 2019
----------Hardware Info----------
machine : x86_64
processor : x86_64
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 64
On-line CPU(s) list: 0-63
Thread(s) per core: 2
Core(s) per socket: 16
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 79
Model name: Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
Stepping: 1
CPU MHz: 2699.984
CPU max MHz: 3000.0000
CPU min MHz: 1200.0000
BogoMIPS: 4600.07
Hypervisor vendor: Xen
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 46080K
NUMA node0 CPU(s): 0-15,32-47
NUMA node1 CPU(s): 16-31,48-63
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq monitor est ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx xsaveopt ida

eric-haibin-lin commented 5 years ago

I found that this happens when the last batch of the dataset has fewer samples (e.g. 3) than the number of GPUs (e.g. 4). Some GPUs then receive no data for that batch, so they complain that the gradient is stale.
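A hedged illustration of that failure mode (4 GPUs and a final batch of 3 samples are assumptions for the example; the index-based slicing is deliberately simplified and is not the script's exact code):

import mxnet as mx

ctx_list = [mx.gpu(i) for i in range(4)]   # assumption: 4 GPUs
last_batch_size = 3                        # final batch smaller than the GPU count

# distribute 3 sample indices over 4 contexts: one context ends up empty
per_ctx = [list(range(i, last_batch_size, len(ctx_list))) for i in range(len(ctx_list))]
# per_ctx == [[0], [1], [2], []]
for ctx, idx in zip(ctx_list, per_ctx):
    if not idx:
        # no forward/backward runs on this GPU, so its gradient copy is never
        # written and trainer.update() later reports it as stale
        continue
    # forward/backward for the samples in idx on ctx would run here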

eric-haibin-lin commented 5 years ago

This is an issue on the MXNet side. I'll look into the fix.

eric-haibin-lin commented 4 years ago

@sxjscience DataLoader does not support last_batch='rollover' if a batch_sampler is set. To skip the final batch, either the BucketSampler needs to support the last_batch argument, or users have to skip that batch manually in the training loop.
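A minimal sketch of the manual skip, assuming each batch is a (data, label, ...) tuple and train_dataloader / ctx_list are the user's own objects:

num_gpus = len(ctx_list)
for batch in train_dataloader:
    # drop any batch that cannot give every GPU at least one sample
    # (in practice only the last, short batch)
    if batch[0].shape[0] < num_gpus:
        continue
    # ... split across GPUs, forward/backward, allreduce, clip, update ...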

sxjscience commented 4 years ago

We should add the flag to the BucketSampler.
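For reference, a sketch of the combination being discussed (argument names as in GluonNLP/MXNet at the time; dataset and the length computation are placeholders). Because a batch_sampler is passed, DataLoader does not also accept last_batch, so dropping the short final batch has to happen in the sampler or in the training loop:

import gluonnlp as nlp
from mxnet.gluon.data import DataLoader

lengths = [len(sample) for sample in dataset]        # per-sample lengths (placeholder)
sampler = nlp.data.FixedBucketSampler(lengths, batch_size=32, shuffle=True)
loader = DataLoader(dataset, batch_sampler=sampler)  # last_batch cannot be combined with batch_sampler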

djaym7 commented 4 years ago

Same problem, any fix yet?

sxjscience commented 4 years ago

@djaym7 You may try to ignore the stale gradient as in https://github.com/dmlc/gluon-nlp/blob/982a4164c35400a3fa5d5d7642915306ca3a7fd1/scripts/question_answering/run_squad.py#L495
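A minimal sketch of that workaround applied to the snippet from the issue description, in the spirit of the linked run_squad.py line:

trainer.allreduce_grads()
nlp.utils.clip_grad_global_norm(params, 1)
# ignore_stale_grad=True skips parameters whose gradient was not produced on
# some context (e.g. a GPU that received no samples in the last batch)
trainer.update(args.accumulate if args.accumulate else 1, ignore_stale_grad=True)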