Open kristjanArumae opened 5 years ago
I found that this happens when the last batch of the dataset has fewer samples (e.g. 3) than the number of GPUs (e.g. 4). Some GPUs then receive no data, so they complain that the gradient is stale.
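To make the arithmetic concrete, here is a minimal sketch of how a short final batch leaves a device without data. The helper `slice_sizes` is hypothetical and only assumes that samples are spread as evenly as possible across devices; it is not MXNet's own splitting code.

```python
def slice_sizes(num_samples, num_devices):
    """Return how many samples each device gets under an even-as-possible split.

    Hypothetical helper for illustration; not part of MXNet or GluonNLP.
    """
    base, extra = divmod(num_samples, num_devices)
    # The first `extra` devices get one additional sample each.
    return [base + (1 if i < extra else 0) for i in range(num_devices)]


# Last batch of 3 samples on 4 GPUs, as in the comment above.
sizes = slice_sizes(3, 4)
# Devices with zero samples run no forward/backward pass in this step,
# so their gradients are not refreshed.
idle_devices = [i for i, n in enumerate(sizes) if n == 0]
print(sizes, idle_devices)
```

However the samples are assigned, 3 samples cannot cover 4 devices, so at least one device has nothing to compute in that step.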
This is an issue on the MXNet side. I'll look into a fix.
@sxjscience DataLoader does not support last_batch = rollover if the batch_sampler is set. To skip this final batch, either the BucketSampler needs to support the last_batch argument, or users have to manually skip that batch in the training loop.
We should add the flag to the BucketSampler.
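Until such a flag exists, the manual skip described above could be sketched as a thin wrapper around the batch sampler. The class name `SkipSmallBatches` and its `min_size` parameter are hypothetical; the sketch assumes only that the wrapped sampler can be iterated more than once and yields lists of sample indices, and that the wrapper is then passed wherever the original batch sampler was used.

```python
class SkipSmallBatches:
    """Wrap a batch sampler and drop batches with fewer than `min_size` samples.

    Hypothetical helper for illustration; not part of MXNet or GluonNLP.
    """

    def __init__(self, batch_sampler, min_size):
        self._batch_sampler = batch_sampler
        self._min_size = min_size

    def __iter__(self):
        for batch in self._batch_sampler:
            # Keep only batches that can give every device at least one sample.
            if len(batch) >= self._min_size:
                yield batch

    def __len__(self):
        # Count the batches that survive the filter (re-iterates the sampler).
        return sum(1 for batch in self._batch_sampler
                   if len(batch) >= self._min_size)


# Stand-in for a real sampler: 11 samples, batch size 4, so the last batch has 3.
batches = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10]]
num_gpus = 4
kept = list(SkipSmallBatches(batches, min_size=num_gpus))
print(kept)
```

Setting `min_size` to the number of GPUs drops only batches that cannot cover every device. The same check (`len(batch) < num_gpus`, then `continue`) can instead be placed at the top of the training loop if wrapping the sampler is not convenient.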
Same problem, any fix yet?
@djaym7 You may try to ignore the stale gradient as in https://github.com/dmlc/gluon-nlp/blob/982a4164c35400a3fa5d5d7642915306ca3a7fd1/scripts/question_answering/run_squad.py#L495
Description
I've implemented multi-GPU support (which works fine). But when I add the following:
I get a warning which terminates the program.
The BERT specific training loop code I added is largely taken from here: https://github.com/dmlc/gluon-nlp/blob/master/scripts/bert/finetune_classifier.py
I am not excluding any params from training, and when I switch back to single GPU this problem goes away.
Error Message
What have you tried to solve it?
Environment
----------Python Info----------
Version : 3.7.3
Compiler : GCC 7.3.0
Build : ('default', 'Mar 27 2019 22:11:17')
Arch : ('64bit', '')
------------Pip Info-----------
Version : 19.1.1
Directory : /home/anaconda3/lib/python3.7/site-packages/pip
----------MXNet Info-----------
Version : 1.5.1
Directory : /home/anaconda3/lib/python3.7/site-packages/mxnet
Num GPUs : 8
Commit Hash : c9818480680f84daa6e281a974ab263691302ba8
----------System Info----------
Platform : Linux-4.4.0-1092-aws-x86_64-with-debian-stretch-sid
system : Linux
node : ip-172-31-30-6
release : 4.4.0-1092-aws
version : #103-Ubuntu SMP Tue Aug 27 10:21:48 UTC 2019
----------Hardware Info----------
machine : x86_64
processor : x86_64
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 64
On-line CPU(s) list: 0-63
Thread(s) per core: 2
Core(s) per socket: 16
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 79
Model name: Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
Stepping: 1
CPU MHz: 2699.984
CPU max MHz: 3000.0000
CPU min MHz: 1200.0000
BogoMIPS: 4600.07
Hypervisor vendor: Xen
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 46080K
NUMA node0 CPU(s): 0-15,32-47
NUMA node1 CPU(s): 16-31,48-63
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq monitor est ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx xsaveopt ida