hadyelsahar opened this issue 5 years ago
Interesting, I haven't noticed anything like this before... did you ever figure it out?
Yes
I later found that it was caused by an unhandled OOM error, triggered mainly by a spike in GPU memory utilization that is not the norm.
This wasn't detectable on Grafana because it averages GPU memory over one-minute windows, so on average GPU memory utilization was indeed below 100%; you only see the spikes if you track the per-update peak (sketch below).
What should concern you is that fairseq failed silently and just froze for hours without any errors. Let me know if you are interested in reproducing it or investigating further.
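For reference, this is roughly what I mean by tracking the peak rather than the average. It is a tiny helper of my own (not part of fairseq; the function name and device index are just for illustration) that you could call once per update or every N updates:

```python
# A small logging helper (my own sketch, not fairseq code): record the *peak*
# GPU allocation between updates, since a per-minute Grafana average hides
# exactly these short spikes.
import torch


def log_peak_gpu_mem(step, device=0):
    peak_mib = torch.cuda.max_memory_allocated(device) / (1024 ** 2)
    print(f"update {step}: peak GPU mem {peak_mib:.0f} MiB")
    # reset so the next interval reports its own peak
    # (on PyTorch 1.2/1.3 the call is torch.cuda.reset_max_memory_allocated)
    torch.cuda.reset_peak_memory_stats(device)
```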
Yes, that is concerning. Any idea where the OOM occurred? If not, I can try simulating some OOMs on individual workers at various points (OOM in a middle train_step with update_freq > 1, OOM in the optimizer, etc.).
Any idea where the OOM occurred?
Yes it is "OOM in a middle train_step with update_freq > 5
It seems that this usually happens when an OOM occurs in only some of the workers, not in all of them:
When OOM happens in all of the workers, you usually get a warning like this:
| epoch 001: 0%|▏
| WARNING: OOM in all workers, skipping update
When only some of the workers hit OOM, it usually freezes forever (the sketch at the end of this comment shows why the other workers hang).
To reproduce this error on a single machine with multiple GPUs, try setting --max-tokens
not too high, but right at the limit where your GPU memory hits 100%. That way OOM may happen on some workers but not on others.
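To illustrate why the partial-OOM case freezes instead of erroring out, here is a minimal standalone sketch using plain torch.distributed (not fairseq code; the gloo backend, port, and rank numbers are just for the demo, and the script hangs on purpose):

```python
# A minimal sketch (plain torch.distributed, NOT fairseq code) of the
# partial-OOM freeze: the rank that "OOMs" skips the gradient all_reduce,
# and the other rank blocks in that collective forever.
# NOTE: this script hangs on purpose; stop it with Ctrl-C.
import os
import time
import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def worker(rank, world_size):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    # gloo backend so the sketch also runs on a CPU-only machine
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    grads = torch.ones(4)
    try:
        if rank == 0:
            # pretend this rank ran out of memory in the middle of a train_step
            raise RuntimeError("CUDA out of memory (simulated)")
        # the ranks that did not OOM still enter the collective ...
        dist.all_reduce(grads)  # ... and wait here forever for rank 0
        print(f"rank {rank}: all_reduce finished")
    except RuntimeError as e:
        # this rank silently skips the update; nothing tells the other ranks
        print(f"rank {rank}: skipped update ({e})")
        time.sleep(3600)  # stay alive, like a worker waiting at a later sync point


if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2, join=True)
```

Rank 0 skips its update and keeps running, while rank 1 blocks inside all_reduce waiting for a peer that never arrives, which is exactly the silent freeze described above.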
How did you solve your problem? I'm having the same issue. Thanks!
It is basically an OOM error with multi-GPU training; reducing the batch size will solve the issue.
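For context, the usual single-process workaround looks roughly like the sketch below (generic PyTorch, not fairseq internals; safe_train_step is a made-up helper name). The catch is that in multi-GPU training every worker has to take the same skip-or-reduce decision, which is exactly what breaks in the partial-OOM case above:

```python
# A generic sketch of the per-step OOM workaround (plain PyTorch, not fairseq
# internals): catch the CUDA OOM inside the train step, free cached blocks,
# and let the caller retry with a smaller batch / lower --max-tokens.
import torch


def safe_train_step(model, criterion, optimizer, x, y):
    """Returns the loss value, or None if the step was skipped due to OOM."""
    try:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
        return loss.item()
    except RuntimeError as e:
        if "out of memory" not in str(e):
            raise
        optimizer.zero_grad()      # drop any partial gradients
        torch.cuda.empty_cache()   # release cached blocks back to the allocator
        return None                # caller shrinks the batch and retries
```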
I am currently facing much the same problem as in https://github.com/pytorch/fairseq/issues/708 when using multi-GPU training.
Training halts forever (8+ hrs): GPU utilization goes up to 100% while memory is free, and power consumption does not reflect the reported GPU utilization.
Here are my thoughts:
I doubt it has anything to do with eval or data loading, since it halts in the first epoch and the data is already loaded into memory.
I ran the same job twice with the same fixed seed on the same Tesla V100, and they halted at two different update steps.
I am using the no_c10d backend and fp16.
Here are the last lines of my log file for one of the halted jobs.
Here's the command I use for running training:
The issue is reproducible on both PyTorch 1.2 and 1.3, using Python 3.7.4 and the latest fairseq.
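In case it helps anyone hitting the same freeze: a cheap way to see where a stuck worker is blocked is to register a faulthandler stack dump at the top of the training script (my own suggestion, independent of fairseq):

```python
# Diagnostic sketch: dump every thread's Python stack when the process
# receives SIGUSR1, so you can see which call a frozen worker is blocked in.
import faulthandler
import signal

faulthandler.register(signal.SIGUSR1, all_threads=True)

# Then, while the job is hung, from another shell:
#   kill -USR1 <pid of the stuck worker>
# The dumped traceback shows the Python frame the worker is sitting in
# (e.g. a distributed collective or a data-loading call).
```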