hadyelsahar opened this issue 5 years ago
Interesting, I haven't noticed anything like this before... did you ever figure it out?
Yes
I later found that it was caused by an unhandled OOM error, triggered mainly by a spike in GPU memory utilization that is not the norm.
This wasn't detectable on Grafana because it averages GPU memory over one-minute windows, so on average GPU memory utilization was indeed below 100%; you only see the spikes if you track the per-update peak (sketch below).
What should concern you is that fairseq failed silently and just froze for hours without any errors. Let me know if you are interested in reproducing it or investigating further.
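For reference, this is roughly what I mean by tracking the peak rather than the average. It is a tiny helper of my own (not part of fairseq; the function name and device index are just for illustration) that you could call once per update or every N updates:

```python
# A small logging helper (my own sketch, not fairseq code): record the *peak*
# GPU allocation between updates, since a per-minute Grafana average hides
# exactly these short spikes.
import torch


def log_peak_gpu_mem(step, device=0):
    peak_mib = torch.cuda.max_memory_allocated(device) / (1024 ** 2)
    print(f"update {step}: peak GPU mem {peak_mib:.0f} MiB")
    # reset so the next interval reports its own peak
    # (on PyTorch 1.2/1.3 the call is torch.cuda.reset_max_memory_allocated)
    torch.cuda.reset_peak_memory_stats(device)
```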
Yes, that is concerning. Any idea where the OOM occurred? If not, I can try simulating some OOMs on individual workers at various points (OOM in a middle train_step with update_freq > 1, OOM in the optimizer, etc.).
Any idea where the OOM occurred?
Yes it is "OOM in a middle train_step with update_freq > 5
It seems that this usually happens when an OOM occurs in only some of the workers, not in all of them:
When OOM happens in all of the workers, you usually get a warning like this:
| epoch 001: 0%|▏
| WARNING: OOM in all workers, skipping update
When only some of the workers hit OOM, it usually freezes forever (the sketch at the end of this comment shows why the other workers hang).
To reproduce this error on a single machine with multiple GPUs, try setting --max-tokens
not too high, but right at the limit where your GPU memory hits 100%. That way OOM may happen on some workers but not on others.
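To illustrate why the partial-OOM case freezes instead of erroring out, here is a minimal standalone sketch using plain torch.distributed (not fairseq code; the gloo backend, port, and rank numbers are just for the demo, and the script hangs on purpose):

```python
# A minimal sketch (plain torch.distributed, NOT fairseq code) of the
# partial-OOM freeze: the rank that "OOMs" skips the gradient all_reduce,
# and the other rank blocks in that collective forever.
# NOTE: this script hangs on purpose; stop it with Ctrl-C.
import os
import time
import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def worker(rank, world_size):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    # gloo backend so the sketch also runs on a CPU-only machine
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    grads = torch.ones(4)
    try:
        if rank == 0:
            # pretend this rank ran out of memory in the middle of a train_step
            raise RuntimeError("CUDA out of memory (simulated)")
        # the ranks that did not OOM still enter the collective ...
        dist.all_reduce(grads)  # ... and wait here forever for rank 0
        print(f"rank {rank}: all_reduce finished")
    except RuntimeError as e:
        # this rank silently skips the update; nothing tells the other ranks
        print(f"rank {rank}: skipped update ({e})")
        time.sleep(3600)  # stay alive, like a worker waiting at a later sync point


if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2, join=True)
```

Rank 0 skips its update and keeps running, while rank 1 blocks inside all_reduce waiting for a peer that never arrives, which is exactly the silent freeze described above.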
How did you solve your problem? I'm having the same issue. Thanks!
It is basically an OOM error with multi-GPU training; reducing the batch size will solve the issue.
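For context, the usual single-process workaround looks roughly like the sketch below (generic PyTorch, not fairseq internals; safe_train_step is a made-up helper name). The catch is that in multi-GPU training every worker has to take the same skip-or-reduce decision, which is exactly what breaks in the partial-OOM case above:

```python
# A generic sketch of the per-step OOM workaround (plain PyTorch, not fairseq
# internals): catch the CUDA OOM inside the train step, free cached blocks,
# and let the caller retry with a smaller batch / lower --max-tokens.
import torch


def safe_train_step(model, criterion, optimizer, x, y):
    """Returns the loss value, or None if the step was skipped due to OOM."""
    try:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
        return loss.item()
    except RuntimeError as e:
        if "out of memory" not in str(e):
            raise
        optimizer.zero_grad()      # drop any partial gradients
        torch.cuda.empty_cache()   # release cached blocks back to the allocator
        return None                # caller shrinks the batch and retries
```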
I am currently facing much the same problem as in https://github.com/pytorch/fairseq/issues/708 when using multi-GPU training.
Training halts forever (8+ hrs): GPU utilization goes up to 100% while memory is free, and power consumption does not reflect the reported GPU utilization.
Here are my thoughts:
I doubt it has anything to do with eval or data loading, since it halts in the first epoch and the data is already loaded into memory.
I ran the same job twice with the same fixed seed on the same Tesla V100, and they halted at two different update steps.
I am using the no_c10d backend and fp16.
Here are the last lines of my log file for one of the halted jobs.
Here's the command I use for running training:
The issue is reproducible on both PyTorch 1.2 and 1.3, using Python 3.7.4 and the latest fairseq.
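In case it helps anyone hitting the same freeze: a cheap way to see where a stuck worker is blocked is to register a faulthandler stack dump at the top of the training script (my own suggestion, independent of fairseq):

```python
# Diagnostic sketch: dump every thread's Python stack when the process
# receives SIGUSR1, so you can see which call a frozen worker is blocked in.
import faulthandler
import signal

faulthandler.register(signal.SIGUSR1, all_threads=True)

# Then, while the job is hung, from another shell:
#   kill -USR1 <pid of the stuck worker>
# The dumped traceback shows the Python frame the worker is sitting in
# (e.g. a distributed collective or a data-loading call).
```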