
Finetuning fp32 model in fp16 mode can lead to (many) dropped batches even with `--fp16-scale-tolerance=0` #2697

Open munael opened 4 years ago

munael commented 4 years ago

🐛 Bug

To Reproduce

Steps to reproduce the behavior (always include the command you ran):

I don't have a small enough MWE...

  1. Run any finetuning command on a model originally trained in fp32 mode, but include:
    1. `--fp16`
    2. `--fp16-scale-tolerance 0`
  2. You should see batches being dropped.
  3. To debug, add a print or debug hook in the else branch here: https://github.com/pytorch/fairseq/blob/5e82514d687289a73a6dec33b555217acd97cb0d/fairseq_cli/train.py#L210-L219 (a sketch of such a hook is below).
  4. With some luck, you'll drop every batch in an epoch and training will fail with an undefined-variable error (the variable is only assigned in a branch of the loop that is never taken).
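
For step 3, this is roughly the kind of hook I mean. It's only a sketch of the shape of the training loop, not the actual fairseq source; `progress`, `trainer`, and the "returns `None` on overflow" convention are assumptions for illustration:

```python
# Rough sketch only: approximates the training loop's shape, not fairseq's actual code.
# Assumes trainer.train_step() returns None when the update is skipped (e.g. fp16 overflow).
dropped, total = 0, 0
for i, samples in enumerate(progress):
    log_output = trainer.train_step(samples)
    total += 1
    if log_output is None:
        # This is the branch that silently swallows the batch.
        dropped += 1
        print(f"[debug] dropped batch {i} (running total: {dropped}/{total})")
        continue
    # ... normal logging / checkpointing path ...
```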

Code sample

:(

Expected behavior

  1. Overflow failures that drop entire batches shouldn't be silently ignored.
  2. Better handling of model conversion from fp32 to fp16 (a sketch of the kind of explicit conversion I mean is below).
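
For point 2, even a one-off explicit conversion of the fp32 checkpoint before finetuning would help. A minimal sketch, assuming the checkpoint is a plain dict with a `"model"` state dict (the key names and file names here are assumptions, not fairseq's checkpoint format guarantees):

```python
# Minimal sketch, not fairseq's API: cast a fp32 checkpoint's floating-point
# weights to fp16 once, up front, instead of relying on runtime conversion.
import torch

ckpt = torch.load("checkpoint_fp32.pt", map_location="cpu")
state = ckpt["model"]  # assumed key for the model state dict
ckpt["model"] = {
    k: v.half() if torch.is_tensor(v) and v.is_floating_point() else v
    for k, v in state.items()
}
torch.save(ckpt, "checkpoint_fp16.pt")
```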

Environment

Additional context

myleott commented 4 years ago

Can you try using fairseq master instead of 0.9.0?

Anyway, this is most likely model/dataset dependent, but I will try this a bit later and see what’s going on.

To confirm, what kind of task/model is this?

munael commented 4 years ago

I'll check.

It does seem to depend on the model/dataset. But is the batch-skipping behavior actually intentional? Why? Why not (for example) retry the same batch with an updated loss scale until it succeeds?
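
Roughly what I have in mind, just as a sketch (the helper names and the scale handling are mine, not fairseq's `DynamicLossScaler` API):

```python
# Sketch of the suggestion: on fp16 overflow, halve the loss scale and retry
# the *same* batch instead of dropping it.
import torch


def grads_have_overflow(params):
    # Stand-in for an overflow check: any inf/NaN gradient counts as overflow.
    for p in params:
        if p.grad is not None and not torch.isfinite(p.grad).all():
            return True
    return False


def train_step_with_retry(model, optimizer, batch, loss_fn, scale=128.0, min_scale=2 ** -5):
    """Retry the same batch with a smaller loss scale on overflow, instead of skipping it."""
    while True:
        optimizer.zero_grad()
        loss = loss_fn(model, batch)
        (loss * scale).backward()
        if not grads_have_overflow(model.parameters()):
            # Un-scale gradients before the optimizer step.
            for p in model.parameters():
                if p.grad is not None:
                    p.grad.div_(scale)
            optimizer.step()
            return loss.detach(), scale
        scale /= 2  # shrink the scale and retry the same batch
        if scale < min_scale:
            raise FloatingPointError("loss scale fell below minimum; likely divergence")
```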

Translation with pre-trained transformer_vaswani_wmt_en_de_big (transformer.wmt14.en-fr) from here (https://github.com/pytorch/fairseq/tree/master/examples/translation#pre-trained-models).

munael commented 3 years ago

@myleott This still occurs with the latest master :/