
Finetuning fp32 model in fp16 mode can lead to (many) dropped batches even with `--fp16-scale-tolerance=0` #2697

Open munael opened 4 years ago

munael commented 4 years ago

🐛 Bug

To Reproduce

Steps to reproduce the behavior (always include the command you ran):

I don't have a small enough MWE...

  1. Run any finetuning command on a model originally trained in fp32 mode, but include:
    1. `--fp16`
    2. `--fp16-scale-tolerance 0`
  2. You should see batches being dropped.
  3. To debug, add a print or debug hook in the else branch here: https://github.com/pytorch/fairseq/blob/5e82514d687289a73a6dec33b555217acd97cb0d/fairseq_cli/train.py#L210-L219 (a sketch of such a hook is below).
  4. With some luck, you'll drop every batch in an epoch and training will fail with an undefined-variable error (the variable is only assigned in a branch of the loop that is never taken).
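
For step 3, this is roughly the kind of hook I mean. It's only a sketch of the shape of the training loop, not the actual fairseq source; `progress`, `trainer`, and the "returns `None` on overflow" convention are assumptions for illustration:

```python
# Rough sketch only: approximates the training loop's shape, not fairseq's actual code.
# Assumes trainer.train_step() returns None when the update is skipped (e.g. fp16 overflow).
dropped, total = 0, 0
for i, samples in enumerate(progress):
    log_output = trainer.train_step(samples)
    total += 1
    if log_output is None:
        # This is the branch that silently swallows the batch.
        dropped += 1
        print(f"[debug] dropped batch {i} (running total: {dropped}/{total})")
        continue
    # ... normal logging / checkpointing path ...
```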

Code sample

:(

Expected behavior

  1. Overflow failures that drop entire batches shouldn't be silently ignored.
  2. Better handling of model conversion from fp32 to fp16 (a sketch of the kind of explicit conversion I mean is below).
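
For point 2, even a one-off explicit conversion of the fp32 checkpoint before finetuning would help. A minimal sketch, assuming the checkpoint is a plain dict with a `"model"` state dict (the key names and file names here are assumptions, not fairseq's checkpoint format guarantees):

```python
# Minimal sketch, not fairseq's API: cast a fp32 checkpoint's floating-point
# weights to fp16 once, up front, instead of relying on runtime conversion.
import torch

ckpt = torch.load("checkpoint_fp32.pt", map_location="cpu")
state = ckpt["model"]  # assumed key for the model state dict
ckpt["model"] = {
    k: v.half() if torch.is_tensor(v) and v.is_floating_point() else v
    for k, v in state.items()
}
torch.save(ckpt, "checkpoint_fp16.pt")
```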

Environment

Additional context

myleott commented 4 years ago

Can you try using fairseq master instead of 0.9.0?

Anyway, this is most likely model/dataset dependent, but I will try this a bit later and see what’s going on.

To confirm, what kind of task/model is this?

munael commented 4 years ago

I'll check.

It does seem to depend on the model/dataset. But is the batch-skipping behavior actually intentional? Why? Why not (for example) retry the same batch with an updated loss scale until it succeeds?
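
Roughly what I have in mind, just as a sketch (the helper names and the scale handling are mine, not fairseq's `DynamicLossScaler` API):

```python
# Sketch of the suggestion: on fp16 overflow, halve the loss scale and retry
# the *same* batch instead of dropping it.
import torch


def grads_have_overflow(params):
    # Stand-in for an overflow check: any inf/NaN gradient counts as overflow.
    for p in params:
        if p.grad is not None and not torch.isfinite(p.grad).all():
            return True
    return False


def train_step_with_retry(model, optimizer, batch, loss_fn, scale=128.0, min_scale=2 ** -5):
    """Retry the same batch with a smaller loss scale on overflow, instead of skipping it."""
    while True:
        optimizer.zero_grad()
        loss = loss_fn(model, batch)
        (loss * scale).backward()
        if not grads_have_overflow(model.parameters()):
            # Un-scale gradients before the optimizer step.
            for p in model.parameters():
                if p.grad is not None:
                    p.grad.div_(scale)
            optimizer.step()
            return loss.detach(), scale
        scale /= 2  # shrink the scale and retry the same batch
        if scale < min_scale:
            raise FloatingPointError("loss scale fell below minimum; likely divergence")
```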

Translation with pre-trained transformer_vaswani_wmt_en_de_big (transformer.wmt14.en-fr) from here (https://github.com/pytorch/fairseq/tree/master/examples/translation#pre-trained-models).

munael commented 3 years ago

@myleott This still occurs with the latest master :/