FSDP is incompatible with BF16 #714

Closed weigao266 closed 1 year ago

weigao266 commented 1 year ago

🐛 Bug

I am training a standard transformer model on multiple GPUs with metaseq-train. --ddp-backend=fully_sharded and --bf16 each work well individually, but they are incompatible with each other. Is this expected for some reason, or is it just not supported yet?

To Reproduce

Steps to reproduce the behavior (always include the command you ran):

  1. To reproduce the error, run the following command on a cluster:
    OMP_NUM_THREADS=20 CUDA_VISIBLE_DEVICES=4,5,6,7 \
    metaseq-train --task language_modeling \
    $DATA_DIR \
    --vocab-filename $VOCAB_FILE \
    --merges-filename $MERGES_FILE \
    --save-dir checkpoints/$prefix/${ARCH} \
    --arch $ARCH --share-decoder-input-output-embed --dropout 0.1 \
    --ddp-backend=fully_sharded --checkpoint-activations --bf16 \
    --clip-norm $CLIP_NORM \
    --optimizer adam --adam-betas '(0.9, 0.98)' --weight-decay $decay \
    --lr $LR --lr-scheduler inverse_sqrt --warmup-updates $WARM_UP --warmup-init-lr 1e-08 \
    --tokens-per-sample $TOKENS_PER_SAMPLE --sample-break-mode none \
    --max-tokens $MAX_TOKEN --update-freq $UPDATE_FREQ \
    --batch-size $BATCH_SIZE \
    --max-update $MAX_UPDATE --log-format json --log-interval 1 2>&1 | tee -a $LOG_FILE
  2. You will see the following error:
2023-04-27 11:00:12 | INFO | metaseq.tasks.base_task | Starting backward pass
2023-04-27 11:00:12 | INFO | metaseq.tasks.base_task | Finished first backward pass
Traceback (most recent call last):
  File "/nvme/miniconda3/envs/tnn/bin/metaseq-train", line 8, in <module>
    sys.exit(cli_main())
  File "/nvme/metaseq/metaseq/cli/train.py", line 776, in cli_main
    distributed_utils.call_main(cfg, main)
  File "/nvme/metaseq/metaseq/distributed/utils.py", line 287, in call_main
    return _spawn_helper(main, cfg, kwargs)
  File "/nvme/metaseq/metaseq/distributed/utils.py", line 265, in _spawn_helper
    retval = distributed_main(-1, main, cfg, kwargs)
  File "/nvme/metaseq/metaseq/distributed/utils.py", line 227, in distributed_main
    retval = main(cfg, **kwargs)
  File "/nvme/metaseq/metaseq/cli/train.py", line 180, in main
    valid_losses, should_stop = train(cfg, trainer, task, epoch_itr)
  File "/nvme/miniconda3/envs/tnn/lib/python3.9/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/nvme/metaseq/metaseq/cli/train.py", line 379, in train
    valid_losses, should_stop = train(i, samples)
  File "/nvme/metaseq/metaseq/cli/train.py", line 274, in train
    log_output = trainer.train_step(samples)
  File "/nvme/miniconda3/envs/tnn/lib/python3.9/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/nvme/metaseq/metaseq/trainer.py", line 794, in train_step
    self._check_grad_norms(grad_norm)
  File "/nvme/metaseq/metaseq/trainer.py", line 1226, in _check_grad_norms
    raise FloatingPointError(
FloatingPointError: Fatal error: gradients are inconsistent between workers. Try --ddp-backend=legacy_ddp. Or are you mixing up different generation of GPUs in training?
--------------------------------------------------------------------------------
grad_norm across the workers:
rank   0 = 5.74565220
rank   1 = 4.53572273
rank   2 = 2.89732289
rank   3 = 2.04310060

--------------------------------------------------------------------------------
/nvme/miniconda3/envs/tnn/lib/python3.9/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 36 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
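
For context, the check that raises this error compares each rank's gradient norm against the others. The sketch below (not metaseq's actual implementation, and assuming torch.distributed is already initialized) shows roughly what is being verified and why the per-rank grad_norm values above matter:

    import torch
    import torch.distributed as dist

    def check_grad_norms(grad_norm: torch.Tensor, tol: float = 1e-6) -> None:
        # Gather every rank's local gradient norm into one list.
        world_size = dist.get_world_size()
        norms = [torch.zeros_like(grad_norm) for _ in range(world_size)]
        dist.all_gather(norms, grad_norm)
        norms = torch.stack(norms)
        # With data-parallel training the (all-reduced) grad norm should be
        # identical on every rank; if it is not, the workers have desynchronized.
        if not torch.allclose(norms, norms[0], rtol=tol, atol=tol):
            detail = "\n".join(f"rank {i:3d} = {n.item():.8f}" for i, n in enumerate(norms))
            raise FloatingPointError(
                "gradients are inconsistent between workers:\n" + detail
            )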

When I follow the suggestion and use --ddp-backend=legacy_ddp, everything works fine, but that is not what I want.

Expected behavior

Expect --ddp-backend=fully_sharded and --bf16 to work well together.

Environment

suchenzang commented 1 year ago

Ah, we need to update the docs. Can you try the ngoyal_bf16_changes branch for fairscale?

weigao266 commented 1 year ago

Ah, we need to update the docs. Can you try the ngoyal_bf16_changes branch for fairscale?

Thanks, I have tried the ngoyal_bf16_changes branch for fairscale, but still got the same error.

hyoo commented 1 year ago

--bf16: use BF16 format. Currently --bf16 is an added argument: with --fp16 for mixed precision bf16 training, or with --memory-efficient-fp16 for pure bf16 training.

you may need either --fp16 or --memory-efficient-fp16
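
Concretely, that would mean changing the mixed-precision flags in the repro command above, roughly along these lines (a sketch based on the description above; the rest of the flags stay the same):

    # mixed precision bf16 training
    --ddp-backend=fully_sharded --checkpoint-activations --fp16 --bf16 \

    # or, pure bf16 training
    --ddp-backend=fully_sharded --checkpoint-activations --memory-efficient-fp16 --bf16 \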

weigao266 commented 1 year ago

--bf16: use BF16 format. Currently --bf16 is an added argument: with --fp16 for mixed precision bf16 training, or with --memory-efficient-fp16 for pure bf16 training.

you may need either --fp16 or --memory-efficient-fp16

It works. Thanks!