FSDP is incompatible with BF16 #714

Closed weigao266 closed 1 year ago

weigao266 commented 1 year ago

🐛 Bug

I am training a standard transformer model on multiple GPUs with metaseq-train. --ddp-backend=fully_sharded and --bf16 each work well individually, but they are incompatible with each other. Is this expected for some reason, or is it just not supported yet?

To Reproduce

Steps to reproduce the behavior (always include the command you ran):

  1. To reproduce the error, run the following command on a cluster:
    OMP_NUM_THREADS=20 CUDA_VISIBLE_DEVICES=4,5,6,7 \
    metaseq-train --task language_modeling \
    $DATA_DIR \
    --vocab-filename $VOCAB_FILE \
    --merges-filename $MERGES_FILE \
    --save-dir checkpoints/$prefix/${ARCH} \
    --arch $ARCH --share-decoder-input-output-embed --dropout 0.1 \
    --ddp-backend=fully_sharded --checkpoint-activations --bf16 \
    --clip-norm $CLIP_NORM \
    --optimizer adam --adam-betas '(0.9, 0.98)' --weight-decay $decay \
    --lr $LR --lr-scheduler inverse_sqrt --warmup-updates $WARM_UP --warmup-init-lr 1e-08 \
    --tokens-per-sample $TOKENS_PER_SAMPLE --sample-break-mode none \
    --max-tokens $MAX_TOKEN --update-freq $UPDATE_FREQ \
    --batch-size $BATCH_SIZE \
    --max-update $MAX_UPDATE --log-format json --log-interval 1 2>&1 | tee -a $LOG_FILE
  2. You will see the following error:
2023-04-27 11:00:12 | INFO | metaseq.tasks.base_task | Starting backward pass
2023-04-27 11:00:12 | INFO | metaseq.tasks.base_task | Finished first backward pass
Traceback (most recent call last):
  File "/nvme/miniconda3/envs/tnn/bin/metaseq-train", line 8, in <module>
    sys.exit(cli_main())
  File "/nvme/metaseq/metaseq/cli/train.py", line 776, in cli_main
    distributed_utils.call_main(cfg, main)
  File "/nvme/metaseq/metaseq/distributed/utils.py", line 287, in call_main
    return _spawn_helper(main, cfg, kwargs)
  File "/nvme/metaseq/metaseq/distributed/utils.py", line 265, in _spawn_helper
    retval = distributed_main(-1, main, cfg, kwargs)
  File "/nvme/metaseq/metaseq/distributed/utils.py", line 227, in distributed_main
    retval = main(cfg, **kwargs)
  File "/nvme/metaseq/metaseq/cli/train.py", line 180, in main
    valid_losses, should_stop = train(cfg, trainer, task, epoch_itr)
  File "/nvme/miniconda3/envs/tnn/lib/python3.9/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/nvme/metaseq/metaseq/cli/train.py", line 379, in train
    valid_losses, should_stop = train(i, samples)
  File "/nvme/metaseq/metaseq/cli/train.py", line 274, in train
    log_output = trainer.train_step(samples)
  File "/nvme/miniconda3/envs/tnn/lib/python3.9/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/nvme/metaseq/metaseq/trainer.py", line 794, in train_step
    self._check_grad_norms(grad_norm)
  File "/nvme/metaseq/metaseq/trainer.py", line 1226, in _check_grad_norms
    raise FloatingPointError(
FloatingPointError: Fatal error: gradients are inconsistent between workers. Try --ddp-backend=legacy_ddp. Or are you mixing up different generation of GPUs in training?
--------------------------------------------------------------------------------
grad_norm across the workers:
rank   0 = 5.74565220
rank   1 = 4.53572273
rank   2 = 2.89732289
rank   3 = 2.04310060

--------------------------------------------------------------------------------
/nvme/miniconda3/envs/tnn/lib/python3.9/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 36 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
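
For context, the check that raises this error compares each rank's gradient norm against the others. The sketch below (not metaseq's actual implementation, and assuming torch.distributed is already initialized) shows roughly what is being verified and why the per-rank grad_norm values above matter:

    import torch
    import torch.distributed as dist

    def check_grad_norms(grad_norm: torch.Tensor, tol: float = 1e-6) -> None:
        # Gather every rank's local gradient norm into one list.
        world_size = dist.get_world_size()
        norms = [torch.zeros_like(grad_norm) for _ in range(world_size)]
        dist.all_gather(norms, grad_norm)
        norms = torch.stack(norms)
        # With data-parallel training the (all-reduced) grad norm should be
        # identical on every rank; if it is not, the workers have desynchronized.
        if not torch.allclose(norms, norms[0], rtol=tol, atol=tol):
            detail = "\n".join(f"rank {i:3d} = {n.item():.8f}" for i, n in enumerate(norms))
            raise FloatingPointError(
                "gradients are inconsistent between workers:\n" + detail
            )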

When I follow the suggestion and use --ddp-backend=legacy_ddp, everything works fine, but that is not what I want.

Expected behavior

Expect --ddp-backend=fully_sharded and --bf16 to work well together.

Environment

suchenzang commented 1 year ago

Ah, we need to update the docs. Can you try the ngoyal_bf16_changes branch for fairscale?

weigao266 commented 1 year ago

Ah, we need to update the docs. Can you try the ngoyal_bf16_changes branch for fairscale?

Thanks, I have tried the ngoyal_bf16_changes branch for fairscale, but still got the same error.

hyoo commented 1 year ago

--bf16: use BF16 format. Currently --bf16 is an added argument: with --fp16 for mixed precision bf16 training, or with --memory-efficient-fp16 for pure bf16 training.

you may need either --fp16 or --memory-efficient-fp16
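
Concretely, that would mean changing the mixed-precision flags in the repro command above, roughly along these lines (a sketch based on the description above; the rest of the flags stay the same):

    # mixed precision bf16 training
    --ddp-backend=fully_sharded --checkpoint-activations --fp16 --bf16 \

    # or, pure bf16 training
    --ddp-backend=fully_sharded --checkpoint-activations --memory-efficient-fp16 --bf16 \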

weigao266 commented 1 year ago

--bf16: use BF16 format. Currently --bf16 is an added argument: with --fp16 for mixed precision bf16 training, or with --memory-efficient-fp16 for pure bf16 training.

you may need either --fp16 or --memory-efficient-fp16

It works. Thanks!