facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
MIT License

Translation with FSDP gets stuck in validation step #3532

Closed thies1006 closed 2 years ago

thies1006 commented 3 years ago

What is your question?

How do I run training and validation of a translation model with FSDP?

Code

OMP_NUM_THREADS=20 CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
    fairseq-train data-bin/iwslt14.tokenized.de-en \
    --ddp-backend fully_sharded --fp16 --fp16-init-scale 4 \
    --cpu-offload --checkpoint-activations \
    --task translation --max-tokens 4096 \
    --arch transformer_iwslt_de_en --share-decoder-input-output-embed \
    --optimizer cpu_adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
    --lr 5e-4 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
    --dropout 0.3 --weight-decay 0.0001 \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --log-format json --log-interval 1 \
    --save-interval-updates 5 \
    --eval-bleu \
    --eval-bleu-args '{"beam": 5, "max_len_a": 1.2, "max_len_b": 10}' \
    --eval-bleu-detok moses \
    --eval-bleu-remove-bpe \
    --eval-bleu-print-samples \
    --best-checkpoint-metric bleu --maximize-best-checkpoint-metric

What have you tried?

The FSDP example with the language model runs fine. I then tried to merge the LM FSDP example with the ordinary translation example. The script above runs training fine (5 steps), but afterwards it gets stuck in the validation step after the first example. The GPUs and CPUs keep working.

Output:

2021-05-05 13:07:40 | INFO | fairseq_cli.train | task: TranslationTask
2021-05-05 13:07:40 | INFO | fairseq_cli.train | model: FullyShardedDataParallel
2021-05-05 13:07:40 | INFO | fairseq_cli.train | criterion: LabelSmoothedCrossEntropyCriterion
2021-05-05 13:07:40 | INFO | fairseq_cli.train | num. shared model params: 4,933,632 (num. trained: 4,933,632)
2021-05-05 13:07:40 | INFO | fairseq_cli.train | num. expert model params: 0 (num. trained: 0)
2021-05-05 13:07:40 | INFO | fairseq.data.data_utils | loaded 7,283 examples from: data-bin/iwslt14.tokenized.de-en/valid.de-en.de
2021-05-05 13:07:40 | INFO | fairseq.data.data_utils | loaded 7,283 examples from: data-bin/iwslt14.tokenized.de-en/valid.de-en.en
2021-05-05 13:07:40 | INFO | fairseq.tasks.translation | data-bin/iwslt14.tokenized.de-en valid de-en 7283 examples
2021-05-05 13:07:40 | INFO | fairseq.utils | ***********************CUDA enviroments for all 8 workers***********************
2021-05-05 13:07:40 | INFO | fairseq.utils | rank   0: capabilities =  7.5  ; total memory = 14.756 GB ; name = Tesla T4                                
2021-05-05 13:07:40 | INFO | fairseq.utils | rank   1: capabilities =  7.5  ; total memory = 14.756 GB ; name = Tesla T4                                
2021-05-05 13:07:40 | INFO | fairseq.utils | rank   2: capabilities =  7.5  ; total memory = 14.756 GB ; name = Tesla T4                                
2021-05-05 13:07:40 | INFO | fairseq.utils | rank   3: capabilities =  7.5  ; total memory = 14.756 GB ; name = Tesla T4                                
2021-05-05 13:07:40 | INFO | fairseq.utils | rank   4: capabilities =  7.5  ; total memory = 14.756 GB ; name = Tesla T4                                
2021-05-05 13:07:40 | INFO | fairseq.utils | rank   5: capabilities =  7.5  ; total memory = 14.756 GB ; name = Tesla T4                                
2021-05-05 13:07:40 | INFO | fairseq.utils | rank   6: capabilities =  7.5  ; total memory = 14.756 GB ; name = Tesla T4                                
2021-05-05 13:07:40 | INFO | fairseq.utils | rank   7: capabilities =  7.5  ; total memory = 14.756 GB ; name = Tesla T4                                
2021-05-05 13:07:40 | INFO | fairseq.utils | ***********************CUDA enviroments for all 8 workers***********************
2021-05-05 13:07:40 | INFO | fairseq_cli.train | training on 8 devices (GPUs/TPUs)
2021-05-05 13:07:40 | INFO | fairseq_cli.train | max tokens per device = 4096 and max sentences per device = None
2021-05-05 13:07:40 | INFO | fairseq.trainer | Preparing to load checkpoint checkpoints/checkpoint_last-shard0.pt
2021-05-05 13:07:40 | INFO | fairseq.trainer | No existing checkpoint found checkpoints/checkpoint_last-shard0.pt
2021-05-05 13:07:40 | INFO | fairseq.trainer | loading train data for epoch 1
2021-05-05 13:07:40 | INFO | fairseq.data.data_utils | loaded 160,239 examples from: data-bin/iwslt14.tokenized.de-en/train.de-en.de
2021-05-05 13:07:40 | INFO | fairseq.data.data_utils | loaded 160,239 examples from: data-bin/iwslt14.tokenized.de-en/train.de-en.en
2021-05-05 13:07:40 | INFO | fairseq.tasks.translation | data-bin/iwslt14.tokenized.de-en train de-en 160239 examples
Using /secondary/thies/.cache/torch_extensions as PyTorch extensions root...
Using /secondary/thies/.cache/torch_extensions as PyTorch extensions root...
Using /secondary/thies/.cache/torch_extensions as PyTorch extensions root...
Using /secondary/thies/.cache/torch_extensions as PyTorch extensions root...
Using /secondary/thies/.cache/torch_extensions as PyTorch extensions root...
Using /secondary/thies/.cache/torch_extensions as PyTorch extensions root...
Using /secondary/thies/.cache/torch_extensions as PyTorch extensions root...
Using /secondary/thies/.cache/torch_extensions as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /secondary/thies/.cache/torch_extensions/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
Time to load cpu_adam op: 0.7473657131195068 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 0.7493686676025391 seconds
Loading extension module cpu_adam...
Loading extension module cpu_adam...
Time to load cpu_adam op: 0.8453831672668457 seconds
Time to load cpu_adam op: 0.7807090282440186 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 0.7946493625640869 seconds
Loading extension module cpu_adam...
Loading extension module cpu_adam...
Loading extension module cpu_adam...
Time to load cpu_adam op: 0.7924172878265381 seconds
Time to load cpu_adam op: 0.8085505962371826 seconds
Time to load cpu_adam op: 0.7884249687194824 seconds
Adam Optimizer #0 is created with AVX512 arithmetic capability.
Config: alpha=0.000500, betas=(0.900000, 0.980000), weight_decay=0.000100, adam_w=1
Adam Optimizer #0 is created with AVX512 arithmetic capability.
Config: alpha=0.000500, betas=(0.900000, 0.980000), weight_decay=0.000100, adam_w=1
Adam Optimizer #0 is created with AVX512 arithmetic capability.
Config: alpha=0.000500, betas=(0.900000, 0.980000), weight_decay=0.000100, adam_w=1
Adam Optimizer #0 is created with AVX512 arithmetic capability.
Config: alpha=0.000500, betas=(0.900000, 0.980000), weight_decay=0.000100, adam_w=1
Adam Optimizer #0 is created with AVX512 arithmetic capability.
Config: alpha=0.000500, betas=(0.900000, 0.980000), weight_decay=0.000100, adam_w=1
Adam Optimizer #0 is created with AVX512 arithmetic capability.
Config: alpha=0.000500, betas=(0.900000, 0.980000), weight_decay=0.000100, adam_w=1
2021-05-05 13:07:43 | INFO | fairseq.trainer | begin training epoch 1
2021-05-05 13:07:43 | INFO | fairseq_cli.train | Start iterating over samples
Adam Optimizer #0 is created with AVX512 arithmetic capability.
Config: alpha=0.000500, betas=(0.900000, 0.980000), weight_decay=0.000100, adam_w=1
Adam Optimizer #0 is created with AVX512 arithmetic capability.
Config: alpha=0.000500, betas=(0.900000, 0.980000), weight_decay=0.000100, adam_w=1
2021-05-05 13:07:47 | INFO | train_inner | {"epoch": 1, "update": 0.007, "loss": "13.498", "nll_loss": "13.506", "ppl": "11636.6", "wps": "0", "ups": "0", "wpb": "28516", "bsz": "1272", "num_updates": "1", "lr": "1.25e-07", "gnorm": "5.345", "loss_scale": "4", "train_wall": "2", "gb_free": "14.4", "wall": "7"}
2021-05-05 13:07:48 | INFO | train_inner | {"epoch": 1, "update": 0.014, "loss": "13.472", "nll_loss": "13.477", "ppl": "11405.9", "wps": "31140.7", "ups": "1.06", "wpb": "29497", "bsz": "1080", "num_updates": "2", "lr": "2.5e-07", "gnorm": "5.429", "loss_scale": "4", "train_wall": "1", "gb_free": "14.3", "wall": "8"}
2021-05-05 13:07:49 | INFO | train_inner | {"epoch": 1, "update": 0.022, "loss": "13.474", "nll_loss": "13.48", "ppl": "11426.5", "wps": "28879", "ups": "1", "wpb": "28881", "bsz": "1056", "num_updates": "3", "lr": "3.75e-07", "gnorm": "5.46", "loss_scale": "4", "train_wall": "1", "gb_free": "14.3", "wall": "9"}
2021-05-05 13:07:50 | INFO | train_inner | {"epoch": 1, "update": 0.029, "loss": "13.458", "nll_loss": "13.462", "ppl": "11286.3", "wps": "23210.6", "ups": "0.81", "wpb": "28624", "bsz": "824", "num_updates": "4", "lr": "5e-07", "gnorm": "5.482", "loss_scale": "4", "train_wall": "1", "gb_free": "14.3", "wall": "10"}
2021-05-05 13:07:51 | INFO | train_inner | {"epoch": 1, "update": 0.036, "loss": "13.469", "nll_loss": "13.475", "ppl": "11384.5", "wps": "31104.4", "ups": "1.06", "wpb": "29284", "bsz": "960", "num_updates": "5", "lr": "6.25e-07", "gnorm": "5.432", "loss_scale": "4", "train_wall": "1", "gb_free": "14.3", "wall": "11"}
2021-05-05 13:07:51 | INFO | fairseq_cli.train | begin validation on "valid" subset
2021-05-05 13:07:55 | INFO | fairseq.tasks.translation | example hypothesis: senses senses senses texas texas texas texas texas texas texas texas texas texas texas texas texas texas texas
2021-05-05 13:07:55 | INFO | fairseq.tasks.translation | example reference: they're just not moving.

(does not continue; waited for ~20 min)
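To see where each rank is blocked during a hang like this, one option (not something used in the original report) is a stack-dump timer; a minimal sketch using only Python's stdlib `faulthandler`, with hypothetical helper names:

```python
# Sketch: if the process is still running after `timeout_s` seconds,
# print a traceback for every thread, e.g. to locate where the
# validation step is stuck. Stdlib only (faulthandler).
import faulthandler
import sys


def arm_hang_dump(timeout_s: float = 60.0, out=sys.stderr):
    # After timeout_s seconds, faulthandler writes a traceback of every
    # thread to `out` (must be a real file object with a fileno).
    faulthandler.dump_traceback_later(timeout_s, file=out, exit=False)


def disarm_hang_dump():
    # Cancel the pending dump once the step completes normally.
    faulthandler.cancel_dump_traceback_later()
```

Calling `arm_hang_dump()` on every rank just before validation, and `disarm_hang_dump()` after it, would show which collective or generation call each worker is blocked in.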

What's your environment?

thies1006 commented 3 years ago

Apparently the script works when removing --checkpoint-activations.
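For reference, the same launch command with that flag dropped (untested here; all other flags unchanged from the original report):

```shell
OMP_NUM_THREADS=20 CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
    fairseq-train data-bin/iwslt14.tokenized.de-en \
    --ddp-backend fully_sharded --fp16 --fp16-init-scale 4 \
    --cpu-offload \
    --task translation --max-tokens 4096 \
    --arch transformer_iwslt_de_en --share-decoder-input-output-embed \
    --optimizer cpu_adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
    --lr 5e-4 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
    --dropout 0.3 --weight-decay 0.0001 \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --log-format json --log-interval 1 \
    --save-interval-updates 5 \
    --eval-bleu \
    --eval-bleu-args '{"beam": 5, "max_len_a": 1.2, "max_len_b": 10}' \
    --eval-bleu-detok moses \
    --eval-bleu-remove-bpe \
    --eval-bleu-print-samples \
    --best-checkpoint-metric bleu --maximize-best-checkpoint-metric
```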

stale[bot] commented 3 years ago

This issue has been automatically marked as stale. If this issue is still affecting you, please leave any comment (for example, "bump"), and we'll keep it open. We are sorry that we haven't been able to prioritize it yet. If you have any new additional information, please include it with your comment!

stale[bot] commented 2 years ago

Closing this issue after a prolonged period of inactivity. If this issue is still present in the latest release, please create a new issue with up-to-date information. Thank you!