facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
MIT License

Training killed with SIGKILL on restarting from checkpoint using FSDP model #3918

Open sksq96 opened 2 years ago

sksq96 commented 2 years ago

❓ Questions and Help

What is your question?

I've pretrained a 13B GPT3 model with FSDP following this guide. However, whenever I try to fine-tune it, using that checkpoint as the starting point, the job is killed with SIGKILL, as shown below.

2021-09-28 20:03:28 | INFO | fairseq.trainer | Preparing to load checkpoint /workspaceblobstore/shubham/experiments/bigger/gptx.13B.dawn/checkpoint_last.pt
Traceback (most recent call last):
  File "/home/schandel/.pyenv/versions/anaconda3-2020.11/bin/fairseq-train", line 33, in <module>
    sys.exit(load_entry_point('fairseq', 'console_scripts', 'fairseq-train')())
  File "/workspaceblobstore/shubham/fairseq/fairseq_cli/train.py", line 507, in cli_main
    distributed_utils.call_main(cfg, main)
  File "/workspaceblobstore/shubham/fairseq/fairseq/distributed/utils.py", line 344, in call_main
    torch.multiprocessing.spawn(
  File "/home/schandel/.pyenv/versions/anaconda3-2020.11/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/schandel/.pyenv/versions/anaconda3-2020.11/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/home/schandel/.pyenv/versions/anaconda3-2020.11/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 130, in join
    raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 2 terminated with signal SIGKILL

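For what it's worth, a SIGKILL with no Python traceback from the worker itself usually comes from outside the process, most often the kernel OOM killer when the checkpoint (model weights plus the cpu_adam optimizer state) is loaded into host RAM on every rank at once. Assuming a Linux host where the kernel log is readable, one way to confirm this is:

# Look for OOM-killer entries around the time the job died (assumes dmesg/journalctl access)
dmesg -T | grep -iE "out of memory|oom-kill|killed process"
# or, on systemd hosts:
journalctl -k --since "1 hour ago" | grep -i oom
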
Code

export OMP_NUM_THREADS=20
export CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15"
fairseq-train $DATADIR \
    --arch transformer_lm_gpt3_13 \
    --restore-file /workspaceblobstore/shubham/experiments/bigger/gptx.13B.dawn/checkpoint_last.pt \
    --task language_modeling --tokens-per-sample 2048 --batch-size 8 \
    --ddp-backend fully_sharded \
    --fp16 --fp16-init-scale 4 \
    --cpu-offload --checkpoint-activations \
    --optimizer cpu_adam --adam-betas "(0.9,0.98)" \
    --lr 0.00009 --lr-scheduler polynomial_decay --warmup-updates 5 --total-num-update 10 \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --max-update 10 --no-save --log-format json --log-interval 1 \
    --dropout 0.1 --relu-dropout 0.1 --attention-dropout 0.1 \
    --layernorm-embedding

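If the goal is fine-tuning rather than resuming the pretraining run, a possible variant (untested here) is to restore only the model weights and skip the 13B optimizer/dataloader/meter state, using fairseq's reset flags appended to the command above:

# Sketch only: restore model weights but reset training state for fine-tuning
    --reset-optimizer --reset-dataloader --reset-meters --reset-lr-scheduler

fairseq also exposes --finetune-from-model for this use case; whether either option avoids the SIGKILL above would need to be verified.
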
What's your environment?

stale[bot] commented 2 years ago

This issue has been automatically marked as stale. If this issue is still affecting you, please leave any comment (for example, "bump"), and we'll keep it open. We are sorry that we haven't been able to prioritize it yet. If you have any additional information, please include it with your comment!

speechless-z commented 1 year ago

Hello, have you solved this problem? If so, what was the solution?

DaliaDawod commented 1 year ago

I still have the same problem. Has it been solved?