facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.

How to use multi-GPU #4591

Open krcc5978 opened 2 years ago

krcc5978 commented 2 years ago

❓ Questions and Help

Before asking:

  1. search the issues.
  2. search the docs.

What is your question?

I want to fine-tune the BART summarization model. The machine I'm using is an AWS p3.8xlarge.

I run the command below

CUDA_VISIBLE_DEVICES=0,1,2,3 fairseq-train cnn_dm-bin \
    --restore-file $BART_PATH \
    --max-tokens $MAX_TOKENS \
    --task translation \
    --source-lang source --target-lang target \
    --truncate-source \
    --layernorm-embedding \
    --share-all-embeddings \
    --share-decoder-input-output-embed \
    --reset-optimizer --reset-dataloader --reset-meters \
    --required-batch-size-multiple 1 \
    --arch bart_large \
    --criterion label_smoothed_cross_entropy \
    --label-smoothing 0.1 \
    --dropout 0.1 --attention-dropout 0.1 \
    --weight-decay 0.01 --optimizer adam --adam-betas "(0.9, 0.999)" --adam-eps 1e-08 \
    --clip-norm 0.1 \
    --lr-scheduler polynomial_decay --lr $LR --total-num-update $TOTAL_NUM_UPDATES --warmup-updates $WARMUP_UPDATES \
    --fp16 --update-freq $UPDATE_FREQ \
    --skip-invalid-size-inputs-valid-test \
    --find-unused-parameters;
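(The shell variables above are placeholders. One plausible assignment, following the public fairseq BART CNN-DM fine-tuning example and shown here purely for illustration, would be:

    TOTAL_NUM_UPDATES=20000                  # total optimizer updates
    WARMUP_UPDATES=500                       # linear LR warmup steps
    LR=3e-05                                 # peak learning rate
    MAX_TOKENS=2048                          # max tokens per GPU per batch
    UPDATE_FREQ=4                            # gradient accumulation steps
    BART_PATH=/path/to/bart.large/model.pt   # pretrained checkpoint to fine-tune

Adjust these to your own setup.)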

When I do this, I get an allocation error and training fails. It looks as if the multiple GPUs cannot be used. How can I train on multiple GPUs without running into this allocation error?

What's your environment?

gmryu commented 2 years ago

I suggest you try a vanilla execution, for example:

  1. no --restore-file
  2. --batch-size 1 instead of --max-tokens
  3. --arch bart_base
  4. you may need to remove those --reset-... arguments and adjust according to the error log

Then please paste the error log if it is a GPU problem. If you manage to run a vanilla training, then your max_tokens or arch might simply be too big. A sketch of such a pared-down command is given below.
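For illustration only, a minimal vanilla run along those lines might look like the following, assuming the same cnn_dm-bin data directory; every value here is a placeholder to adjust:

    CUDA_VISIBLE_DEVICES=0,1,2,3 fairseq-train cnn_dm-bin \
        --task translation \
        --source-lang source --target-lang target \
        --truncate-source \
        --arch bart_base \
        --batch-size 1 \
        --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
        --optimizer adam --lr 3e-05 \
        --lr-scheduler polynomial_decay --total-num-update 20000 \
        --skip-invalid-size-inputs-valid-test

If this runs, add back the original arguments (larger arch, --restore-file, --max-tokens, fp16, etc.) one at a time to find which one triggers the failure.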

krcc5978 commented 2 years ago

Hi, Thank you for your reply.

I tried your suggestions. However, the model I want to use is bart.large, so I skipped the third proposal.

The result is the same as last time: an allocation error occurred.

2022-07-20 00:06:42 | ERROR | fairseq.trainer | OOM during optimization, irrecoverable
Traceback (most recent call last):
  File "/envs/fairseq_en/bin/fairseq-train", line 33, in <module>
    sys.exit(load_entry_point('fairseq', 'console_scripts', 'fairseq-train')())
  File "/bart/fairseq/fairseq_cli/train.py", line 557, in cli_main
    distributed_utils.call_main(cfg, main)
  File "/bart/fairseq/fairseq/distributed/utils.py", line 351, in call_main
    join=True,
  File "/envs/fairseq_en/lib64/python3.7/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/envs/fairseq_en/lib64/python3.7/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/envs/fairseq_en/lib64/python3.7/site-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/envs/fairseq_en/lib64/python3.7/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/bart/fairseq/fairseq/distributed/utils.py", line 328, in distributed_main
    main(cfg, **kwargs)
  File "/bart/fairseq/fairseq_cli/train.py", line 190, in main
    valid_losses, should_stop = train(cfg, trainer, task, epoch_itr)
  File "/usr/lib64/python3.7/contextlib.py", line 74, in inner
    return func(*args, **kwds)
  File "/bart/fairseq/fairseq_cli/train.py", line 316, in train
    log_output = trainer.train_step(samples)
  File "/usr/lib64/python3.7/contextlib.py", line 74, in inner
    return func(*args, **kwds)
  File "/bart/fairseq/fairseq/trainer.py", line 1001, in train_step
    raise e
  File "/bart/fairseq/fairseq/trainer.py", line 955, in train_step
    self.optimizer, model=self.model, update_num=self.get_num_updates()
  File "/bart/fairseq/fairseq/tasks/fairseq_task.py", line 531, in optimizer_step
    optimizer.step()
  File "/bart/fairseq/fairseq/optim/fp16_optimizer.py", line 218, in step
    self.fp32_optimizer.step(closure, groups=groups)
  File "/bart/fairseq/fairseq/optim/fairseq_optimizer.py", line 127, in step
    self.optimizer.step(closure)
  File "/envs/fairseq_en/lib64/python3.7/site-packages/torch/optim/optimizer.py", line 109, in wrapper
    return func(*args, **kwargs)
  File "/bart/fairseq/fairseq/optim/adam.py", line 223, in step
    denom = exp_avg_sq.sqrt().add_(group["eps"])
RuntimeError: CUDA out of memory. Tried to allocate 2.08 GiB (GPU 0; 15.78 GiB total capacity; 11.15 GiB already allocated; 2.00 GiB free; 12.54 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

gmryu commented 2 years ago

The RuntimeError: CUDA out of memory (OOM) happens on a single GPU, so it is not a multi-GPU problem. Allocating memory is necessary because the model and batches have to be transferred from your files to the GPU.

It is strange that a 16 GB GPU cannot handle bart_large. I wonder:

  1. What does the log say before this error happens?
  2. What is your actual model size? It is written in the log.
  3. How long are your sentences?
  4. Running with bart_base would help determine whether the model is too large or the sentences are too long.

--

Also, the official example used a 32 GB GPU to run bart_large, so I suspect each sentence here is pretty long?
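If it does turn out to be plain memory pressure rather than a data problem, below is a sketch of memory-reducing options that are sometimes combined for bart_large on 16 GB cards. The flags exist in fairseq, but the concrete values are assumptions to tune, not something verified on this setup:

    # Reduce allocator fragmentation, as the error message itself suggests:
    export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128

    # Shrink the per-GPU batch (--max-tokens) and compensate with gradient
    # accumulation (--update-freq); --memory-efficient-fp16 implies --fp16 but
    # keeps less optimizer state on the GPU, and --checkpoint-activations
    # trades extra compute for lower activation memory.
    CUDA_VISIBLE_DEVICES=0,1,2,3 fairseq-train cnn_dm-bin \
        --restore-file $BART_PATH \
        --max-tokens 1024 --update-freq 8 \
        --task translation --source-lang source --target-lang target \
        --truncate-source --layernorm-embedding \
        --share-all-embeddings --share-decoder-input-output-embed \
        --reset-optimizer --reset-dataloader --reset-meters \
        --required-batch-size-multiple 1 \
        --arch bart_large \
        --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
        --dropout 0.1 --attention-dropout 0.1 \
        --weight-decay 0.01 --optimizer adam --adam-betas "(0.9, 0.999)" --adam-eps 1e-08 \
        --clip-norm 0.1 \
        --lr-scheduler polynomial_decay --lr $LR --total-num-update $TOTAL_NUM_UPDATES --warmup-updates $WARMUP_UPDATES \
        --memory-efficient-fp16 --checkpoint-activations \
        --skip-invalid-size-inputs-valid-test \
        --find-unused-parameters

Lowering --max-tokens while raising --update-freq keeps the effective batch size roughly the same, so the training recipe itself should not need to change much.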