HHousen / TransformerSum

Models to perform neural summarization (extractive and abstractive) using machine learning transformers and a tool to convert abstractive summarization datasets to the extractive task.
https://transformersum.rtfd.io
GNU General Public License v3.0

Training on multi-gpus #49

Closed dongjun-Lee closed 3 years ago

dongjun-Lee commented 3 years ago

Hi,

Thank you for your great work. I'm using the training script from the documentation, as shown below:

python main.py \
--model_name_or_path bert-base-uncased \
--model_type bert \
--data_path ./bert-base-uncased \
--max_epochs 3 \
--accumulate_grad_batches 2 \
--warmup_steps 2300 \
--gradient_clip_val 1.0 \
--optimizer_type adamw \
--use_scheduler linear \
--do_train --do_test \
--batch_size 16

It works well when using a single GPU. However, when I use multiple GPUs (export CUDA_VISIBLE_DEVICES=0,1), the error below occurs.

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/bering/anaconda3/envs/torch1.6/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
    fn(i, *args)
  File "/home/bering/anaconda3/envs/torch1.6/lib/python3.6/site-packages/pytorch_lightning/plugins/training_type/ddp_spawn.py", line 172, in new_process
    results = trainer.run_stage()
  File "/home/bering/anaconda3/envs/torch1.6/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 807, in run_stage
    return self.run_train()
  File "/home/bering/anaconda3/envs/torch1.6/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 842, in run_train
    self.run_sanity_check(self.lightning_module)
  File "/home/bering/anaconda3/envs/torch1.6/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 1107, in run_sanity_check
    self.run_evaluation()
  File "/home/bering/anaconda3/envs/torch1.6/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 988, in run_evaluation
    self.evaluation_loop.evaluation_epoch_end(outputs)
  File "/home/bering/anaconda3/envs/torch1.6/lib/python3.6/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 213, in evaluation_epoch_end
    model.validation_epoch_end(outputs)
  File "/home/bering/git/TransformerSum/src/extractive.py", line 841, in validation_epoch_end
    self.log(name, value, prog_bar=True, sync_dist=True)
  File "/home/bering/anaconda3/envs/torch1.6/lib/python3.6/site-packages/pytorch_lightning/core/lightning.py", line 345, in log
    self.device,
  File "/home/bering/anaconda3/envs/torch1.6/lib/python3.6/site-packages/pytorch_lightning/core/step_result.py", line 116, in log
    value = sync_fn(value, group=sync_dist_group, reduce_op=sync_dist_op)
  File "/home/bering/anaconda3/envs/torch1.6/lib/python3.6/site-packages/pytorch_lightning/plugins/training_type/ddp_spawn.py", line 301, in reduce
    tensor = sync_ddp_if_available(tensor, group, reduce_op=(reduce_op or "mean"))
  File "/home/bering/anaconda3/envs/torch1.6/lib/python3.6/site-packages/pytorch_lightning/utilities/distributed.py", line 137, in sync_ddp_if_available
    return sync_ddp(result, group=group, reduce_op=reduce_op)
  File "/home/bering/anaconda3/envs/torch1.6/lib/python3.6/site-packages/pytorch_lightning/utilities/distributed.py", line 170, in sync_ddp
    torch.distributed.all_reduce(result, op=op, group=group, async_op=False)
  File "/home/bering/anaconda3/envs/torch1.6/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 936, in all_reduce
    work = _default_pg.allreduce([tensor], opts)
RuntimeError: Tensors must be CUDA and dense
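
As far as I can tell from the traceback (my own reading, not something I have confirmed in the TransformerSum code), self.log(..., sync_dist=True) ends up calling torch.distributed.all_reduce, and the NCCL backend can only reduce dense CUDA tensors, so a value that is still on the CPU triggers exactly this "Tensors must be CUDA and dense" message. A minimal standalone sketch that reproduces the same error, assuming a machine with at least two GPUs and the NCCL backend:

import os
import torch
import torch.distributed as dist

def run(rank, world_size):
    # Minimal DDP setup; the address and port are arbitrary local values for this sketch.
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)

    # A CUDA tensor reduces fine across the two processes.
    cuda_value = torch.tensor(1.0, device=f"cuda:{rank}")
    dist.all_reduce(cuda_value)

    # A CPU tensor hits the same error as in the traceback above:
    # RuntimeError: Tensors must be CUDA and dense
    cpu_value = torch.tensor(1.0)
    dist.all_reduce(cpu_value)

    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = 2
    torch.multiprocessing.spawn(run, args=(world_size,), nprocs=world_size)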

Can I use the training script in a multi-GPU setting? Thank you.

HHousen commented 3 years ago

The training script supports multiple GPUs through pytorch-lightning (see the documentation here). To use GPUs 0 and 1, specify the --gpus argument as 2, like so: --gpus 2. Let me know if this helps!
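
For example, here is the command from the original post with that flag appended (a sketch on my end, not a configuration I have rerun):

# Same command as above; only --gpus 2 is new.
python main.py \
--model_name_or_path bert-base-uncased \
--model_type bert \
--data_path ./bert-base-uncased \
--max_epochs 3 \
--accumulate_grad_batches 2 \
--warmup_steps 2300 \
--gradient_clip_val 1.0 \
--optimizer_type adamw \
--use_scheduler linear \
--do_train --do_test \
--batch_size 16 \
--gpus 2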