clovaai / donut

Official Implementation of OCR-free Document Understanding Transformer (Donut) and Synthetic Document Generator (SynthDoG), ECCV 2022
https://arxiv.org/abs/2111.15664
MIT License

Unable to train using rocm with more than 1 gpu #138

Open Wyzix33 opened 1 year ago

Wyzix33 commented 1 year ago

Hi, I just started playing around with Donut and wanted to pretrain a new language. I have 3 AMD 6900 XT GPUs. I am able to run the trainer with one GPU:

Epoch 0:  79%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉                                  | 3182/4010 [46:44<12:09,  1.13it/s, loss=5.3, v_num=t_ro]

However, if I try to run it with 2 or 3 GPUs, I get an error. This config:

root@server:~/donut# python train.py --config config/train_ro.yaml --exp_version "base"
resume_from_checkpoint_path: None
result_path: ./result
pretrained_model_name_or_path: None
dataset_name_or_paths:
  - dataset/ro_dataset
sort_json_key: False
train_batch_sizes:
  - 3
val_batch_sizes:
  - 2
input_size:
  - 1280
  - 1280
max_length: 768
align_long_axis: False
num_nodes: 1
seed: 2022
lr: 3e-05
warmup_steps: 6000
num_training_samples_per_epoch: 1000
max_epochs: 300
max_steps: -1
num_workers: 8
val_check_interval: 1.0
check_val_every_n_epoch: 3
gradient_clip_val: 1.0
verbose: True
exp_name: train_ro
exp_version: base

throws this:

Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/2
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 2 processes
----------------------------------------------------------------------------------------------------

Traceback (most recent call last):
  File "/root/donut/train.py", line 150, in <module>
    train(config)
  File "/root/donut/train.py", line 134, in train
    trainer.fit(model_module, data_module)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 603, in fit
    call._call_and_handle_interrupt(
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/call.py", line 38, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 645, in _fit_impl
    self._run(model, ckpt_path=self.ckpt_path)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1035, in _run
    self.__setup_profiler()
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1495, in __setup_profiler
    self.profiler.setup(stage=self.state.fn, local_rank=local_rank, log_dir=self.log_dir)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1828, in log_dir
    dirpath = self.strategy.broadcast(dirpath)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/strategies/ddp.py", line 315, in broadcast
    torch.distributed.broadcast_object_list(obj, src, group=_group.WORLD)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1966, in broadcast_object_list
    obj_view = object_tensor[offset : offset + obj_size]
IndexError: Dimension specified as 0 but tensor has no dimensions

and with this config:

root@server:~/donut# python train.py --config config/train_ro.yaml --exp_version "test_ro"
resume_from_checkpoint_path: None
result_path: ./result
pretrained_model_name_or_path: None
dataset_name_or_paths:
  - dataset/ro_dataset
sort_json_key: False
train_batch_sizes:
  - 3
val_batch_sizes:
  - 1
input_size:
  - 1280
  - 1280
max_length: 512
align_long_axis: False
num_nodes: 1
seed: 2022
lr: 3e-05
warmup_steps: 3000
num_training_samples_per_epoch: 1000
max_epochs: 200
max_steps: -1
num_workers: 6
val_check_interval: 1.0
check_val_every_n_epoch: 3
gradient_clip_val: 1.0
verbose: True
exp_name: train_ro
exp_version: test_ro

I get:

Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/2
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 2 processes
----------------------------------------------------------------------------------------------------

Traceback (most recent call last):
  File "/root/donut/train.py", line 150, in <module>
    train(config)
  File "/root/donut/train.py", line 134, in train
    trainer.fit(model_module, data_module)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 603, in fit
    call._call_and_handle_interrupt(
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/call.py", line 38, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 645, in _fit_impl
    self._run(model, ckpt_path=self.ckpt_path)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1035, in _run
    self.__setup_profiler()
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1495, in __setup_profiler
    self.profiler.setup(stage=self.state.fn, local_rank=local_rank, log_dir=self.log_dir)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1828, in log_dir
    dirpath = self.strategy.broadcast(dirpath)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/strategies/ddp.py", line 315, in broadcast
    torch.distributed.broadcast_object_list(obj, src, group=_group.WORLD)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1955, in broadcast_object_list
    object_tensor = torch.empty(  # type: ignore[call-overload]
RuntimeError: Trying to create tensor with negative dimension -1: [-1]

Do I need to set something differently when using multiple GPUs, or is this a ROCm problem? Any help please...
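In case it helps to narrow things down, here is a minimal, untested sketch (the script name, port, and path are made up) that exercises torch.distributed.broadcast_object_list directly on two GPUs, outside of Lightning. This is the call that fails in both tracebacks above:

```python
# check_broadcast.py -- hypothetical standalone check, not part of donut.
# It broadcasts a Python object from rank 0 to rank 1 over the nccl/RCCL
# backend, mirroring what Lightning does in strategy.broadcast(log_dir).
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def worker(rank, world_size):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)  # ROCm GPUs are addressed through the torch.cuda API

    # rank 0 owns the object (a path string, like Lightning's log_dir);
    # the other ranks pass a placeholder and receive the broadcast value.
    obj = ["./result/train_ro/base"] if rank == 0 else [None]
    dist.broadcast_object_list(obj, src=0)
    print(f"rank {rank} received: {obj[0]}")

    dist.destroy_process_group()


if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2, join=True)
```

If this small script also fails, the problem is likely in the ROCm/RCCL setup rather than in Donut or Lightning.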

Wyzix33 commented 1 year ago

Small update: if I change strategy="ddp" to strategy="dp" it works, but it is slower... This is what I get with dp and 2 GPUs:

Epoch 0:   3%|████▉      | 60/2005 [03:34<1:55:54,  3.58s/it, loss=29.7, v_num=base]

With ddp and one GPU I get:

Epoch 0:   1%|█▊     | 44/4010 [00:40<1:00:38,  1.09it/s, loss=29, v_num=base]
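For reference, the change above is just a swap of the strategy argument where the Trainer is built. The exact Trainer construction in donut's train.py isn't shown in this issue, so the snippet below is only an illustrative sketch:

```python
# Illustrative only: the real Trainer in donut/train.py takes its values
# from the YAML config above and may pass additional arguments.
import pytorch_lightning as pl

trainer = pl.Trainer(
    accelerator="gpu",
    devices=2,            # two of the 6900 XTs
    num_nodes=1,
    # strategy="ddp",     # fails with the broadcast errors above on this ROCm setup
    strategy="dp",        # works, but trains more slowly
    max_epochs=300,
    gradient_clip_val=1.0,
    val_check_interval=1.0,
    check_val_every_n_epoch=3,
)
```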