Zasder3 / train-CLIP

A PyTorch Lightning solution to training OpenAI's CLIP from scratch.

Error occurs when using DeepSpeed #13

Closed kobiso closed 3 years ago

kobiso commented 3 years ago

Hi @Zasder3, thank you for the great work!

I was wondering whether you have tried using DeepSpeed, because I saw this commit log (DeepSpeed Optimizer indexing). When I tried DeepSpeed by adding --plugins deepspeed_stage_2, I got the errors below.
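
For context, that flag corresponds roughly to constructing the Trainer like this (a minimal sketch; the gpus/precision values are illustrative, not my exact settings):

from pytorch_lightning import Trainer

# "deepspeed_stage_2" selects the DeepSpeed ZeRO stage-2 training-type plugin,
# the same thing the --plugins deepspeed_stage_2 CLI flag requests.
trainer = Trainer(gpus=2, precision=16, plugins="deepspeed_stage_2")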

Traceback (most recent call last):
  File "/opt/conda/envs/clip_cuda11.1/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 871, in run_train
    self.train_loop.run_training_epoch()
  File "/opt/conda/envs/clip_cuda11.1/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 499, in run_training_epoch
    batch_output = self.run_training_batch(batch, batch_idx, dataloader_idx)
  File "/opt/conda/envs/clip_cuda11.1/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 743, in run_training_batch
    self._curr_step_result = self.training_step(
  File "/opt/conda/envs/clip_cuda11.1/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 290, in training_step
    training_step_output = self.trainer.accelerator.training_step(args)
  File "/opt/conda/envs/clip_cuda11.1/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 204, in training_step
    return self.training_type_plugin.training_step(*args)
  File "/opt/conda/envs/clip_cuda11.1/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/ddp.py", line 337, in training_step
    return self.model(*args, **kwargs)
  File "/opt/conda/envs/clip_cuda11.1/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/envs/clip_cuda11.1/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1105, in forward
    loss = self.module(*inputs, **kwargs)
  File "/opt/conda/envs/clip_cuda11.1/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/envs/clip_cuda11.1/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/deepspeed.py", line 62, in forward
    return super().forward(*inputs, **kwargs)
  File "/opt/conda/envs/clip_cuda11.1/lib/python3.8/site-packages/pytorch_lightning/overrides/base.py", line 46, in forward
    output = self.module.training_step(*inputs, **kwargs)
  File "/home/shared/workspace/multimodal-matching/multimodal-matching/train-CLIP/models/wrapper.py", line 106, in training_step
    self.manual_backward(loss)
  File "/opt/conda/envs/clip_cuda11.1/lib/python3.8/site-packages/pytorch_lightning/core/lightning.py", line 1252, in manual_backward
    self.trainer.train_loop.backward(loss, optimizer=None, opt_idx=None, *args, **kwargs)
  File "/opt/conda/envs/clip_cuda11.1/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 867, in backward
    self.trainer.accelerator.backward(result, optimizer, opt_idx, should_accumulate, *args, **kwargs)
  File "/opt/conda/envs/clip_cuda11.1/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 306, in backward
    self.training_type_plugin.pre_backward(closure_loss, should_accumulate, optimizer, optimizer_idx)
  File "/opt/conda/envs/clip_cuda11.1/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/ddp.py", line 311, in pre_backward
    if not self.lightning_module.automatic_optimization and self.model.require_backward_grad_sync:
  File "/opt/conda/envs/clip_cuda11.1/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'DeepSpeedEngine' object has no attribute 'require_backward_grad_sync'

The error occurs at the line linked below, where the model sets self.automatic_optimization = False. https://github.com/Zasder3/train-CLIP/blob/ab1c59359a8e729fe05fd99aecdddf1eb9f43843/models/wrapper.py#L81
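
For reference, the failing pattern boils down to something like this (a paraphrased, minimal sketch, not the actual wrapper.py code; the Linear layer and loss are stand-ins):

import torch
import pytorch_lightning as pl

class ManualOptModule(pl.LightningModule):
    """Minimal stand-in for models/wrapper.py (paraphrased, not the real code)."""

    def __init__(self):
        super().__init__()
        # Manual optimization: Lightning skips its automatic backward/step logic.
        self.automatic_optimization = False
        self.layer = torch.nn.Linear(4, 4)  # stand-in for the CLIP model

    def training_step(self, batch, batch_idx):
        optimizer = self.optimizers()
        loss = self.layer(batch).sum()  # stand-in for the CLIP contrastive loss
        # Under DeepSpeed this call reaches DDPPlugin.pre_backward, which reads
        # self.model.require_backward_grad_sync. DistributedDataParallel has that
        # attribute; DeepSpeedEngine does not, hence the AttributeError above.
        self.manual_backward(loss)
        optimizer.step()
        optimizer.zero_grad()

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=1e-3)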

I could run DeepSpeed by setting self.automatic_optimization = True and removing self.manual_backward(loss). (It still needs some debugging, though, because the training pattern changes.)
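
For concreteness, that workaround looks roughly like this (same stand-in module as above; a hypothetical minimal sketch):

import torch
import pytorch_lightning as pl

class AutoOptModule(pl.LightningModule):
    """Same stand-in, but with automatic optimization (the Lightning default)."""

    def __init__(self):
        super().__init__()
        self.automatic_optimization = True  # the default; shown here for contrast
        self.layer = torch.nn.Linear(4, 4)

    def training_step(self, batch, batch_idx):
        loss = self.layer(batch).sum()  # stand-in for the CLIP loss
        # No manual_backward() and no explicit optimizer.step(): returning the
        # loss lets Lightning hand the backward pass to the DeepSpeed engine.
        return loss

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=1e-3)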

My working environment is pytorch=1.9, cuda=11.1, pytorch-lightning=1.3.8. Thanks in advance!

Zasder3 commented 3 years ago

I tried using it in the past, per the commits you found. Sadly, I was unable to get it to work due to some Lightning limitations. They are actively working on supporting manual optimization with DeepSpeed, but it hasn't worked for this repo yet: https://github.com/PyTorchLightning/pytorch-lightning/issues/7957. If you want to help out the Lightning team, install the nightly build and open an issue. It may even already be fixed!

Sadly, DeepSpeed has dropped in my priorities since DDP works perfectly for my needs. If you manage to patch it, feel free to send a pull request!

kobiso commented 3 years ago

Thanks for the reply! I shall check the link and try the nightly version :)