Lightning-Universe / lightning-flash

Your PyTorch AI Factory - Flash enables you to easily configure and run complex AI recipes for over 15 tasks across 7 data domains
https://lightning-flash.readthedocs.io
Apache License 2.0

SpeechRecognition Task does not produce correct output shape with Deepspeed stage 3 #885

Closed: choclatier closed this issue 2 years ago

choclatier commented 2 years ago

🐛 Bug

An error is thrown when trying to fit the facebook/wav2vec2-large-robust-ft-swbd-300h model using the DeepSpeedPlugin with ZeRO stage 3.

 File "/usr/local/lib/python3.7/dist-packages/torch/autograd/__init__.py", line 149, in backward
    allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
RuntimeError: output with shape [1] doesn't match the broadcast shape [1024, 64, 128]
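
A minimal sketch of the failing in-place broadcast (my illustration, not the actual DeepSpeed code path): ZeRO stage 3 partitions parameters, so a gradient buffer can shrink to a single-element placeholder while autograd still produces gradients of the full parameter shape; accumulating one into the other in place raises exactly this message.

import torch

# Illustrative only: accumulating a full-shape gradient into a one-element
# buffer in place reproduces the exact error message from the traceback.
grad_buffer = torch.zeros(1)             # partitioned placeholder
incoming = torch.randn(1024, 64, 128)    # full-shape gradient from autograd
grad_buffer.add_(incoming)
# RuntimeError: output with shape [1] doesn't match the broadcast shape [1024, 64, 128]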

Colab test file

Google Colab file: https://colab.research.google.com/drive/1Je0_9D1iWB2C_BGOg-r_kqiX4owuZuro?usp=sharing

To Reproduce

Steps to reproduce the behavior:

import torch
import flash
from torchmetrics.functional.text import wer
from flash.audio import SpeechRecognition, SpeechRecognitionData
from flash.core.data.utils import download_data
from pytorch_lightning import plugins, utilities as ut

# 0. Logger (unused in this repro)
# logger = loggers.TensorBoardLogger('lightning_logs/')

# 1. Create the DataModule
seed = 10
ut.seed.seed_everything(seed)

download_data("https://pl-flash-data.s3.amazonaws.com/timit_data.zip", "./data")

datamodule = SpeechRecognitionData.from_json(
    input_fields="file",
    target_fields="text",
    train_file="data/timit/train.json",
    test_file="data/timit/test.json",
    batch_size=1,  # pass batch_size here instead of mutating the attribute afterwards
)

# 2. Build the task
model = SpeechRecognition(backbone="facebook/wav2vec2-large-robust-ft-swbd-300h")

# 3. Create the trainer and finetune the model

reference = "SHE HAD YOUR DARK SUIT IN GREASY WASH WATER ALL YEAR"

trainer = flash.Trainer(
    amp_level="O3",  # apex optimization level: the letter O, not the digit zero
    max_epochs=1,
    auto_lr_find=True,
    accelerator="deepspeed",
    plugins=plugins.DeepSpeedPlugin(stage=3),
    gpus=torch.cuda.device_count(),
    precision=16,
)

trainer.fit(model, datamodule=datamodule)
model.eval()
model.freeze()

# 4. Predict on audio files and report the word error rate
file = "example.wav"
predictions = model.predict([file])
print("WER: " + str(round(float(wer([reference], predictions)) * 100)) + "%")

# 5. Save the model!
trainer.save_checkpoint("speech_recognition_model.pt")
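
As a debugging aid (my suggestion, not part of the original script), one can list which parameters ZeRO stage 3 has partitioned. DeepSpeed attaches a ds_shape attribute recording each parameter's declared shape, while the runtime tensor holds only the local partition; a parameter whose full-shape gradient is accumulated into such a partitioned buffer would hit the broadcast error above.

import torch

# Hypothetical helper, assuming it runs on a rank after DeepSpeed has wrapped
# the model: print parameters whose runtime shape differs from the shape they
# were declared with.
def report_partitioned_params(model: torch.nn.Module) -> None:
    for name, p in model.named_parameters():
        ds_shape = getattr(p, "ds_shape", None)  # set by DeepSpeed ZeRO-3
        if ds_shape is not None and tuple(p.shape) != tuple(ds_shape):
            print(f"{name}: runtime {tuple(p.shape)} vs declared {tuple(ds_shape)}")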

Error Traceback

File "train.py", line 38, in <module>

  File "/usr/local/lib/python3.7/dist-packages/flash/core/trainer.py", line 128, in fit
    return super().fit(model, train_dataloader, val_dataloaders, datamodule)
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/trainer/trainer.py", line 552, in fit
    self._run(model)
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/trainer/trainer.py", line 922, in _run
    self._dispatch()
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/trainer/trainer.py", line 990, in _dispatch
    self.accelerator.start_training(self)
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/accelerators/accelerator.py", line 92, in start_training
    self.training_type_plugin.start_training(trainer)
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 161, in start_training
    self._results = trainer.run_stage()
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/trainer/trainer.py", line 1000, in run_stage
    return self._run_train()
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/trainer/trainer.py", line 1049, in _run_train
    self.fit_loop.run()
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/loops/base.py", line 111, in run
    self.advance(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/loops/fit_loop.py", line 200, in advance
    epoch_output = self.epoch_loop.run(train_dataloader)
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/loops/base.py", line 111, in run
    self.advance(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 130, in advance
    batch_output = self.batch_loop.run(batch, self.iteration_count, self._dataloader_idx)
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 100, in run
    super().run(batch, batch_idx, dataloader_idx)
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/loops/base.py", line 111, in run
    self.advance(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 147, in advance
    result = self._run_optimization(batch_idx, split_batch, opt_idx, optimizer)
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 201, in _run_optimization
    self._optimizer_step(optimizer, opt_idx, batch_idx, closure)
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 403, in _optimizer_step
    using_lbfgs=is_lbfgs,
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/core/lightning.py", line 1616, in optimizer_step
    optimizer.step(closure=optimizer_closure)
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/core/optimizer.py", line 206, in step
    self.__optimizer_step(closure=closure, profiler_name=profiler_name, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/core/optimizer.py", line 128, in __optimizer_step
    trainer.accelerator.optimizer_step(self._optimizer, self._optimizer_idx, lambda_closure=closure, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/accelerators/accelerator.py", line 293, in optimizer_step
    self.lightning_module, optimizer, opt_idx, lambda_closure, **kwargs
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/plugins/precision/deepspeed_precision.py", line 46, in pre_optimizer_step
    result = lambda_closure()  # DeepSpeed does not support closures
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 235, in _training_step_and_backward_closure
    result = self.training_step_and_backward(split_batch, batch_idx, opt_idx, optimizer, hiddens)
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 548, in training_step_and_backward
    self.backward(result, optimizer, opt_idx)
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 589, in backward
    result.closure_loss = self.trainer.accelerator.backward(result.closure_loss, optimizer, *args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/accelerators/accelerator.py", line 276, in backward
    self.precision_plugin.backward(self.lightning_module, closure_loss, optimizer, *args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/plugins/precision/deepspeed_precision.py", line 66, in backward
    deepspeed_engine.backward(closure_loss, *args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/deepspeed/runtime/engine.py", line 1408, in backward
    self.optimizer.backward(loss)
  File "/usr/local/lib/python3.7/dist-packages/deepspeed/runtime/zero/stage3.py", line 2976, in backward
    self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
  File "/usr/local/lib/python3.7/dist-packages/deepspeed/runtime/fp16/loss_scaler.py", line 53, in backward
    scaled_loss.backward(retain_graph=retain_graph)
  File "/usr/local/lib/python3.7/dist-packages/torch/_tensor.py", line 255, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/usr/local/lib/python3.7/dist-packages/torch/autograd/__init__.py", line 149, in backward
    allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
RuntimeError: output with shape [1] doesn't match the broadcast shape [1024, 64, 128]

Expected behavior

Training should start and run to completion instead of failing with the shape-mismatch error during backward.

Environment

Additional context

I am able to successfully train the base model using fairseq, but I'm trying to train the robust model with DeepSpeed stage 3; any help would be appreciated.
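
One isolation test worth trying (my suggestion, not from the original report) is rerunning the same script with stage 2, which shards optimizer state and gradients but leaves parameters unpartitioned, to confirm the failure is specific to stage-3 parameter partitioning:

trainer = flash.Trainer(
    max_epochs=1,
    accelerator="deepspeed",
    plugins=plugins.DeepSpeedPlugin(stage=2),  # no parameter partitioning
    gpus=torch.cuda.device_count(),
    precision=16,
)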

choclatier commented 2 years ago

Reproducible with this Google Colab file: https://colab.research.google.com/drive/1Je0_9D1iWB2C_BGOg-r_kqiX4owuZuro?usp=sharing

choclatier commented 2 years ago

Hello @carmocca @awaelchli @rohitgr7, I realize this is probably a PyTorch Lightning problem. If you could suggest a fix, I could possibly help implement it. https://github.com/PyTorchLightning/pytorch-lightning/blob/4a4a27db05fe977af0173d00a86f4da230a9e4eb/pytorch_lightning/plugins/precision/deepspeed_precision.py

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.