Lightning-AI / pytorch-lightning

Pretrain, finetune ANY AI model of ANY size on multiple GPUs, TPUs with zero code changes.
https://lightning.ai
Apache License 2.0

Support `DDP(static_graph=True)` and gradient accumulation #19354

Open nousr opened 10 months ago

nousr commented 10 months ago

I got `SystemError: <built-in method run_backward of torch._C._EngineBase object at 0x7f22c43552b0> returned NULL without setting an error` when setting accumulate_grad_batches = 2, but I see nothing helpful in the log.

The error goes away when switching to DDPStrategy(static_graph=False), setting accumulate_grad_batches back to 1, or using batch_size=3 (total len(data) = 9).

I wonder if there is some conflict between DDPStrategy(static_graph=True), accumulate_grad_batches, and batch_size.

I want to keep static_graph=True because I am using .gradient_checkpointing_enable().

Any help would be appreciated.
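
A plausible mechanism for the conflict, stated here as an assumption rather than a confirmed diagnosis: Lightning is understood to implement accumulate_grad_batches under DDP by wrapping the non-stepping micro-batches in the module's no_sync() context so that the gradient all-reduce is skipped, whereas static_graph=True tells DDP to record the autograd graph on the first iteration and assume later iterations replay it unchanged. The raw-PyTorch sketch below shows the two mechanisms side by side (process-group and model setup omitted; the helper name train_with_accumulation is made up for illustration):

import contextlib
from torch.nn.parallel import DistributedDataParallel as DDP

def train_with_accumulation(ddp_model: DDP, batches, optimizer, accumulate_grad_batches=2):
    # ddp_model is assumed to be DDP(model, static_graph=True); with
    # static_graph, DDP records the autograd graph on the first iteration.
    for i, (inputs, labels) in enumerate(batches):
        stepping = (i + 1) % accumulate_grad_batches == 0
        # On non-stepping batches the gradient all-reduce is skipped via
        # no_sync(); this is the mechanism Lightning is believed to use for
        # accumulate_grad_batches, and it makes the backward pass behave
        # differently from one iteration to the next.
        sync_ctx = contextlib.nullcontext() if stepping else ddp_model.no_sync()
        with sync_ctx:
            # Assumes an HF-style model whose output carries a .loss field;
            # dividing by the accumulation factor averages the micro-batches.
            loss = ddp_model(inputs, labels=labels).loss / accumulate_grad_batches
            loss.backward()
        if stepping:
            optimizer.step()
            optimizer.zero_grad()

The full traceback: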

Epoch 0:  40%|█████████████████████████████▏                                           | 2/5 [00:00<00:01,  2.05it/s, v_num=0]Traceback (most recent call last):
  File "untitled.py", line 62, in <module>
    trainer.fit(MM, train_dataloaders=train_loader)
  File "/usr/local/lib/python3.9/site-packages/lightning/pytorch/trainer/trainer.py", line 520, in fit
    call._call_and_handle_interrupt(
  File "/usr/local/lib/python3.9/site-packages/lightning/pytorch/trainer/call.py", line 42, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/lightning/pytorch/strategies/launchers/subprocess_script.py", line 92, in launch
    return function(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/lightning/pytorch/trainer/trainer.py", line 559, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/usr/local/lib/python3.9/site-packages/lightning/pytorch/trainer/trainer.py", line 935, in _run
    results = self._run_stage()
  File "/usr/local/lib/python3.9/site-packages/lightning/pytorch/trainer/trainer.py", line 978, in _run_stage
    self.fit_loop.run()
  File "/usr/local/lib/python3.9/site-packages/lightning/pytorch/loops/fit_loop.py", line 201, in run
    self.advance()
  File "/usr/local/lib/python3.9/site-packages/lightning/pytorch/loops/fit_loop.py", line 354, in advance
    self.epoch_loop.run(self._data_fetcher)
  File "/usr/local/lib/python3.9/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 133, in run
    self.advance(data_fetcher)
  File "/usr/local/lib/python3.9/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 218, in advance
    batch_output = self.automatic_optimization.run(trainer.optimizers[0], kwargs)
  File "/usr/local/lib/python3.9/site-packages/lightning/pytorch/loops/optimization/automatic.py", line 178, in run
    closure()
  File "/usr/local/lib/python3.9/site-packages/lightning/pytorch/loops/optimization/automatic.py", line 140, in __call__
    self._result = self.closure(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/lightning/pytorch/loops/optimization/automatic.py", line 135, in closure
    self._backward_fn(step_output.closure_loss)
  File "/usr/local/lib/python3.9/site-packages/lightning/pytorch/loops/optimization/automatic.py", line 233, in backward_fn
    call._call_strategy_hook(self.trainer, "backward", loss, optimizer)
  File "/usr/local/lib/python3.9/site-packages/lightning/pytorch/trainer/call.py", line 288, in _call_strategy_hook
    output = fn(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/lightning/pytorch/strategies/strategy.py", line 199, in backward
    self.precision_plugin.backward(closure_loss, self.lightning_module, optimizer, *args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/lightning/pytorch/plugins/precision/precision_plugin.py", line 67, in backward
    model.backward(tensor, *args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/lightning/pytorch/core/module.py", line 1054, in backward
    loss.backward(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/torch/_tensor.py", line 487, in backward
    torch.autograd.backward(
  File "/usr/local/lib/python3.9/site-packages/torch/autograd/__init__.py", line 200, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
SystemError: <built-in method run_backward of torch._C._EngineBase object at 0x7f22c43552b0> returned NULL without setting an error
Epoch 0:  40%|████      | 2/5 [00:02<00:03,  1.05s/it, v_num=0]

Minimal code to reproduce the error:

import torch
from transformers import BertTokenizer, BertForSequenceClassification
import lightning.pytorch as pl
from lightning.pytorch.strategies import DDPStrategy

name = "hfl/chinese-roberta-wwm-ext"

class AAA(pl.LightningModule):
    def __init__(self, **kwargs):
        super().__init__()
        self.model = BertForSequenceClassification.from_pretrained(name, num_labels=2)

    def forward(self, *inputs):
        outputs = self.model(inputs[0], attention_mask=inputs[1], labels=inputs[2])
        loss = outputs.loss
        return (loss, outputs)

    def training_step(self, batch, batch_idx):
        outputs = self(*batch)
        loss = outputs[0]
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.model.parameters(), lr=1e-5)

MM = AAA()
# Toy dataset: each string means "This is the N-th training sample" in Chinese
train_texts = ['这是第一条训练数据', '这是第二条训练数据', '这是第三条训练数据', '这是第四条训练数据', '这是第五条训练数据', '这是第六条训练数据', '这是第七条训练数据', '这是第八条训练数据', '这是第九条训练数据']
train_labels = [1, 0, 1, 1, 0, 1, 1, 1, 1]
tokenizer = BertTokenizer.from_pretrained(name)
train_encodings = tokenizer(train_texts, truncation=True, padding=True)
train_dataset = torch.utils.data.TensorDataset(
    torch.tensor(train_encodings['input_ids']),
    torch.tensor(train_encodings['attention_mask']),
    torch.tensor(train_labels)
)
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=2, shuffle=True)

trainer = pl.Trainer(
    accelerator="auto",
    devices="auto",
    strategy=DDPStrategy(static_graph=True),  # kept True for gradient checkpointing
    precision="16-mixed",
    num_sanity_val_steps=0,
    max_epochs=10,
    deterministic="warn",
    accumulate_grad_batches=2,  # together with static_graph=True, triggers the SystemError
)

trainer.fit(MM, train_dataloaders=train_loader)

Environment:

torch                     2.0.1
torchaudio                2.0.2
torchvision               0.15.2
lightning                 2.0.2
transformers              4.30.2

Originally posted by @iamlockelightning in https://github.com/Lightning-AI/pytorch-lightning/discussions/18080


I'm also observing this issue in the latest version of pytorch-lightning (2.1.3).

cc @justusschock @awaelchli
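
Until the interaction is fixed, one possible workaround, given as an untested sketch under the assumption that the clash comes from the no_sync-based accumulation path: keep static_graph=True but accumulate gradients by hand under manual optimization, leaving the Trainer's accumulate_grad_batches at its default of 1. The class name ManualAccumulation is made up; the batch layout matches the TensorDataset in the repro above.

import lightning.pytorch as pl
import torch

class ManualAccumulation(pl.LightningModule):
    # Sketch: hand-rolled gradient accumulation under manual optimization,
    # so Lightning's no_sync-based accumulation path is never used.
    def __init__(self, model, accumulate_grad_batches=2):
        super().__init__()
        self.automatic_optimization = False  # we call backward/step ourselves
        self.model = model
        self.accumulate_grad_batches = accumulate_grad_batches

    def training_step(self, batch, batch_idx):
        input_ids, attention_mask, labels = batch
        loss = self.model(input_ids, attention_mask=attention_mask, labels=labels).loss
        # Scale so the accumulated gradient matches one large batch.
        self.manual_backward(loss / self.accumulate_grad_batches)
        if (batch_idx + 1) % self.accumulate_grad_batches == 0:
            optimizer = self.optimizers()
            optimizer.step()
            optimizer.zero_grad()
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.model.parameters(), lr=1e-5)

Every micro-batch then triggers its own gradient all-reduce, so this trades some communication overhead for staying on the graph DDP recorded in the first iteration.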

nik777 commented 10 months ago

@awaelchli If help is still wanted, please assign this issue to me. I have a bit of time to work on it.

awaelchli commented 10 months ago

Of course, @nik777, please go ahead, that would be great! Not to discourage you, of course, but I think it might be a hard one to solve :)

tsteternlieb commented 6 months ago

Any progress on this? Thanks so much!