Lightning-AI / pytorch-lightning

Pretrain, finetune ANY AI model of ANY size on multiple GPUs, TPUs with zero code changes.
https://lightning.ai
Apache License 2.0

trainer.fit() stuck with accelerator set to "ddp" #5961

Closed · ifsheldon closed 3 years ago

ifsheldon commented 3 years ago

🐛 Bug

The problem is that trainer.fit() with accelerator set to ddp takes an extremely long time doing something before it gets the CPUs and GPUs working, and I cannot interrupt the kernel but have to restart it.

Please reproduce using the BoringModel

To Reproduce

I tried the BoringModel and can reproduce the issue.

The only modification I made is in the "Define the test" section. The code is below:

def test_x(tmpdir):
    # init model
    model = BoringModel()

    # Initialize a trainer
    trainer = pl.Trainer(
        max_epochs=1,
        progress_bar_refresh_rate=20,
        gpus=4,            # added to use 4 gpus
        accelerator='ddp'  # added to use ddp
    )

    # Train the model ⚡
    trainer.fit(model, train, val)

    trainer.test(test_dataloaders=test)
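
For context, the train, val, and test dataloaders used in this snippet come from the BoringModel notebook. They are plain random-tensor dataloaders, roughly like the sketch below (the dataset sizes and batch size here are illustrative, not copied from the notebook).

# Illustrative sketch of the BoringModel-style dataloaders assumed by the snippet
# above; the sizes and batch size are placeholders, not the notebook's exact values.
import torch
from torch.utils.data import DataLoader, Dataset


class RandomDataset(Dataset):
    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len


train = DataLoader(RandomDataset(32, 64), batch_size=2)
val = DataLoader(RandomDataset(32, 64), batch_size=2)
test = DataLoader(RandomDataset(32, 64), batch_size=2)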

My code that initially ran into this issue is in the discussion post.

Expected behavior

The training should start within a couple of minutes, but trainer.fit() is stuck while the GPUs and CPUs stay idle.

Environment

My environment, as detected by the official python script, is below. I run my code on a shared GPU cluster after applying for computation resources; I usually request 512 GB of memory, 32 cores, and 4 V100s. The environment is managed by my personal conda setup and does not touch anyone else's environment. If you want to know more about the configuration, just let me know.

(torch) [liangf@gpu208-14 liangf]$ python collect_env_details.py
* CUDA:
        - GPU:
                - Tesla V100-SXM2-32GB
                - Tesla V100-SXM2-32GB
                - Tesla V100-SXM2-32GB
                - Tesla V100-SXM2-32GB
        - available:         True
        - version:           11.0
* Packages:
        - numpy:             1.20.0
        - pyTorch_debug:     False
        - pyTorch_version:   1.7.1
        - pytorch-lightning: 1.1.8
        - tqdm:              4.56.0
* System:
        - OS:                Linux
        - architecture:
                - 64bit
                - ELF
        - processor:         x86_64
        - python:            3.9.1
        - version:           #1 SMP Tue Nov 17 13:59:11 UTC 2020

Additional context

If I change the code above to the version below, the trainer "works" as expected. The trainer with accelerator='dp' takes less than a minute to get everything set up and keeps the CPUs and GPUs busy, while the one with accelerator='ddp' takes 10 minutes or more and never gets things running before I lose my patience.

def test_x(tmpdir):
    # init model
    model = BoringModel()

    # Initialize a trainer
    trainer = pl.Trainer(
        max_epochs=1,
        progress_bar_refresh_rate=20,
        gpus=4,           # added to use 4 gpus
        accelerator='dp'  # changed to use dp instead of ddp
    )

    # Train the model ⚡
    trainer.fit(model, train, val)

    trainer.test(test_dataloaders=test)

By "works" I mean it gets the GPUs running, but later a runtime error is thrown. I think this is a separate issue, perhaps because the code in the BoringModel notebook is not runnable in a multi-GPU environment. However, I don't know the cause for sure, since I am just moving from plain PyTorch to PyTorch Lightning, and the code in the notebook looks reasonable to me; one guess is sketched after the traceback below.

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-10-1f9f6fbe4f6c> in <module>
----> 1 test_x(tmpdir)

<ipython-input-9-8b8914eff5a4> in test_x(tmpdir)
     12 
     13     # Train the model ⚡
---> 14     trainer.fit(model, train, val)
     15 
     16     trainer.test(test_dataloaders=test)

~/miniconda3/envs/torch/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py in fit(self, model, train_dataloader, val_dataloaders, datamodule)
    508         self.call_hook('on_fit_start')
    509 
--> 510         results = self.accelerator_backend.train()
    511         self.accelerator_backend.teardown()
    512 

~/miniconda3/envs/torch/lib/python3.9/site-packages/pytorch_lightning/accelerators/accelerator.py in train(self)
     55     def train(self):
     56         self.trainer.setup_trainer(self.trainer.model)
---> 57         return self.train_or_test()
     58 
     59     def teardown(self):

~/miniconda3/envs/torch/lib/python3.9/site-packages/pytorch_lightning/accelerators/accelerator.py in train_or_test(self)
     72         else:
     73             self.trainer.train_loop.setup_training()
---> 74             results = self.trainer.train()
     75         return results
     76 

~/miniconda3/envs/torch/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py in train(self)
    559                 with self.profiler.profile("run_training_epoch"):
    560                     # run train epoch
--> 561                     self.train_loop.run_training_epoch()
    562 
    563                 if self.max_steps and self.max_steps <= self.global_step:

~/miniconda3/envs/torch/lib/python3.9/site-packages/pytorch_lightning/trainer/training_loop.py in run_training_epoch(self)
    548             # ------------------------------------
    549             with self.trainer.profiler.profile("run_training_batch"):
--> 550                 batch_output = self.run_training_batch(batch, batch_idx, dataloader_idx)
    551 
    552             # when returning -1 from train_step, we end epoch early

~/miniconda3/envs/torch/lib/python3.9/site-packages/pytorch_lightning/trainer/training_loop.py in run_training_batch(self, batch, batch_idx, dataloader_idx)
    716 
    717                         # optimizer step
--> 718                         self.optimizer_step(optimizer, opt_idx, batch_idx, train_step_and_backward_closure)
    719 
    720                     else:

~/miniconda3/envs/torch/lib/python3.9/site-packages/pytorch_lightning/trainer/training_loop.py in optimizer_step(self, optimizer, opt_idx, batch_idx, train_step_and_backward_closure)
    483 
    484         # model hook
--> 485         model_ref.optimizer_step(
    486             self.trainer.current_epoch,
    487             batch_idx,

~/miniconda3/envs/torch/lib/python3.9/site-packages/pytorch_lightning/core/lightning.py in optimizer_step(self, epoch, batch_idx, optimizer, optimizer_idx, optimizer_closure, on_tpu, using_native_amp, using_lbfgs)
   1296 
   1297         """
-> 1298         optimizer.step(closure=optimizer_closure)
   1299 
   1300     def optimizer_zero_grad(

~/miniconda3/envs/torch/lib/python3.9/site-packages/pytorch_lightning/core/optimizer.py in step(self, closure, make_optimizer_step, *args, **kwargs)
    284 
    285         if make_optimizer_step:
--> 286             self.__optimizer_step(*args, closure=closure, profiler_name=profiler_name, **kwargs)
    287         else:
    288             # make sure to call optimizer_closure when accumulating

~/miniconda3/envs/torch/lib/python3.9/site-packages/pytorch_lightning/core/optimizer.py in __optimizer_step(self, closure, profiler_name, *args, **kwargs)
    142         else:
    143             with trainer.profiler.profile(profiler_name):
--> 144                 optimizer.step(closure=closure, *args, **kwargs)
    145 
    146         accelerator_backend = trainer.accelerator_backend

~/miniconda3/envs/torch/lib/python3.9/site-packages/torch/optim/lr_scheduler.py in wrapper(*args, **kwargs)
     65                 instance._step_count += 1
     66                 wrapped = func.__get__(instance, cls)
---> 67                 return wrapped(*args, **kwargs)
     68 
     69             # Note that the returned function here is no longer a bound method,

~/miniconda3/envs/torch/lib/python3.9/site-packages/torch/autograd/grad_mode.py in decorate_context(*args, **kwargs)
     24         def decorate_context(*args, **kwargs):
     25             with self.__class__():
---> 26                 return func(*args, **kwargs)
     27         return cast(F, decorate_context)
     28 

~/miniconda3/envs/torch/lib/python3.9/site-packages/torch/optim/sgd.py in step(self, closure)
     84         if closure is not None:
     85             with torch.enable_grad():
---> 86                 loss = closure()
     87 
     88         for group in self.param_groups:

~/miniconda3/envs/torch/lib/python3.9/site-packages/pytorch_lightning/trainer/training_loop.py in train_step_and_backward_closure()
    706 
    707                         def train_step_and_backward_closure():
--> 708                             result = self.training_step_and_backward(
    709                                 split_batch,
    710                                 batch_idx,

~/miniconda3/envs/torch/lib/python3.9/site-packages/pytorch_lightning/trainer/training_loop.py in training_step_and_backward(self, split_batch, batch_idx, opt_idx, optimizer, hiddens)
    814                 # backward pass
    815                 with self.trainer.profiler.profile("model_backward"):
--> 816                     self.backward(result, optimizer, opt_idx)
    817 
    818                 # hook - call this hook only

~/miniconda3/envs/torch/lib/python3.9/site-packages/pytorch_lightning/trainer/training_loop.py in backward(self, result, optimizer, opt_idx, *args, **kwargs)
    840             self.trainer.accelerator_backend.backward(result, optimizer, opt_idx, *args, **kwargs)
    841         else:
--> 842             result.closure_loss = self.trainer.accelerator_backend.backward(
    843                 result.closure_loss, optimizer, opt_idx, *args, **kwargs
    844             )

~/miniconda3/envs/torch/lib/python3.9/site-packages/pytorch_lightning/accelerators/accelerator.py in backward(self, closure_loss, optimizer, opt_idx, *args, **kwargs)
    107             # do backward pass
    108             model = self.trainer.get_model()
--> 109             model.backward(closure_loss, optimizer, opt_idx, *args, **kwargs)
    110 
    111             # once backward has been applied, release graph

~/miniconda3/envs/torch/lib/python3.9/site-packages/pytorch_lightning/core/lightning.py in backward(self, loss, optimizer, optimizer_idx, *args, **kwargs)
   1160         """
   1161         if self.trainer.train_loop.automatic_optimization or self._running_manual_backward:
-> 1162             loss.backward(*args, **kwargs)
   1163 
   1164     def toggle_optimizer(self, optimizer: Optimizer, optimizer_idx: int):

~/miniconda3/envs/torch/lib/python3.9/site-packages/torch/tensor.py in backward(self, gradient, retain_graph, create_graph)
    219                 retain_graph=retain_graph,
    220                 create_graph=create_graph)
--> 221         torch.autograd.backward(self, gradient, retain_graph, create_graph)
    222 
    223     def register_hook(self, hook):

~/miniconda3/envs/torch/lib/python3.9/site-packages/torch/autograd/__init__.py in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables)
    124 
    125     grad_tensors_ = _tensor_or_tensors_to_tuple(grad_tensors, len(tensors))
--> 126     grad_tensors_ = _make_grads(tensors, grad_tensors_)
    127     if retain_graph is None:
    128         retain_graph = create_graph

~/miniconda3/envs/torch/lib/python3.9/site-packages/torch/autograd/__init__.py in _make_grads(outputs, grads)
     48             if out.requires_grad:
     49                 if out.numel() != 1:
---> 50                     raise RuntimeError("grad can be implicitly created only for scalar outputs")
     51                 new_grads.append(torch.ones_like(out, memory_format=torch.preserve_format))
     52             else:

RuntimeError: grad can be implicitly created only for scalar outputs
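
My guess (just an assumption, I have not verified it): under dp, the training_step outputs from each GPU get gathered, so a scalar loss turns into a tensor with one entry per GPU, and loss.backward() then fails with exactly this error. A rough sketch of the training_step_end reduction pattern that the Lightning multi-GPU docs suggest for dp:

# Hedged sketch: reduce the gathered per-GPU losses back to a scalar under dp.
# Whether this resolves the exact failure above is an assumption.
class DPBoringModel(BoringModel):  # BoringModel as defined earlier in the notebook
    def training_step_end(self, outputs):
        # under dp, outputs["loss"] may hold one loss value per GPU
        outputs["loss"] = outputs["loss"].mean()
        return outputs

If that really is the cause, adding this hook to the notebook's model should let the dp run complete.
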
awaelchli commented 3 years ago

I answered in the discussion post about the usage of ddp in a Jupyter environment. About your second error:

By "works" I mean it gets the GPUs running, but later a runtime error is thrown. I think this is a separate issue, perhaps because the code in the BoringModel notebook is not runnable in a multi-GPU environment.

The BoringModel runs fine on multi-GPU (current code from this repo); I can confirm that on master and also with your version 1.1.8.

Please post your full boring model as one script so that I can run it.

ifsheldon commented 3 years ago

The only modifications I made were gpus=4 and the accelerator, but anyway, my notebook is here and you can see the traceback and environment settings. I ran it with JupyterLab on a shared GPU cluster. Neither accelerator='dp' nor accelerator='ddp_spawn' works, which is weird.

sooperset commented 3 years ago

Have you tried pytorch-lightning 1.1.6? For me, ddp training gets stuck on 1.1.7 and later. I also wonder what the cause is.

tchaton commented 3 years ago

Dear @ifsheldon,

ddp does not work from inside a notebook, if that is how you are running it.
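
The usual workaround is to move the Trainer code into a standalone script and launch it from a terminal; a rough sketch (the file name and the boring_model import are illustrative, not code from this thread):

# train_boring.py - illustrative standalone launcher.
import pytorch_lightning as pl
from torch.utils.data import DataLoader

from boring_model import BoringModel, RandomDataset  # hypothetical module holding the notebook code


def main():
    train = DataLoader(RandomDataset(32, 64), batch_size=2)
    val = DataLoader(RandomDataset(32, 64), batch_size=2)

    trainer = pl.Trainer(
        max_epochs=1,
        progress_bar_refresh_rate=20,
        gpus=4,
        accelerator='ddp',
    )
    trainer.fit(BoringModel(), train, val)


if __name__ == "__main__":
    # keep the entry point guarded: ddp starts extra processes that re-run this script
    main()

Launching it with python train_boring.py from the cluster shell avoids the notebook limitation.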

Best, T.C

IhabBendidi commented 3 years ago

I have had this kind of issue (note: I'm working from a terminal on a server, so I'm not in a notebook). Training just got stuck after two epochs when using ddp. I tried a couple of things that didn't work, including reducing the number of workers in the DataLoader. This only happened when I used 3 GPUs; with two GPUs it didn't happen. CUDA version 10.2. I also tried on another server, and the same issue repeated itself.

Have you tried pytorch-lightning 1.1.6? For me, ddp training gets stuck on 1.1.7 and later. I also wonder what the cause is.

I tried uninstalling 1.1.7 and installing 1.1.6, and it worked without any issue!

carmocca commented 3 years ago

Hi @IhabBendidi, can you check if this https://github.com/PyTorchLightning/pytorch-lightning/issues/5604#issuecomment-785314359 fixes your problem?

That might be a more appropriate issue than this one.

IhabBendidi commented 3 years ago

Hi @IhabBendidi, can you check if this #5604 (comment) fixes your problem?

That might be a more appropriate issue than this one.

That solved it, thanks!