catalyst-team / catalyst

Accelerated deep learning R&D
https://catalyst-team.com
Apache License 2.0

`runner.evaluate_loader` does not work with DataParallelEngine #1422

Closed: ShuhuaGao closed this issue 2 years ago

ShuhuaGao commented 2 years ago

🐛 Bug Report

How To Reproduce

I have two GPUs, and both of them are enabled. I copied the linear regression minimal example. After that, I checked:

runner.engine
# <catalyst.engines.torch.DataParallelEngine at 0x7f67e25d72b0>
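
For reference, the setup was essentially the minimal linear regression example; the rough sketch below is reconstructed from the README, so exact hyperparameters may differ:

import torch
from torch.utils.data import DataLoader, TensorDataset
from catalyst import dl

# toy regression data
num_samples, num_features = 10_000, 10
X = torch.rand(num_samples, num_features)
y = torch.rand(num_samples, 1)
loader = DataLoader(TensorDataset(X, y), batch_size=32)
loaders = {"train": loader, "valid": loader}

# model, criterion, optimizer
model = torch.nn.Linear(num_features, 1)
criterion = torch.nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters())

# with two visible GPUs, Catalyst selected DataParallelEngine automatically
# (see the engine check above)
runner = dl.SupervisedRunner()
runner.train(
    model=model,
    criterion=criterion,
    optimizer=optimizer,
    loaders=loaders,
    num_epochs=8,
    verbose=True,
)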

Then, the following line produced a long error message:

runner.evaluate_loader(loaders['valid'])
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
/home/shuhua/GitHub/Learn-DL/catalyst-tutorial/linear-regression.ipynb Cell 10' in <cell line: 1>()
----> 1 runner.evaluate_loader(loaders['valid'])

File ~/miniconda3/lib/python3.9/site-packages/catalyst/runners/runner.py:490, in Runner.evaluate_loader(self, loader, callbacks, model, engine, seed, verbose)
    487     model = self.model
    488 assert model is not None
--> 490 self.train(
    491     model=model,
    492     engine=engine,
    493     loaders=OrderedDict([("valid", loader)]),
    494     num_epochs=1,
    495     verbose=verbose,
    496     callbacks=callbacks,
    497     valid_loader="valid",
    498     seed=seed,
    499 )
    501 return self.loader_metrics

File ~/miniconda3/lib/python3.9/site-packages/catalyst/runners/runner.py:377, in Runner.train(self, loaders, model, engine, criterion, optimizer, scheduler, callbacks, loggers, seed, hparams, num_epochs, logdir, resume, valid_loader, valid_metric, minimize_valid_metric, verbose, timeit, check, overfit, profile, load_best_on_end, cpu, fp16, ddp)
    375 self._load_best_on_end = load_best_on_end
    376 # run
--> 377 self.run()

File ~/miniconda3/lib/python3.9/site-packages/catalyst/core/runner.py:422, in IRunner.run(self)
    420 except (Exception, KeyboardInterrupt) as ex:
    421     self.exception = ex
--> 422     self._run_event("on_exception")
    423 return self

File ~/miniconda3/lib/python3.9/site-packages/catalyst/core/runner.py:365, in IRunner._run_event(self, event)
    363     getattr(callback, event)(self)
    364 if is_str_intersections(event, ("_end", "_exception")):
--> 365     getattr(self, event)(self)

File ~/miniconda3/lib/python3.9/site-packages/catalyst/core/runner.py:357, in IRunner.on_exception(self, runner)
    355 def on_exception(self, runner: "IRunner"):
    356     """Event handler."""
--> 357     raise self.exception

File ~/miniconda3/lib/python3.9/site-packages/catalyst/core/runner.py:419, in IRunner.run(self)
    413 """Runs the experiment.
    414 
    415 Returns:
    416     self, `IRunner` instance after the experiment
    417 """
    418 try:
--> 419     self._run()
    420 except (Exception, KeyboardInterrupt) as ex:
    421     self.exception = ex

File ~/miniconda3/lib/python3.9/site-packages/catalyst/core/runner.py:410, in IRunner._run(self)
    408 def _run(self) -> None:
    409     self.engine = self.get_engine()
--> 410     self.engine.spawn(self._run_local)

File ~/miniconda3/lib/python3.9/site-packages/catalyst/core/engine.py:59, in Engine.spawn(self, fn, *args, **kwargs)
     42 def spawn(self, fn: Callable, *args, **kwargs):
     43     """Spawns processes with specified ``fn`` and ``args``/``kwargs``.
     44 
     45     Args:
   (...)
     57         wrapped function (if needed).
     58     """
---> 59     return fn(*args, **kwargs)

File ~/miniconda3/lib/python3.9/site-packages/catalyst/core/runner.py:405, in IRunner._run_local(self, local_rank, world_size)
    403 self._local_rank, self._world_size = local_rank, world_size
    404 self._run_event("on_experiment_start")
--> 405 self._run_experiment()
    406 self._run_event("on_experiment_end")

File ~/miniconda3/lib/python3.9/site-packages/catalyst/core/runner.py:399, in IRunner._run_experiment(self)
    397     break
...
  File "/home/shuhua/miniconda3/lib/python3.9/site-packages/torch/nn/modules/linear.py", line 103, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument mat1 in method wrapper_addmm)

By contrast, if I restrict the process to one GPU or to the CPU by setting os.environ["CUDA_VISIBLE_DEVICES"], it works.
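
For concreteness, that workaround is just something along these lines, executed before anything touches CUDA:

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # expose a single GPU to the process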

PyTorch DataParallel supports inference on multiple GPUs, right? I don't understand why evaluate_loader fails with DataParallelEngine.
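
As a sanity check outside Catalyst, plain torch.nn.DataParallel inference across both GPUs would be a sketch like this (illustrative, not from the original report):

import torch

model = torch.nn.Linear(10, 1).cuda()      # the module lives on the primary device (cuda:0)
dp_model = torch.nn.DataParallel(model)    # replicated onto all visible GPUs
x = torch.rand(64, 10).cuda()
with torch.no_grad():
    out = dp_model(x)                      # input is scattered across GPUs, outputs gathered on cuda:0
print(out.device)                          # cuda:0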

### Environment

- Catalyst version: 20.04
- PyTorch version: 1.11.0
- Python version: 3.9
- CUDA runtime version: 11.4
- Nvidia driver version: 472.39
Scitator commented 2 years ago

Hi, thanks for the issue! Could you please try using evaluate_loader without any training? :) As far as I can see in our implementation, we just run the experiment... so the problem could be in transferring the model/data from the train experiment to the validation experiment.

ShuhuaGao commented 2 years ago

I tried:

  1. Setting num_epochs=0 in runner.train; the same error occurred.
  2. Commenting out runner.train entirely and changing the evaluate_loader call to
    runner.evaluate_loader(loaders['valid'], model=model)
    (see the sketch below).

With the second attempt there was no error, though the code is not useful because the model was never trained.
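
In code, attempt 2 amounted to roughly the following sketch (no training at all, so the model stays at its random initialization):

# runner.train(...) commented out entirely
runner = dl.SupervisedRunner()
metrics = runner.evaluate_loader(loaders["valid"], model=model)  # no error, but the model is untrained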

Scitator commented 2 years ago

So, it looks like we have some problems with the hardware backend 😢 Maybe @ditwoo @bagxi could also review it :)

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.