catalyst-team / catalyst

Accelerated deep learning R&D
https://catalyst-team.com
Apache License 2.0

ControlFlowCallback error in DDP #1317

Closed. ivan-chai closed this issue 3 years ago

ivan-chai commented 3 years ago

🐛 Bug Report

ControlFlowCallback can't be pickled because of the lambdas defined inside the function _filter_fn_from_loaders.

It works fine when callbacks are created in get_callbacks, but fails when callbacks are passed directly to the SupervisedRunner.train method.

  File "/usr/local/lib/python3.6/dist-packages/catalyst/runners/runner.py", line 515, in train
    self.run()
  File "/usr/local/lib/python3.6/dist-packages/catalyst/core/runner.py", line 854, in run
    self._run_event("on_exception")
  File "/usr/local/lib/python3.6/dist-packages/catalyst/core/runner.py", line 788, in _run_event
    getattr(self, event)(self)
  File "/usr/local/lib/python3.6/dist-packages/catalyst/core/runner.py", line 780, in on_exception
    raise self.exception
  File "/usr/local/lib/python3.6/dist-packages/catalyst/core/runner.py", line 850, in run
    self._run_experiment()
  File "/usr/local/lib/python3.6/dist-packages/catalyst/core/runner.py", line 840, in _run_experiment
    self.engine.spawn(self._run_stage)
  File "/usr/local/lib/python3.6/dist-packages/catalyst/engines/torch.py", line 460, in spawn
    fn, args=(self._world_size,), nprocs=self._world_size, join=True
  File "/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py", line 179, in start_processes
    process.start()
  File "/usr/lib/python3.6/multiprocessing/process.py", line 105, in start
    self._popen = self._Popen(self)
  File "/usr/lib/python3.6/multiprocessing/context.py", line 284, in _Popen
    return Popen(process_obj)
  File "/usr/lib/python3.6/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/usr/lib/python3.6/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/usr/lib/python3.6/multiprocessing/popen_spawn_posix.py", line 47, in _launch
    reduction.dump(process_obj, fp)
  File "/usr/lib/python3.6/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object '_filter_fn_from_loaders.<locals>.<lambda>'
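
The same failure can be reproduced without Catalyst or DDP. The snippet below is a minimal sketch: make_filter is a hypothetical stand-in for _filter_fn_from_loaders (not the library's actual code), and can_pickle is a helper introduced only for illustration. It shows that a lambda created inside a function is callable but cannot be serialized by the standard pickle module, which the spawn start method relies on in the traceback above.

```python
import pickle


def make_filter(loaders):
    # Hypothetical stand-in for _filter_fn_from_loaders: it returns a lambda
    # defined inside a function, i.e. a "local object".
    return lambda stage, epoch, loader: loader in loaders


def can_pickle(obj):
    """Return True if the standard pickle module can serialize obj."""
    try:
        pickle.dumps(obj)
        return True
    except (AttributeError, pickle.PicklingError):
        # The exact exception type differs between Python versions.
        return False


filter_fn = make_filter(["valid"])
print(filter_fn("train", 1, "valid"))  # the lambda itself can be called
print(can_pickle(filter_fn))           # whether pickle can serialize it
```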

Environment

Collecting environment information...
Catalyst version: 21.09
PyTorch version: 1.9.1+cu102
Is debug build: No
CUDA used to build PyTorch: 10.2
TensorFlow version: N/A
TensorBoard version: 2.6.0

OS: linux
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
CMake version: Could not collect

Python version: 3.6
Is CUDA available: Yes
CUDA runtime version: Could not collect
GPU models and configuration: 
GPU 0: Tesla V100-SXM2-32GB
GPU 1: Tesla V100-SXM2-32GB
GPU 2: Tesla V100-SXM2-32GB
GPU 3: Tesla V100-SXM2-32GB

Nvidia driver version: 455.45.01
cuDNN version: Could not collect

Versions of relevant libraries:
[pip3] catalyst==21.9
[pip3] numpy==1.19.5
[pip3] tensorboard==2.6.0
[pip3] tensorboard-data-server==0.6.1
[pip3] tensorboard-plugin-wit==1.8.0
[pip3] tensorboardX==2.2
[pip3] torch==1.9.1
[pip3] torchvision==0.10.1
[conda] Could not collect

Nimrais commented 3 years ago

I think the problem is that we can't pickle the lambda function. You will need to import dill or something like it and use that instead of the native pickle module.

So importing dill at the start of your script will probably help.

ivan-chai commented 3 years ago

I tried dill, but unfortunately multiprocessing and dill don't work together.

ivan-chai commented 3 years ago

One possible solution is to replace lambdas with callable objects.

Scitator commented 3 years ago

So the custom callable-based workaround works for now, right? @ivan-chai

ivan-chai commented 3 years ago

Yes, I made a callable object and it works fine.

Scitator commented 3 years ago

@ditwoo, as the father of ControlFlow, do you have any suggestions?

Scitator commented 3 years ago

@asteyo could you please help with this issue? I think refactoring the filter functions from

def _filter_fn_from_XXX({params}):
    {filter-logic}

to

class _filter_fn_from_XXX:
    def __init__(self, {params}):
        # store {params} on self so that __call__ can use them
        pass

    def __call__(self, stage, epoch, loader):
        {filter-logic}

should solve the issue 🚀
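
A concrete instance of this refactoring might look like the sketch below. The class name FilterByLoaders and its loaders parameter are hypothetical and are not taken from the Catalyst code base; only the (stage, epoch, loader) call signature follows the template above. Because the class is defined at module level and keeps its parameters as instance attributes, the standard pickle module can serialize it.

```python
import pickle


class FilterByLoaders:
    """Picklable replacement for a loader-filtering lambda (hypothetical name)."""

    def __init__(self, loaders):
        # Store the parameters on the instance so they travel with the pickle.
        self.loaders = set(loaders)

    def __call__(self, stage, epoch, loader):
        # Same three-argument signature as in the template above.
        return loader in self.loaders


filter_fn = FilterByLoaders(["valid"])

# Round-trip through the standard pickle module, which is what the spawn
# start method uses to send objects to child processes.
restored = pickle.loads(pickle.dumps(filter_fn))
print(restored("train", 1, "valid"), restored("train", 1, "train"))
```

The design choice here is that pickle stores a module-level class by its importable name plus the instance state, whereas a lambda created inside a function has no importable name to refer to.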

Scitator commented 3 years ago

Should be fixed in 21.10.