NVIDIA / MinkowskiEngine

Minkowski Engine is an auto-diff neural network library for high-dimensional sparse tensors
https://nvidia.github.io/MinkowskiEngine

MinkowskiConvolutionFunction cannot support multi-gpu? #503

Open yichaoshen-MS opened 2 years ago

yichaoshen-MS commented 2 years ago

Describe the bug: When training a network on multiple GPUs with PyTorch Lightning, the run fails, apparently because "MinkowskiConvolutionFunction" cannot be pickled ("TypeError: cannot pickle 'MinkowskiConvolutionFunction' object").

Moreover, I found a PR (https://github.com/NVIDIA/MinkowskiEngine/pull/139) that claims to fix this bug and has been merged, but it does not seem to work, and I cannot find its changes in the latest code (v0.5.4).


  File "/root/anaconda3/envs/mask3d/lib/python3.10/site-packages/pytorch_lightning/strategies/launchers/multiprocessing.py", line 103, in launch
    mp.start_processes(
  File "/root/anaconda3/envs/mask3d/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 189, in start_processes
    process.start()
  File "/root/anaconda3/envs/mask3d/lib/python3.10/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
  File "/root/anaconda3/envs/mask3d/lib/python3.10/multiprocessing/context.py", line 288, in _Popen
    return Popen(process_obj)
  File "/root/anaconda3/envs/mask3d/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/root/anaconda3/envs/mask3d/lib/python3.10/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/root/anaconda3/envs/mask3d/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 47, in _launch
    reduction.dump(process_obj, fp)
  File "/root/anaconda3/envs/mask3d/lib/python3.10/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
TypeError: cannot pickle 'MinkowskiConvolutionFunction' object
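
The traceback shows the failure happening while the spawn launcher pickles the process object, which includes the model. A minimal sketch of that failure mode, using only the standard library: `threading.Lock` is an assumed stand-in for the unpicklable function object, and `EagerLayer` and `pickle_error` are hypothetical names, not MinkowskiEngine code.

```python
import pickle
import threading


class EagerLayer:
    """Toy layer that stores an unpicklable helper as an attribute."""

    def __init__(self):
        self.weight = [0.0, 1.0]
        # Stand-in (assumption) for an attribute that pickle cannot handle.
        self.conv = threading.Lock()


def pickle_error(obj):
    """Return the pickling error message, or None if obj pickles cleanly."""
    try:
        pickle.dumps(obj)
    except TypeError as exc:
        return str(exc)
    return None


if __name__ == "__main__":
    # Pickling the layer fails because of the single attribute it holds.
    print(pickle_error(EagerLayer()))
```

Any object reachable from the module's attributes gets pickled this way, so one unpicklable attribute is enough to stop the launch.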



mcmingchang commented 1 year ago

same problem

Charlescai123 commented 1 year ago

same

vgthengane commented 1 year ago

Has anyone solved this?

ttlzfhy commented 1 year ago

same problem

ttlzfhy commented 1 year ago

For Minkowski Engine version 0.5.4, I tried changing MinkowskiConvolution.py along the lines of https://github.com/NVIDIA/MinkowskiEngine/pull/139, although the code differs slightly.

Specifically, I deleted lines 281 to 285 (the self.conv assignment in the __init__() function of class MinkowskiConvolutionBase) and lines 314 to 322 (outfeat = self.conv.apply(.....) in the forward() function of class MinkowskiConvolutionBase).

Then I added the following code after line 322:

            if self.is_transpose:
                conv = MinkowskiConvolutionTransposeFunction()
            else:
                conv = MinkowskiConvolutionFunction()
            outfeat = conv.apply(
                input.F,
                self.kernel,
                self.kernel_generator,
                self.convolution_mode,
                input.coordinate_map_key,
                out_coordinate_map_key,
                input._manager,
            )

With this change, the "TypeError: cannot pickle 'MinkowskiConvolutionFunction' object" error seems to be fixed.
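
The idea behind this change is to create the function object inside forward() instead of storing it on the module, so it never becomes part of the pickled state. A standard-library sketch of that pattern, under the same assumption that `threading.Lock` stands in for the unpicklable object (`LazyLayer` and `round_trip` are hypothetical names):

```python
import pickle
import threading


class LazyLayer:
    """Toy layer that builds its unpicklable helper only when called."""

    def __init__(self):
        self.weight = [0.0, 1.0]  # plain data, pickles without trouble

    def forward(self, x):
        # The helper is a local variable, so it never lands in __dict__
        # and is never seen by pickle.
        helper = threading.Lock()
        with helper:
            return [w * x for w in self.weight]


def round_trip(obj):
    """Serialize and restore obj, as a spawned worker would receive it."""
    return pickle.loads(pickle.dumps(obj))


if __name__ == "__main__":
    clone = round_trip(LazyLayer())
    print(clone.forward(2))
```

The restored copy behaves like the original because only the plain attributes cross the process boundary.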

However, another error appears:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/gaolinyao/anaconda3/envs/py91/lib/python3.9/multiprocessing/spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "/home/gaolinyao/anaconda3/envs/py91/lib/python3.9/multiprocessing/spawn.py", line 126, in _main
    self = reduction.pickle.load(from_parent)
  File "/home/gaolinyao/anaconda3/envs/py91/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1151, in __setstate__
    self.__dict__.update(state)
ValueError: dictionary update sequence element #0 has length 12; 2 is required
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/gaolinyao/anaconda3/envs/py91/lib/python3.9/multiprocessing/spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "/home/gaolinyao/anaconda3/envs/py91/lib/python3.9/multiprocessing/spawn.py", line 126, in _main
    self = reduction.pickle.load(from_parent)
  File "/home/gaolinyao/anaconda3/envs/py91/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1151, in __setstate__
    self.__dict__.update(state)
ValueError: dictionary update sequence element #0 has length 12; 2 is required
Traceback (most recent call last):
  File "/home/gaolinyao/sparsepcgc/examples/multigpu_lightning.py", line 208, in <module>
    trainer.fit(pl_module)
  File "/home/gaolinyao/anaconda3/envs/py91/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 608, in fit
    call._call_and_handle_interrupt(
  File "/home/gaolinyao/anaconda3/envs/py91/lib/python3.9/site-packages/pytorch_lightning/trainer/call.py", line 36, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "/home/gaolinyao/anaconda3/envs/py91/lib/python3.9/site-packages/pytorch_lightning/strategies/launchers/multiprocessing.py", line 113, in launch
    mp.start_processes(
  File "/home/gaolinyao/anaconda3/envs/py91/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/home/gaolinyao/anaconda3/envs/py91/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 139, in join
    raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 1 terminated with exit code 1
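
The ValueError is raised by `self.__dict__.update(state)`, which suggests that the unpickled state was not a mapping. The sketch below only reproduces the message with a plain dict; which object in the model produces such a state is not known from this traceback. `update_error` is a hypothetical helper.

```python
def update_error(state):
    """Apply state to an empty dict the way __setstate__ does.

    Returns the ValueError message, or None if the update succeeds.
    """
    target = {}
    try:
        target.update(state)
    except ValueError as exc:
        return str(exc)
    return None


if __name__ == "__main__":
    # A mapping updates cleanly.
    print(update_error({"training": True}))
    # A sequence is read as (key, value) pairs, so an element with
    # 12 items instead of 2 is rejected.
    print(update_error([tuple(range(12))]))
```

So the state reaching `__setstate__` was likely a sequence whose first element has 12 items, rather than the attribute dict the method expects.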

Has anyone solved this?