ShangtongZhang / DeepRL

Modularized Implementation of Deep RL Algorithms in PyTorch
MIT License
3.21k stars, 684 forks

CUDA multiprocessing error #82

Closed. spacegoing closed this issue 4 years ago.

spacegoing commented 4 years ago

Hi Shangtong,

I hit this error when I run `dqn_pixel(game=game)` in `example.py`. It seems to be related to CUDA multiprocessing. Do you have any suggestions on how to fix it?

```
Traceback (most recent call last):
  File "/usr/local/python/3.7.2/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/home/chli4934/UsydCodeLab/reinf/DeepRL/deep_rl/agent/BaseAgent.py", line 147, in run
    op, data = self.__worker_pipe.recv()
  File "/usr/local/python/3.7.2/lib/python3.7/multiprocessing/connection.py", line 251, in recv
    return _ForkingPickler.loads(buf.getbuffer())
  File "/usr/local/python/3.7.2/lib/python3.7/site-packages/torch/multiprocessing/reductions.py", line 100, in rebuild_cuda_tensor
    torch.cuda._lazy_init()
  File "/usr/local/python/3.7.2/lib/python3.7/site-packages/torch/cuda/__init__.py", line 161, in _lazy_init
    "Cannot re-initialize CUDA in forked subprocess. " + msg)
RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
```

spacegoing commented 4 years ago

This seems to be a fix; I would love to know your opinion: https://github.com/pytorch/pytorch/issues/1494#issuecomment-305993854
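The error message itself recommends the 'spawn' start method. Below is a minimal sketch of that idea using only the standard library: a worker that reads `(op, data)` messages from a pipe, the same pattern the traceback shows in `BaseAgent.py`, but started from a spawn context so the child is a fresh interpreter rather than a fork of the parent. The names `STEP`, `EXIT`, `handle_op`, `worker_loop`, `make_worker_context` and `run_demo` are hypothetical and not part of DeepRL. In a PyTorch setup, importing `torch.multiprocessing` in place of `multiprocessing` should be a drop-in swap, since it re-exports the same API; treat that substitution as an assumption.

```python
import multiprocessing as mp

# Hypothetical op codes, modelled on the (op, data) messages that
# BaseAgent.py receives via self.__worker_pipe.recv() in the traceback.
STEP = 0
EXIT = 1


def handle_op(op, data):
    """Process one message; a stand-in for the real per-step work."""
    if op == STEP:
        return data * 2  # placeholder computation
    raise ValueError(f"unknown op: {op}")


def worker_loop(pipe):
    # Runs in the child. With 'spawn' this is a fresh interpreter, so it
    # does not inherit state (such as an initialized CUDA context) that
    # the parent set up before the child was created.
    while True:
        op, data = pipe.recv()
        if op == EXIT:
            pipe.close()
            return
        pipe.send(handle_op(op, data))


def make_worker_context():
    # Only processes created from this context use 'spawn'; the global
    # default start method is left unchanged.
    return mp.get_context("spawn")


def run_demo(values):
    ctx = make_worker_context()
    parent_end, child_end = ctx.Pipe()
    proc = ctx.Process(target=worker_loop, args=(child_end,))
    proc.start()
    child_end.close()  # the parent only talks through parent_end
    results = []
    for value in values:
        parent_end.send((STEP, value))
        results.append(parent_end.recv())
    parent_end.send((EXIT, None))
    proc.join()
    parent_end.close()
    return results


if __name__ == "__main__":
    # Required with 'spawn': the child re-imports this module, and without
    # the guard it would re-run the launch code, which multiprocessing
    # rejects with a RuntimeError.
    print(run_demo([1, 2, 3]))  # [2, 4, 6]
```

Using `get_context("spawn")` rather than `set_start_method("spawn")` keeps the change local to these workers, which may matter if other code in the same program still relies on fork.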

ShangtongZhang commented 4 years ago

I think I have fixed it, at least in the Dockerfile. Did you try it? I remember spawn leads to some other problems.
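One well-known difference that can make spawn harder to adopt (not necessarily the problem meant here) is that a spawned child receives its `multiprocessing.Process` target and arguments by pickling, whereas a forked child simply inherits them. Lambdas and closures therefore work as targets under fork but fail under spawn. A minimal check, using the hypothetical names `env_step`, `make_env_factory` and `is_spawn_safe`, which only approximates what spawn needs by trying a plain `pickle.dumps`:

```python
import pickle


def env_step(x):
    # Module-level function: pickled by reference (module + name), so a
    # spawned child can import it again.
    return x + 1


def make_env_factory():
    # Returns a closure. multiprocessing.Process can use it as a target
    # under 'fork', where nothing is pickled, but not under 'spawn'.
    offset = 1
    return lambda x: x + offset


def is_spawn_safe(obj):
    """Return True if obj can be pickled, as 'spawn' requires for a
    Process target and its arguments."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


if __name__ == "__main__":
    print(is_spawn_safe(env_step))            # True
    print(is_spawn_safe(make_env_factory()))  # False
```

Module-level functions pass the check because pickle stores them by name; the closure returned by `make_env_factory` has no importable name, so it fails.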

spacegoing commented 4 years ago

Hi Shangtong, sorry for the late reply. I forgot to get back to you.

May I ask what you mean by fixing it in the Dockerfile? I would like to try it but don't know where to start.

Thanks!

ShangtongZhang commented 4 years ago

I mean this error shouldn't occur if you use the Docker environment built from the Dockerfile I provided.

spacegoing commented 4 years ago

Many thanks for your reply! However, my uni's servers do not provide Docker. Do you have any idea how I could get it working without Docker?

Thanks:D

ShangtongZhang commented 4 years ago

No..

spacegoing commented 4 years ago

Lol. No worries. I'll let you know once I figure something out.