Process parallel tf scalar summary hangs

danijar / dreamer

Dream to Control: Learning Behaviors by Latent Imagination

https://danijar.com/dreamer

MIT License


Process parallel tf scalar summary hangs #26

Closed: AlexanderKoch-Koch closed this issue 4 years ago

AlexanderKoch-Koch commented 4 years ago

When I use parallel='process', the tf.summary.scalar call in the summarize_episode function doesn't return. If I use 'thread' or 'none' the tf.summary.scalar function finishes normally. I have no idea why. Has anyone encountered this issue already and found a fix?

danijar commented 4 years ago

Hi, thanks for your message. Parallel data collection isn't supported. Feel free to debug it if you're interested, but it's not really needed, as Dreamer is data-efficient and the computational bottleneck is training the model.

AlexanderKoch-Koch commented 4 years ago

Unfortunately, my environment (RLBench with a vision sensor) is about 10x slower than MuJoCo envs, so Dreamer would spend the majority of the time on data collection. I wasn't really able to solve the problem because there seems to be a CUDA issue when using multiprocessing in Python. You could use 'spawn' as the multiprocessing start method. However, this doesn't work for us because we have to pass objects which are not picklable. In the end, I only set the training envs to use multiprocessing and deactivated TensorBoard logging from these envs. This increases the speed of data generation and I can still log from the test env.

danijar commented 4 years ago

Why do you need to pass objects that cannot be pickled? In case you were talking about the environment instance itself, it would be better to instantiate it directly in the separate process. There is also an implementation of an async wrapper in gym3 that might be helpful. Sorry for not being able to help more with this.
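A minimal sketch of that suggestion, assuming a picklable constructor is sent to the worker instead of the environment instance. `DummyEnv`, `worker`, and `start_worker` are hypothetical names used only for illustration and are not part of the Dreamer code base; the 'spawn' context is used because the child then starts from a fresh interpreter.

```python
import multiprocessing as mp
from functools import partial


class DummyEnv:
    """Hypothetical stand-in for a slow environment such as RLBench."""

    def __init__(self, task):
        self.task = task
        self.steps = 0

    def step(self, action):
        self.steps += 1
        return {"task": self.task, "action": action, "steps": self.steps}


def worker(conn, make_env):
    """Build the environment inside the child, then serve requests."""
    env = make_env()  # constructed here, so the instance is never pickled
    while True:
        command, payload = conn.recv()
        if command == "step":
            conn.send(env.step(payload))
        elif command == "close":
            conn.close()
            break


def start_worker(make_env):
    """Start a spawned process and return (process, parent connection)."""
    ctx = mp.get_context("spawn")  # fresh interpreter, nothing inherited via fork
    parent, child = ctx.Pipe()
    process = ctx.Process(target=worker, args=(child, make_env), daemon=True)
    process.start()
    return process, parent
```

With 'spawn', `start_worker(partial(DummyEnv, "reach_target"))` should be called from under an `if __name__ == "__main__":` guard, and only the constructor and the step results cross the process boundary.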

AlexanderKoch-Koch commented 4 years ago

In the current implementation, it's trying to pass a lambda function which creates the environment. Lambda functions are usually not picklable. However, this issue can be solved by using the dill package. But I still have to pass the summary writer, which doesn't work. You could maybe create a new summary writer for each process.
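A rough sketch of both points, using only the standard library: a `functools.partial` over a module-level function can stand in for the lambda without needing dill, and each process can lazily create its own writer from a log directory string instead of receiving a writer object. `make_env`, `env_constructor`, `is_picklable`, and `get_writer` are hypothetical helpers, and `create_writer` is an assumed callable such as `tf.summary.create_file_writer` in TensorFlow 2.

```python
import os
import pickle
from functools import partial


def is_picklable(obj):
    """Return True if obj survives pickle.dumps, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except Exception:
        return False


def make_env(task):
    """Hypothetical module-level environment constructor."""
    return {"task": task}


def env_constructor(task):
    """Replacement for `lambda: make_env(task)`, built from a named function."""
    return partial(make_env, task)


_WRITERS = {}  # maps process ID to that process's writer


def get_writer(logdir, create_writer):
    """Return this process's writer, creating it on first use.

    create_writer is an assumption: any callable that takes a directory
    path, for example tf.summary.create_file_writer in TensorFlow 2.
    """
    pid = os.getpid()
    if pid not in _WRITERS:
        _WRITERS[pid] = create_writer(os.path.join(logdir, f"worker_{pid}"))
    return _WRITERS[pid]
```

Keying the cache by process ID means a forked child that inherits the parent's dictionary still creates its own writer, so only the `logdir` string has to be passed between processes.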

The async wrapper in gym3 uses only threads unfortunately. I have to use processes because of my environment.

And I have found out that it's only a problem on machines with a GPU. It works if I hide the GPU. Thank you for trying to help me.
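For reference, one common way to hide the GPU is the `CUDA_VISIBLE_DEVICES` environment variable. The sketch below assumes it runs before TensorFlow (or any other CUDA library) is imported in that process, for example at the top of the script or at the start of a spawned worker; `hide_gpu` is a hypothetical helper name.

```python
import os


def hide_gpu():
    """Hide all CUDA devices from libraries imported after this call."""
    # "-1" is not a valid device index, so no GPU is visible to CUDA.
    os.environ["CUDA_VISIBLE_DEVICES"] = "-1"


hide_gpu()
# Import TensorFlow only after this point so that it sees no GPU.
```

This only affects the process that sets the variable and any children started afterwards, so it likely has no effect in a process where CUDA has already been initialized.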