Proper way to parallelize envs

sparisi commented 3 years ago

My code does something like:

from torch import multiprocessing as mp
...
ctx = mp.get_context('fork')
...
env = iGibsonEnv(...) # To get observation and action spaces
...
for i in range(n):
    actor = ctx.Process(
        target=act,
        args=(...))
    actor.start()

Within act() I try to create a new environment, but it fails as reported here: https://github.com/StanfordVL/iGibson/issues/71#issue-839396005
Unlike what that user reported, putting the process to sleep for a few seconds did not fix the issue.

The only way to run my code is to create new environments outside act() and pass them as arguments:

for i in range(n):
    actor = ctx.Process(
        target=act,
        args=(iGibsonEnv(...), ...))
    actor.start()

However, the code is extremely slow.

Creating envs gets slower and slower as new envs are created.
Creating too many envs gives me a CUDA out-of-memory error.
Even if I create only a few envs (e.g., 5) and manage to run my code, everything is far too slow; using a single env is actually much faster.

I tried using ParallelNavEnv, but it creates the envs in other processes, which is not what I want.

What is the proper way to parallelize envs in this case? Thanks!

Env details: This with 64x64 image size
OS details: Ubuntu 20.04.1 LTS, CUDA Version: 11.0

mjlbach commented 3 years ago

There are now several examples of parallelizing with Python multiprocessing (stable-baselines3) and Ray:
https://github.com/StanfordVL/iGibson/blob/ig-develop/igibson/examples/demo/stable_baselines3_example.py
https://github.com/StanfordVL/iGibson/blob/ig-develop/igibson/envs/igibson_rllib_env.py
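
For readers who want the general shape without the full examples, here is a minimal sketch of one common pattern: each child process constructs its own env, and the parent only exchanges actions and observations over a pipe. It is not taken from the linked files. ToyEnv, actor and run_actors are hypothetical names, ToyEnv merely stands in for iGibsonEnv, and the choice of the 'spawn' start method (so that children do not inherit the parent's GPU state) is an assumption, not a confirmed fix for the failure described above. stable-baselines3's SubprocVecEnv is built along similar lines in that it takes a list of env-constructing callables rather than env instances.

```python
import multiprocessing as mp


class ToyEnv:
    """Hypothetical stand-in for iGibsonEnv: anything with reset() and step()."""

    def __init__(self, seed):
        self.state = seed

    def reset(self):
        return self.state

    def step(self, action):
        self.state += action
        return self.state


def actor(conn, seed):
    """Runs in the child process; the env is built here, not in the parent."""
    env = ToyEnv(seed)  # with iGibson this is where iGibsonEnv(...) would go
    conn.send(env.reset())
    while True:
        action = conn.recv()
        if action is None:  # sentinel from the parent: shut down
            break
        conn.send(env.step(action))
    conn.close()


def run_actors(n_actors, actions, start_method="spawn"):
    """Start n_actors children, feed every child the same actions, collect observations."""
    ctx = mp.get_context(start_method)
    pipes, procs = [], []
    for i in range(n_actors):
        parent_conn, child_conn = ctx.Pipe()
        proc = ctx.Process(target=actor, args=(child_conn, i))
        proc.start()
        child_conn.close()  # the parent keeps only its own end of the pipe
        pipes.append(parent_conn)
        procs.append(proc)

    # One list of observations per actor, starting with the reset observation.
    observations = [[conn.recv()] for conn in pipes]
    for action in actions:
        for conn in pipes:  # send to all actors first so they step concurrently
            conn.send(action)
        for obs, conn in zip(observations, pipes):
            obs.append(conn.recv())

    for conn in pipes:
        conn.send(None)
    for proc in procs:
        proc.join()
    return observations
```

With 'spawn', the call has to sit under an `if __name__ == "__main__":` guard in a script, e.g. `run_actors(2, [1, 1, 1])`. Since each child builds its own env, nothing heavy has to be pickled or inherited from the parent; the observation and action spaces can be sent back once over the pipe instead of creating an extra env in the main process.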

mjlbach commented 3 years ago

Closing because I believe this is addressed by the above examples.

sycz00 commented 2 years ago

@mjlbach Hey, is stable_baselines3_example.py supposed to work after 1M steps? I am currently trying to create some unit tests to verify that every library works. Such an example would be extremely helpful :) Thanks in advance.

mjlbach commented 2 years ago

Are you having an issue with it past 1 million steps?

sycz00 commented 2 years ago

No, I haven't finished training yet. I wanted to know whether it is a serious example of using SB3, since training for 1M steps takes some time on my resources.

mjlbach commented 2 years ago

stable_baselines3_example.py should converge; stable_baselines3_behavior_example.py will not.

sycz00 commented 2 years ago

Alright, thank you :) I am gonna try it out tonight and let you know if it works as expected :) Edit: do you have any graphs for the example? Not sure if this is correct. [Screenshot from 2021-11-26 10-05-02]

Based on 50 evaluation episodes, the agent achieves a ~50% success rate. Is this roughly the expected rate? Thanks again for your help!
