araffin / rl-baselines-zoo

A collection of 100+ pre-trained RL agents using Stable Baselines, training and hyperparameter optimization included.
https://stable-baselines.readthedocs.io/
MIT License

Multithreaded training with SubprocVecEnv() not working #106

Closed Simon-Steinmann closed 2 years ago

Simon-Steinmann commented 3 years ago

If your issue is related to a custom gym environment, please check it first using check_env(env):

Returns no errors

Describe the bug
I have a custom environment that launches an instance of a robot simulator (Webots) and connects to it. This all works perfectly fine; however, each environment needs to run in its own process.

I am able to get it to run with: mpirun -n 4 python train.py --algo her --env ur10eFetchPushEnv-v0 --eval-freq -1. But as soon as I want an evaluation environment, it doesn't work, since the eval env is created in the same process as the training env.

Troubleshooting attempt 1

I tried circumventing this by changing train.py line 258 from:

env = DummyVecEnv([make_env(env_id, 0, args.seed, wrapper_class=env_wrapper, log_dir=log_dir, env_kwargs=env_kwargs)])

to

env = SubprocVecEnv([make_env(env_id, 0, args.seed, wrapper_class=env_wrapper, log_dir=log_dir,  env_kwargs=env_kwargs)])

However, this leads to the following error:

[simon-Legion-Y540:22230] *** Process received signal ***
[simon-Legion-Y540:22230] Signal: Segmentation fault (11)
[simon-Legion-Y540:22230] Signal code: Address not mapped (1)
[simon-Legion-Y540:22230] Failing at address: 0x28
[simon-Legion-Y540:22230] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x128a0)[0x7eff5e0e38a0]
[simon-Legion-Y540:22230] [ 1] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi/mca_pmix_pmix112.so(+0x31433)[0x7eff3b5da433]
[simon-Legion-Y540:22230] [ 2] /usr/lib/x86_64-linux-gnu/libopen-pal.so.20(opal_libevent2022_event_base_loop+0x7f1)[0x7eff5e355c31]
[simon-Legion-Y540:22230] [ 3] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi/mca_pmix_pmix112.so(+0x2efcd)[0x7eff3b5d7fcd]
[simon-Legion-Y540:22230] [ 4] /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db)[0x7eff5e0d86db]
[simon-Legion-Y540:22230] [ 5] /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f)[0x7eff5de01a3f]
[simon-Legion-Y540:22230] *** End of error message ***
Segmentation fault (core dumped)

Troubleshooting attempt 2

I tried following the Stable Baselines documentation on vectorized environments (vec_envs).

In my hyperparameters, I added the following to /hyperparams/her.yml:

ur10eFetchPushEnv-v0:
  #env_wrapper: utils.wrappers.DoneOnSuccessWrapper
  n_timesteps: !!float 3e6
  policy: 'MlpPolicy'
  model_class: 'sac'
  n_sampled_goal: 4
  goal_selection_strategy: 'future'
  buffer_size: 1000000
  ent_coef: 'auto'
  batch_size: 256
  gamma: 0.95
  learning_rate: !!float 1e-3
  learning_starts: 1000
  train_freq: 1
  n_envs: 3

I declared n_envs = 3 and started the training with the following command: python train.py --algo her --env ur10eFetchPushEnv-v0

This, again, resulted in errors from my environment, since each instance needs to run in its own process. So I tried replacing DummyVecEnv with SubprocVecEnv again in train.py. This launches the environments; however, I then get this error from train.py:

Traceback (most recent call last):
  File "train.py", line 296, in <module>
    env = create_env(n_envs)
  File "train.py", line 291, in create_env
    env = _UnvecWrapper(env)
  File "/home/simon/anaconda3/lib/python3.7/site-packages/stable_baselines/common/base_class.py", line 1070, in __init__
    assert venv.num_envs == 1, "Error: cannot unwrap a environment wrapper that has more than one environment."
AssertionError: Error: cannot unwrap a environment wrapper that has more than one environment.

Troubleshooting attempt 3

When I comment out those lines (train.py lines 286-290, shown below), it skips this error, but then it complains because it can't concatenate arrays of different sizes:

if args.algo == 'her':
    # Wrap the env if need to flatten the dict obs
    # if isinstance(env, VecEnv):
    #     env = _UnvecWrapper(env)
    env = HERGoalEnvWrapper(env)

Error:

/home/simon/anaconda3/lib/python3.7/site-packages/stable_baselines/common/callbacks.py:287: UserWarning: Training and eval env are not of the same type<stable_baselines.her.utils.HERGoalEnvWrapper object at 0x7f45399daf90> != <stable_baselines.common.vec_env.dummy_vec_env.DummyVecEnv object at 0x7f4538f60350>
  "{} != {}".format(self.training_env, self.eval_env))
Traceback (most recent call last):
  File "train.py", line 419, in <module>
    model.learn(n_timesteps, **kwargs)
  File "/home/simon/anaconda3/lib/python3.7/site-packages/stable_baselines/her/her.py", line 113, in learn
    replay_wrapper=self.replay_wrapper)
  File "/home/simon/anaconda3/lib/python3.7/site-packages/stable_baselines/sac/sac.py", line 378, in learn
    obs = self.env.reset()
  File "/home/simon/anaconda3/lib/python3.7/site-packages/stable_baselines/her/utils.py", line 94, in reset
    return self.convert_dict_to_obs(self.env.reset())
  File "/home/simon/anaconda3/lib/python3.7/site-packages/stable_baselines/her/utils.py", line 71, in convert_dict_to_obs
    return np.concatenate([obs_dict[key] for key in KEY_ORDER])
  File "<__array_function__ internals>", line 6, in concatenate
ValueError: all the input array dimensions for the concatenation axis must match exactly, but along dimension 1, the array at index 0 has size 6 and the array at index 1 has size 3

My guess is that the VecEnv returns batched observations (one row per env), which the wrapper's flattening can't concatenate as if they were single 1-D observations, so I somehow have to wrap each individual environment inside the VecEnv with a HERGoalEnvWrapper. But I have no idea how to do that. I absolutely love the structure of rl-baselines-zoo and would love to implement this. Could you point me in the right direction on how to tackle this issue? I think it's either a bug, a missing feature, or a documentation issue. I'll be happy to do the work and create PRs once a solution is found.
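To make the guess concrete, something like the following is what I have in mind (just a sketch; make_wrapped_env is a hypothetical helper, I'm assuming my env is registered with gym, and I don't know whether HER would actually accept a VecEnv built this way):

import gym
from stable_baselines.common.vec_env import SubprocVecEnv
from stable_baselines.her.utils import HERGoalEnvWrapper

def make_wrapped_env(rank, seed=0):
    def _init():
        env = gym.make('ur10eFetchPushEnv-v0')  # assumes the env is registered
        env.seed(seed + rank)
        # Flatten the dict observation inside the worker process
        return HERGoalEnvWrapper(env)
    return _init

if __name__ == '__main__':
    env = SubprocVecEnv([make_wrapped_env(i) for i in range(3)])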


SammyRamone commented 3 years ago

Hi, I have no experience with HER. Just coming here to say that I'm successfully using a custom webots env with PPO and DQN. I did the same changes as in your "Attempt 1" and then it worked. Did you try out PPO or something else to verify that it's a problem with HER? If you also get an error with PPO, it may be due to how you call webots. If that is the case, I can maybe help you.

Simon-Steinmann commented 3 years ago

Hey @SammyRamone, are you using multiple Webots instances and multi-process MPI training? I would love to take a look at your implementation if it is not proprietary. I'm trying to build a "webots-gym" workflow to easily create and combine webots environments with stable-baselines and rl-baselines-zoo. Any help would be appreciated.

Simon-Steinmann commented 3 years ago

Also @araffin, last night I saw in the stable-baselines documentation that multiprocessing for HER is only supported with DDPG. It ran (again, only without an eval environment), but every action was exactly the same: all environments were executing the same thing. Doing it with SAC, every agent seemed to act differently. Is there documentation on how "mpirun" works and how the training / learning works with multiple environments? And also, what is the difference between using "mpirun" and "n_envs"? This is not very clear from the documentation.

SammyRamone commented 3 years ago

@Simon-Steinmann I just realized that this is the repo for rl-baselines-zoo, while I'm using rl-baselines3-zoo, so my previous statement is not true for this version. I'm using multiple instances (up to 24 + 1 evaluation env on one machine), but with stable-baselines3, which does not use MPI as far as I know. So maybe that's the difference. You can try to use the newer version. It does not have HER right now, but there is already an open PR for it. The code of my env is currently not (yet) open source and very dirty, but the most important parts are these:

  1. I specified in the webots world that I use a supervisor node with an external controller.
  2. Start webots in the env init like this:
    # (needs "import os" and "import subprocess" at the top of the file)
    self.sim_proc = subprocess.Popen(["webots",
                                      "--minimize",
                                      "--batch",
                                      "--no-sandbox",
                                      path])
    os.environ["WEBOTS_PID"] = str(self.sim_proc.pid)

Where path is the path to your .wbt file.

  3. On some systems we had to add the following code to fix issues with the files that webots creates in /tmp:

    # (needs "import time"; sim_proc_pid is the PID of the webots process started in step 2)
    time.sleep(1)  # Wait for webots to start
    for folder in os.listdir('/tmp'):
        if folder.startswith(f'webots-{sim_proc_pid}-'):
            try:
                os.remove(f'/tmp/webots-{sim_proc_pid}')
            except FileNotFoundError:
                pass
            os.symlink(f'/tmp/{folder}', f'/tmp/webots-{sim_proc_pid}')
  4. Afterwards you should be able to use the supervisor functions of webots to implement your env.

Also beware of a current bug in webots that leads to camera sensors not getting images (while still showing them in the overlay) for some time after using the reset simulation method. It will be patched in the next version, but currently you have to use manual resetting as a workaround.
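Very roughly, what I mean by "use the supervisor functions" looks like the sketch below. The class and its internals are made up for illustration (not my actual code), and it assumes the Webots controller module is importable and WEBOTS_PID is already set as in step 2:

import gym
import numpy as np
from controller import Supervisor  # Webots Python API

class WebotsSupervisorEnv(gym.Env):
    def __init__(self):
        # The controller attaches to the webots instance given by WEBOTS_PID
        self.supervisor = Supervisor()
        self.timestep = int(self.supervisor.getBasicTimeStep())
        self.robot_node = self.supervisor.getSelf()

    def reset(self):
        # Manual reset workaround for the camera bug: restore poses yourself
        # instead of calling simulationReset()
        self.supervisor.simulationResetPhysics()
        self.supervisor.step(self.timestep)
        return self._get_obs()

    def step(self, action):
        # Apply the action to the robot here, then advance the simulation
        self.supervisor.step(self.timestep)
        obs = self._get_obs()
        reward, done = 0.0, False  # placeholder reward / termination
        return obs, reward, done, {}

    def _get_obs(self):
        # Placeholder observation: the robot's world position
        return np.asarray(self.robot_node.getPosition(), dtype=np.float32)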

I hope this helps, if you have further questions I'll try to answer them.

Simon-Steinmann commented 3 years ago

Thank you @SammyRamone, I did the automatic startup pretty much the same way. I also discovered the PID startup bug, and it is fixed now in the developer branch :). What I'm really curious about, though, is your process of parallel learning.

  1. Are you using one webots instance and spawning multiple robots in it, or are you creating a webots instance per environment / agent?
  2. What model / policy do you use? For more complex stuff like teaching a robotic arm to do things (pick & place etc.), I have only experience with HER as a suitable method.
  3. Are you directly using stable-baselines3, or do you use some custom setup?

Thank you again for your responses, looking forward to more of your insight :)

SammyRamone commented 3 years ago

Thank you @SammyRamone,

You're welcome :)

I did the automatic startup pretty much the same way. I also discovered the PID startup bug, and it is fixed now in the developer branch :).

Nice.

  1. Are you using one webots instance and spawning multiple robots in it, or are you creating a webots instance per environment / agent?

I have a gym environment with 1 agent (a humanoid robot) but also other "webots robots" which have their own controllers (e.g. a barrier which opens/closes by itself). Each environment has one webots instance. I'm using the SubprocVecEnv of stable-baselines3, so each webots instance is a separate process. It did not work with the DummyVecEnv, probably because webots does not like having multiple instances in one process.
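The vectorization itself is just the standard SubprocVecEnv pattern; stripped down, it looks roughly like this (the env id is made up, not my actual code):

import gym
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import SubprocVecEnv

def make_env(rank, seed=0):
    def _init():
        # Runs in its own subprocess, so each env can start its own
        # webots instance in its __init__
        env = gym.make('MyWebotsEnv-v0')  # hypothetical registered env id
        env.seed(seed + rank)
        return env
    return _init

if __name__ == '__main__':
    env = SubprocVecEnv([make_env(i) for i in range(24)])
    model = PPO('CnnPolicy', env, verbose=1)  # camera image observations
    model.learn(total_timesteps=1_000_000)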

  2. What model / policy do you use? For more complex stuff like teaching a robotic arm to do things (pick & place etc.), I have only experience with HER as a suitable method.

I used Cnn and Mlp policies, the "vanilla" ones that come with stable-baselines3. The agent should learn to find a path through a parkour for a simulation competition (image). It gets the camera image and outputs the walking commands (which are then translated by a walk engine to joint goals). We first tried PPO (to get continuous commands), but it gets "afraid" of falling from the parkour and does not learn well. Afterwards we used discrete walking commands together with DQN and it worked very well.
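As a tiny sketch of what I mean by discrete walking commands (the values are purely illustrative, not the ones we actually use):

from gym import spaces

# Each discrete action maps to a fixed (vx, vy, vtheta) walking command,
# which the walk engine then translates into joint goals
WALK_COMMANDS = [
    (0.10, 0.0, 0.0),   # walk forward
    (0.05, 0.0, 0.5),   # curve left
    (0.05, 0.0, -0.5),  # curve right
    (0.00, 0.0, 0.0),   # stand still
]

action_space = spaces.Discrete(len(WALK_COMMANDS))

def action_to_command(action_index):
    # Translate a DQN action index into a walking command
    return WALK_COMMANDS[action_index]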

  3. Are you directly using stable-baselines3, or do you use some custom setup?

We use stable-baselines3 + the zoo training script. I just added a custom callback for additional tensorflow debug output, changed the DummyVecEnv to SubprocVecEnv, and did some hacks for the Python interface of our walk engine (some strange problem with the boost python wrapper).

I hope this answers your questions. If not, feel free to ask more :)

Simon-Steinmann commented 3 years ago

Thanks again for the reply. A few more questions if you don't mind :)

  1. I ran into the problem that each Webots instance requires 1.1 GB RAM and 800 MB VRAM, which very quickly overwhelms my system. Is this the same for you?
  2. How many parallel environments can you run?
  3. Do you use a test / eval environment? Due to the issues described in question 1, I struggle with the resource hogging.
  4. How is one step() set up? Specifically: what is the timestep of the simulation? How many simulation steps do you do per environment step? What data do you include in the observation? What sort of reward function do you use? And lastly, how fast does the whole thing run, as in fps / training steps per second?

I had great success using SAC + HER to train a ur10e robotic arm on the 'fetch-reach' problem. However, more complicated tasks, such as fetch-push or fetch-pick&place, are still elusive. I'm still trying to find the best general approach, especially when it comes to parallel learning. I'm wondering if it may be better to have a multitude of 'environments' in the same webots world. In your example, that would be several parkour setups spaced apart. Webots does have multithreading. Usually it performs worse, as the simulation is simple, but it could be of benefit in such a scenario. Also, it would use MUCH fewer resources, potentially allowing for many more workers. Would love to hear your thoughts.

SammyRamone commented 3 years ago

  1. I ran into the problem that each Webots instance requires 1.1 GB RAM and 800 MB VRAM, which very quickly overwhelms my system. Is this the same for you?

I have 64 GB RAM and managed to get 24 instances running at the same time, so that sounds about right. It is not surprising to me; other simulators that I use (PyBullet, Gazebo) take similar amounts of RAM if I remember correctly. Webots forces you to start with a GUI (or at least I did not find a way to do it without), which obviously creates additional overhead.

  2. How many parallel environments can you run?

24 envs with PPO on 64 GB RAM (+45 GB swap), a 16-core (32-thread) Threadripper CPU and 2 Nvidia RTX 2080s.

  3. Do you use a test / eval environment? Due to the issues described in question 1, I struggle with the resource hogging.

Yes, I do. So it's actually 24 + 1 envs.

  4. How is one step() set up? Specifically: what is the timestep of the simulation? How many simulation steps do you do per environment step? What data do you include in the observation? What sort of reward function do you use? And lastly, how fast does the whole thing run, as in fps / training steps per second?

The step function is quite typical: get the image from the camera, get the action (walking commands) from the policy using the image, let the robot perform a step (which includes multiple steps of the simulator), and compute the reward. The timestep is 32 steps per second. One walking step takes around 1 second, so I do 32 simulation steps for one learning step. It does not make sense in my context to change walking commands before one walk step is finished. Currently, I'm testing using "teleportation" of the robot instead of walking to train the policy faster (as this only requires 1 sim step) and then doing transfer learning afterward. No results yet. The observation is a 160x120 RGB image. I also played around with providing a 2D pose of the robot and an Mlp policy. Both work (Mlp is obviously faster as the observation only has 3 dimensions). The reward function is the distance to the goal position scaled linearly to [0, 1], where 1 is being at the goal position. With actual walking and a single DQN env: 8 env steps per second (~8x32 sim steps). Teleportation with a single DQN env: 80 FPS. For multiple envs I don't have the data right now.
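In code, the reward is essentially this (a sketch of the idea, not the exact implementation):

import numpy as np

def parkour_reward(robot_xy, goal_xy, start_xy):
    # Distance to the goal scaled linearly to [0, 1]; 1 = at the goal position
    max_dist = max(np.linalg.norm(np.asarray(goal_xy) - np.asarray(start_xy)), 1e-6)
    dist = np.linalg.norm(np.asarray(goal_xy) - np.asarray(robot_xy))
    return float(np.clip(1.0 - dist / max_dist, 0.0, 1.0))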

I had great success using SAC + HER to train a ur10e robotic arm on the 'fetch-reach' problem. However, more complicated tasks, such as fetch-push or fetch-pick&place, are still elusive. I'm still trying to find the best general approach, especially when it comes to parallel learning. I'm wondering if it may be better to have a multitude of 'environments' in the same webots world. In your example, that would be several parkour setups spaced apart. Webots does have multithreading. Usually it performs worse, as the simulation is simple, but it could be of benefit in such a scenario. Also, it would use MUCH fewer resources, potentially allowing for many more workers. Would love to hear your thoughts.

I don't have much knowledge about robot arms. For learning to walk, PPO worked well on my humanoid robot (but I did that in PyBullet). I have seen multiple times that people placed multiple robots in one simulation. I guess it will save RAM. Personally, I did not do it, because it requires more work and you have to make sure that some robots do not come into the space of others. You also have more complicated env resets, since you can not just reset the whole simulation. I guess it really depends on the use case whether this makes sense or not.

SammyRamone commented 3 years ago

If RAM usage is your problem, you can also decrease the size of your replay buffer. Depending on how large your observations are, it can grow very large.
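For the HER + SAC setup from the yml above, that just means lowering buffer_size there, or equivalently in code (a sketch only; the value is illustrative and I'm assuming the custom env is registered with gym):

import gym
from stable_baselines import HER, SAC

env = gym.make('ur10eFetchPushEnv-v0')  # assumes the env is registered

# buffer_size is forwarded to SAC; dropping it from the 1000000 in the yml
# above to e.g. 100000 shrinks the replay buffer's RAM footprint roughly 10x
model = HER('MlpPolicy', env, model_class=SAC,
            n_sampled_goal=4, goal_selection_strategy='future',
            buffer_size=100000)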

Simon-Steinmann commented 3 years ago

Thanks for the information and tips :) I appreciate it!

araffin commented 2 years ago

Closing as this should be fixed in SB3: https://github.com/DLR-RM/rl-baselines3-zoo