Hi, I have no experience with HER. I'm just coming here to say that I'm successfully using a custom Webots env with PPO and DQN. I made the same changes as in your "Attempt 1" and then it worked. Did you try PPO or something else to verify that the problem is specific to HER? If you also get an error with PPO, it may be due to how you call Webots. If that's the case, I can maybe help you.
Hey @SammyRamone, are you using multiple Webots instances and multi-process MPI training? I would love to take a look at your implementation, if it is not proprietary. I'm trying to build a "webots-gym" workflow to easily create and combine Webots environments with stable-baselines and rl-baselines-zoo. Any help would be appreciated.
Also @araffin, last night I saw in the stable-baselines documentation that multiprocessing for HER is only supported with DDPG. It ran (again, only without an eval environment), but every action was exactly the same: all environments were executing the same thing. With SAC, every agent seemed to act differently. Is there documentation on how mpirun works and how training / learning works with multiple environments? And also, what is the difference between using mpirun and n_envs? This is not very clear from the documentation.
@Simon-Steinmann I just realized that this is the repository for rl-baselines-zoo, while I'm using rl-baselines3-zoo, so my previous statement is not true for this version. I'm using multiple instances (up to 24 + 1 evaluation env on one machine), but with stable-baselines3, which does not use MPI as far as I know. So maybe that's the difference. You can try the newer version. It does not have HER right now, but there is already an open PR for it. The code of my env is currently not (yet) open source and very dirty, but the most important parts are these:
import os
import subprocess

# Launch one Webots instance for this environment and remember its PID,
# so the controller can attach to exactly this instance.
self.sim_proc = subprocess.Popen(
    ["webots", "--minimize", "--batch", "--no-sandbox", path]
)
os.environ["WEBOTS_PID"] = str(self.sim_proc.pid)
where path is the path to your .wbt file.
On some systems we had to add the following code to fix issues with the files that Webots creates in /tmp:
import time

time.sleep(1)  # wait for Webots to start
sim_proc_pid = self.sim_proc.pid
# Webots creates a folder /tmp/webots-<pid>-<suffix>; create a symlink
# /tmp/webots-<pid> that points to it.
for folder in os.listdir('/tmp'):
    if folder.startswith(f'webots-{sim_proc_pid}-'):
        try:
            os.remove(f'/tmp/webots-{sim_proc_pid}')
        except FileNotFoundError:
            pass
        os.symlink(f'/tmp/{folder}', f'/tmp/webots-{sim_proc_pid}')
Afterwards you should be able to use the Supervisor functions of Webots to implement your env.
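For orientation, here is a rough sketch of what such a Supervisor-based gym env skeleton could look like. This is not SammyRamone's code; the DEF name "ROBOT", the spaces and the helper methods are placeholders you would replace with your own.

import gym
import numpy as np
from controller import Supervisor  # Webots Python API


class WebotsSupervisorEnv(gym.Env):
    """Skeleton env that drives one Webots instance through the Supervisor API."""

    def __init__(self):
        super().__init__()
        self.supervisor = Supervisor()
        self.timestep = int(self.supervisor.getBasicTimeStep())
        # "ROBOT" is a placeholder DEF name from the .wbt world file.
        self.robot_node = self.supervisor.getFromDef("ROBOT")
        self.action_space = gym.spaces.Discrete(4)                  # placeholder
        self.observation_space = gym.spaces.Box(                    # placeholder
            low=-np.inf, high=np.inf, shape=(3,), dtype=np.float32)

    def reset(self):
        self.supervisor.simulationReset()
        self.supervisor.step(self.timestep)  # let the reset take effect
        return self._get_obs()

    def step(self, action):
        self._apply_action(action)           # send motor / walking commands
        self.supervisor.step(self.timestep)  # advance the simulation
        obs = self._get_obs()
        reward, done = self._compute_reward(obs)
        return obs, reward, done, {}

    def _get_obs(self):
        # Example: read the robot's world position through the Supervisor.
        pos = self.robot_node.getField("translation").getSFVec3f()
        return np.array(pos, dtype=np.float32)

    def _apply_action(self, action):
        raise NotImplementedError

    def _compute_reward(self, obs):
        raise NotImplementedError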
Also beware of a current bug in Webots that leads to camera sensors not getting images (while still showing them in the overlay) for some time after using the reset-simulation method. It will be patched in the next version, but currently you have to reset manually as a workaround.
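If you hit that bug, a manual reset can be sketched roughly like this (again only a sketch; the start pose and the DEF-named node come from your own world):

def manual_reset(supervisor, robot_node, start_translation, start_rotation):
    # Workaround: restore the robot pose by writing the Supervisor fields
    # directly instead of calling simulationReset(), so the camera keeps
    # delivering images.
    robot_node.getField("translation").setSFVec3f(start_translation)
    robot_node.getField("rotation").setSFRotation(start_rotation)
    robot_node.resetPhysics()  # clears the velocities of this node
    # supervisor.simulationResetPhysics() would instead reset physics globally.
    supervisor.step(int(supervisor.getBasicTimeStep()))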
I hope this helps, if you have further questions I'll try to answer them.
Thank you @SammyRamone, The automatic startup I did pretty much the same way. I also discovered the PID startup bug and it is fixed now in the developer branch :). What I'm really curious about, though, is your process of parallel learning.
Thank you again for your responses, looking forward to more of your insight :)
Thank you @SammyRamone,
You're welcome :)
The automatic startup I did pretty much the same way. I also discovered the PID startup bug and it is fixed now in the developer branch :).
Nice.
- Are you using one Webots instance and spawning multiple robots in there, or are you creating a Webots instance per environment / agent?
I have a gym environment with one agent (a humanoid robot) but also other "Webots robots" which have their own controller (e.g. a barrier which opens and closes by itself). Each environment has one Webots instance. I'm using the SubprocVecEnv of stable-baselines3, so each Webots instance is a separate process. It did not work with the DummyVecEnv, probably because Webots does not like to have multiple instances in one process.
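For reference, a minimal sketch of that setup with stable-baselines3 (MyWebotsEnv is a placeholder for your own env class). Each factory callable runs in its own worker process, so every worker launches and owns its own Webots instance:

from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import SubprocVecEnv


def make_env(rank):
    # rank can be used e.g. for per-env seeding or port selection.
    def _init():
        # Runs inside the worker process created by SubprocVecEnv,
        # so the Webots instance started here belongs to that process.
        return MyWebotsEnv()  # placeholder for your custom gym env
    return _init


if __name__ == "__main__":
    vec_env = SubprocVecEnv([make_env(i) for i in range(24)])
    model = PPO("CnnPolicy", vec_env, verbose=1)
    model.learn(total_timesteps=1_000_000)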
- What model / policy do you use? For more complex tasks, like teaching a robotic arm to do things (pick & place etc.), I only have experience with HER as a suitable method.
I used CNN and MLP policies, the "vanilla" ones that come with stable-baselines3. The agent should learn to find a path through a parkour for a simulation competition (image). It gets the camera image and outputs walking commands (which are then translated by a walk engine into joint goals). We first tried PPO (to get continuous commands), but the agent gets "afraid" of falling from the parkour and does not learn well. Afterwards we used discrete walking commands together with DQN and it worked very well.
- Are you directly using stable-baselines3, or do you use some custom setup?
We use stable-baselines3 plus the zoo training script. We just added a custom callback for additional TensorBoard debug output, changed the DummyVecEnv to a SubprocVecEnv and did some hacks for the Python interface of our walk engine (some strange problem with the Boost.Python wrapper).
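Not their code, but a rough sketch of what such an extra-logging callback can look like in stable-baselines3 (the info key distance_to_goal is hypothetical):

from stable_baselines3.common.callbacks import BaseCallback


class ExtraLoggingCallback(BaseCallback):
    # Records an extra scalar to TensorBoard while training.

    def _on_step(self) -> bool:
        # self.locals holds the rollout variables, including the env infos.
        for info in self.locals.get("infos", []):
            if "distance_to_goal" in info:  # hypothetical key set by the env
                self.logger.record("custom/distance_to_goal",
                                   info["distance_to_goal"])
        return True  # returning False would abort training


# Used as: model.learn(total_timesteps=..., callback=ExtraLoggingCallback())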
I hope this answers your questions. If not, feel free to ask more :)
Thanks again for the reply. A few more questions if you don't mind :)
- I ran into the problem that each Webots instance requires 1.1 GB RAM and 800 MB VRAM, which very quickly overwhelms my system. Is this the same for you?
- How many parallel environments can you run?
- Do you use a test / eval environment? Due to the issues described in question 1, I struggle with the resource hogging.
- How is one step() set up? Specifically: What is the timestep of your simulation? How many simulation steps do you do per environment step? What data do you include in the observation? What sort of reward function do you use? And lastly, how fast does the whole thing run, as in FPS / training steps per second?
I had great success using SAC + HER to train a UR10e robotic arm on the 'fetch-reach' problem. However, more complicated tasks, such as fetch-push or fetch-pick&place, are still elusive. I'm still trying to find the best general approach, especially when it comes to parallel learning. I'm wondering if it may be better to have a multitude of 'environments' in the same Webots world; in your example, that would be several parkour setups spaced apart. Webots does have multithreading. Usually it performs worse, as the simulation is simple, but it could be of benefit in such a scenario. Also, it would use far fewer resources, potentially allowing for many more workers. Would love to hear your thoughts.
- I ran into the problem that each Webots instance requires 1.1 GB RAM and 800 MB VRAM, which very quickly overwhelms my system. Is this the same for you?
I have 64 GB RAM and managed to get 24 instances running at the same time, so that sounds about right. It is not surprising to me; other simulators that I use (PyBullet, Gazebo) take similar amounts of RAM if I remember correctly. Webots forces you to start with a GUI (or at least I did not find a way to do it without), which obviously creates additional overhead.
- How many parallel environments can you run?
24 envs with PPO on 64 GB RAM (+45 GB swap), a 16-core (32-thread) Threadripper CPU and two Nvidia RTX 2080 GPUs.
- Do you use a test / eval environment? Due to the issues described in question 1, I struggle with the resource hogging.
Yes, I do. So it's actually 24 + 1 envs.
- How is one step() set up? Specifically: What is the timestep of your simulation? How many simulation steps do you do per environment step? What data do you include in the observation? What sort of reward function do you use? And lastly, how fast does the whole thing run, as in FPS / training steps per second?
The step function is quite typical: get the image from the camera, get an action (walking commands) from the policy using the image, let the robot perform a step (which includes multiple steps of the simulator), compute the reward. The timestep is 32 steps per second. One walking step takes around 1 second, so I do 32 simulation steps for one learning step; it does not make sense in my context to change walking commands before one walk step is finished. Currently I'm testing "teleportation" of the robot instead of walking to train the policy faster (as this only requires 1 sim step) and then do transfer learning afterwards. No results yet. The observation is a 160x120 RGB image. I also played around with providing a 2D pose of the robot and an MLP policy. Both work (MLP is obviously faster, as the observation only has 3 dimensions). The reward function is the distance to the goal position, scaled linearly to [0, 1], where 1 is being at the goal position. With actual walking and a single DQN env: 8 env steps per second (~8x32 sim steps). Teleportation with a single DQN env: 80 FPS. For multiple envs I don't have the data right now.
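Pieced together from that description, the step() of such an env might look roughly like the following. The camera and walk-engine handles, the command table and the distance scaling are placeholders, not SammyRamone's actual implementation:

import numpy as np

def step(self, action):
    # Map the discrete action to a walking command (placeholder table).
    self.walk_engine.set_command(self.DISCRETE_COMMANDS[action])

    # One walking step takes about 1 s, i.e. roughly 32 simulation steps.
    for _ in range(32):
        if self.supervisor.step(self.timestep) == -1:
            break

    # Observation: the 160x120 camera image (Webots delivers BGRA bytes;
    # drop the alpha channel). self.camera must be enabled beforehand.
    raw = self.camera.getImage()
    obs = np.frombuffer(raw, dtype=np.uint8).reshape(
        (self.camera.getHeight(), self.camera.getWidth(), 4))[:, :, :3]

    # Reward: distance to the goal position, scaled linearly into [0, 1],
    # where 1 means the robot is at the goal.
    pos = np.array(self.robot_node.getField("translation").getSFVec3f())
    dist = np.linalg.norm(pos - self.goal_position)
    reward = 1.0 - min(dist / self.max_distance, 1.0)

    done = dist < self.goal_tolerance
    return obs, reward, done, {}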
I had great success using SAC + HER to train a UR10e robotic arm on the 'fetch-reach' problem. However, more complicated tasks, such as fetch-push or fetch-pick&place, are still elusive. I'm still trying to find the best general approach, especially when it comes to parallel learning. I'm wondering if it may be better to have a multitude of 'environments' in the same Webots world; in your example, that would be several parkour setups spaced apart. Webots does have multithreading. Usually it performs worse, as the simulation is simple, but it could be of benefit in such a scenario. Also, it would use far fewer resources, potentially allowing for many more workers. Would love to hear your thoughts.
I don't have much experience with robot arms. For learning to walk, PPO worked well on my humanoid robot (but I did that in PyBullet). I have seen multiple times that people place multiple robots in one simulation. I guess it will save RAM. Personally, I did not do it, because it requires more work and you have to make sure that robots do not wander into each other's space. You also have more complicated env resets, since you cannot just reset the whole simulation. I guess it really depends on the use case whether this makes sense or not.
If RAM usage is your problem, you can also decrease the size of your replay buffer. Depending on how large your observations are, it can grow very large.
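In stable-baselines3 that is the buffer_size argument of the off-policy algorithms. A quick sketch (the env class and the numbers are only rough illustrations):

from stable_baselines3 import DQN

env = MyWebotsEnv()  # placeholder for the custom Webots env
# A 160x120x3 uint8 image is ~58 kB and the buffer stores obs and next_obs,
# so the default 1_000_000 transitions would need on the order of 100 GB;
# 50_000 transitions keep it around 5-6 GB.
model = DQN(
    "CnnPolicy",
    env,
    buffer_size=50_000,          # default is 1_000_000
    optimize_memory_usage=True,  # stores each observation only once
    verbose=1,
)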
Thanks for the information and tips :) I appreciate it!
Closing as this should be fixed in SB3: https://github.com/DLR-RM/rl-baselines3-zoo
If your issue is related to a custom gym environment, please check it first using check_env(env):
Returns no errors
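The check in question, with a placeholder class name for the custom env:

from stable_baselines.common.env_checker import check_env

env = Ur10eFetchPushEnv()  # placeholder for the custom Webots goal env
check_env(env)             # completes without reporting any errors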
Describe the bug
I have a custom environment that launches an instance of a robot simulator (Webots) and connects to it. This all works perfectly fine; however, each environment needs to be in its own process.
I am able to get it to run with:
mpirun -n 4 python train.py --algo her --env ur10eFetchPushEnv-v0 --eval-freq -1
But as soon as I want an evaluation environment, it doesn't work, since it is created in the same process.
Troubleshooting attempt 1
I tried circumventing this by changing training.py line 258. However, this leads to the following error:
Troubleshooting attempt 2
I tried following the stable-baselines documentation on vectorized environments (vec_envs).
In my hyperparameters file (/hyperparams/her.yml) I declared n_envs = 3 and started the training with the following command:
python train.py --algo her --env ur10eFetchPushEnv-v0
This, again, resulted in errors stemming from my environment requiring to be run in separate processes. So I tried replacing DummyVecEnv with SubprocVecEnv again in training.py. This launches the environments, however I get this error from training.py:
Troubleshooting attempt 3
When I uncomment those lines, it skips this error, but then it complains because it cannot concatenate arrays of different sizes (training.py lines 286-290).
Error:
My guess is that I somehow have to wrap each individual environment in the vec_env with a HERGoalEnvWrapper, but I have no idea how to do that. I absolutely love the structure of rl-baselines-zoo and would love to implement it. Could you point me in the right direction on how to tackle this issue? I think it's either a bug, a missing feature or a documentation issue. I'll be happy to do the work and create PRs once a solution is found.
System Info
Describe the characteristic of your environment: