DLR-RM / stable-baselines3

PyTorch version of Stable Baselines, reliable implementations of reinforcement learning algorithms.
https://stable-baselines3.readthedocs.io
MIT License

[Question] Uncertainty about gymnasium vector environments #1787

Closed ger01d closed 7 months ago

ger01d commented 8 months ago

❓ Question

Since SB3 switched from gym to gymnasium I'm not able to reproduce my results. Maybe I have a major misunderstanding of how to correctly implement bootstrapping with PPO and vectorized environments.

Quick summary of my previous setup:

My custom gym environment is for a quadruped robot learning to walk forward in the PyBullet simulator. One episode has 500 steps, and the reward function depends linearly on the x-position of the robot.

In the first case, the episode ends when the maximum number of steps (500) is reached. Here bootstrapping should be done:

info["TimeLimit.truncated"] = True
done = True

In the other case, when the robot falls, the reward is 0 and no bootstrapping is done:

info["TimeLimit.truncated"] = False
done = True
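The two cases above can be sketched as a single helper (a hypothetical function for illustration, not code from the linked environment):

```python
def old_gym_episode_end(step_count, robot_fell, max_steps=500):
    """Return (done, info) following the pre-gymnasium convention described above."""
    info = {}
    done = False
    if robot_fell:
        # Terminal failure: the episode really ends, no bootstrapping.
        done = True
        info["TimeLimit.truncated"] = False
    elif step_count >= max_steps:
        # Time limit reached: bootstrap from the value of the last observation.
        done = True
        info["TimeLimit.truncated"] = True
    return done, info
```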

The new gymnasium environment API introduced new return variables: observation, reward, terminated, truncated, info. So, from how I understood the documentation, I assumed I have to change the above cases to:

Maximum steps reached:

terminated = False 
truncated = True
info["TimeLimit.truncated"] = truncated and not terminated 

Robot fell:

terminated = True 
truncated = False 
info["TimeLimit.truncated"] = truncated and not terminated
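My understanding of the gymnasium mapping, again as a hypothetical helper just to make the two cases explicit:

```python
def gymnasium_episode_end(step_count, robot_fell, max_steps=500):
    """Return (terminated, truncated, info) following the gymnasium convention."""
    # Failure genuinely ends the episode (terminal state, value 0).
    terminated = robot_fell
    # The time limit only truncates the episode (bootstrapping should happen).
    truncated = (not terminated) and step_count >= max_steps
    info = {"TimeLimit.truncated": truncated and not terminated}
    return terminated, truncated, info
```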

For proper training I have to use a vector environment. So when I train the agent I use:

from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env

env = make_vec_env(OpenCatGymEnv, n_envs=parallel_env)
model = PPO('MlpPolicy', env).learn(total_timesteps=10_000_000)

Any idea where the mistake could be?

My referenced gym environment can be found here: https://github.com/ger01d/opencat-gym/blob/main/opencat_gym_env.py

Thank you, ger01d


araffin commented 7 months ago

Hello,

The episode ends when the maximum number of steps (500) is reached. In this case bootstrapping should be done.

If you use the TimeLimit wrapper and the RL Zoo, the behavior should be the same between SB3 versions.

Out of curiosity, what SB3 version were you using? SB3 v1.8?

Relevant code is there: https://github.com/DLR-RM/stable-baselines3/blob/a9273f968eaf8c6e04302a07d803eebfca6e7e86/stable_baselines3/common/vec_env/dummy_vec_env.py#L63-L65

In the new gymnasium environment

With gymnasium, you only need to care about terminated/truncated; you don't have to set info["TimeLimit.truncated"] yourself, SB3 will do the conversion.
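In essence, the conversion collapses gymnasium's (terminated, truncated) pair back into the old (done, info) pair. A simplified sketch of that logic (not the actual SB3 source, see the linked dummy_vec_env code for the real implementation):

```python
def vecenv_done_conversion(terminated, truncated, info):
    """Sketch of how SB3's VecEnv layer maps gymnasium's episode-end signals
    back to the old gym-style (done, info) convention."""
    done = terminated or truncated
    info = dict(info)
    # Mark the step as truncated only if the time limit was the sole reason
    # the episode ended, so the algorithm knows to bootstrap.
    info["TimeLimit.truncated"] = truncated and not terminated
    return done, info
```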

In short, truncated=True when you want to bootstrap, terminated=True when the robot crashes or reaches a goal.

ger01d commented 7 months ago

Hello araffin,

thank you for your response. I didn't know that SB3 does that conversion automatically. The difference from the older version might also be because I did not seed properly.

Regarding your question: Currently I'm using SB3 2.2.1. Before I was using SB3 1.6.2.