DLR-RM / stable-baselines3

PyTorch version of Stable Baselines, reliable implementations of reinforcement learning algorithms.
https://stable-baselines3.readthedocs.io
MIT License

[Bug] Monitor not compatible with gym.wrappers.TimeLimit #477

Closed juhannc closed 3 years ago

juhannc commented 3 years ago

Important Note: We do not do technical support, nor consulting and don't answer personal questions per email. Please post your question on the RL Discord, Reddit or Stack Overflow in that case.

If your issue is related to a custom gym environment, please use the custom gym env template.

🐛 Bug

When using the Monitor class from stable_baselines3.common.monitor and then wrapping the environment again into gym.wrappers.TimeLimit, the done signal is not respected.

What I mean by that is: when I create an environment, wrap it in a Monitor, and afterwards in a gym.wrappers.TimeLimit to limit the number of time steps, evaluate_policy never returns. Instead, it runs the environment for the limited number of steps, resets it, and starts all over again.

As far as I can tell, it happens as follows:

Once the maximum number of steps for TimeLimit is surpassed, it writes not done into the info dict: info['TimeLimit.truncated'] = not done. Note that done should always be False in L19, otherwise the environment would have exited before, so the value in the dict is always True. However, what is inside the dict doesn't really matter for us. Afterwards it sets done to True. Then evaluate_policy checks whether the environment is done, which it is. Next, it checks whether the environment is Monitor-wrapped, which is also true for us. Now the problem is that, due to a workaround for Atari, the key episode has to be present in info. However, it is not; instead only the key TimeLimit.truncated is. Since the episode key is missing, evaluate_policy skips this done signal. Thus we finish the loop, and because of the TimeLimit the environment is reset and we start over.

See:

https://github.com/openai/gym/blob/0.18.3/gym/wrappers/time_limit.py#L18-L20

and

https://github.com/DLR-RM/stable-baselines3/blob/b52c6fc18fa4b48a259c839e8159b6c9f826e8ad/stable_baselines3/common/evaluation.py#L100-L105
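
To make the interaction concrete, here is a minimal sketch (assuming gym 0.18.x with the old four-tuple step API, matching the versions linked above) of the info dict you end up with when the outer TimeLimit truncates the episode:

#!/usr/bin/env python3
import gym

from stable_baselines3.common.monitor import Monitor

# Same wrapper order as in the bug report: Monitor inside, TimeLimit outside.
env = Monitor(gym.make('MountainCar-v0'))
env = gym.wrappers.TimeLimit(env, max_episode_steps=10)

obs = env.reset()
done = False
while not done:
    obs, reward, done, info = env.step(env.action_space.sample())

# The outer TimeLimit set done=True and added 'TimeLimit.truncated' to info,
# but the inner Monitor never saw done=True, so it never adds the 'episode'
# key that evaluate_policy looks for -- the episode is silently skipped.
print(info)  # expected: {'TimeLimit.truncated': True}, no 'episode' entry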

To Reproduce

To reproduce the issue, run the code below. It will not exit but instead stay in a loop, as described above. To run the code successfully, remove the env = Monitor(env) line.

#!/usr/bin/env python3

import gym

from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.monitor import Monitor

# Problematic wrapper order: Monitor on the inside, TimeLimit on the outside
env = gym.make('MountainCar-v0')
env = Monitor(env)
env = gym.wrappers.TimeLimit(env, max_episode_steps=10)

model = PPO(policy='MlpPolicy', env=env, verbose=1)

print("Start evaluating...")
mean_reward, std_reward = evaluate_policy(model=model,
                                          env=env,
                                          n_eval_episodes=1)
print("Evaluation ended")

model.learn(total_timesteps=100)

Expected behavior

Evaluating a monitored environment that is additionally wrapped in a TimeLimit should terminate after the maximum number of steps defined in that TimeLimit.

One solution would be for the dictionary check to also allow TimeLimit.truncated as a valid key.

araffin commented 3 years ago

Hello,

When using the Monitor class from stable_baselines3.common.monitor and wrapping the environment again into gym.wrappers.TimeLimit

why would you do that and not the other way around? (time limit first and monitor afterward)

juhannc commented 3 years ago

Hello,

When using the Monitor class from stable_baselines3.common.monitor and wrapping the environment again into gym.wrappers.TimeLimit

why would you do that and not the other way around? (time limit first and monitor afterward)

I did it this way around because make_vec_env does it that way as well. I first noticed this bug when working with vectorized environments but tried to simplify the example as much as possible.

Turns out, wrapping it first into TimeLimit and then into Monitor works. Maybe this strategy should be adapted for make_vec_env then as well?
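
For reference, a minimal sketch of the working order (same version assumptions as the reproduction script above):

import gym

from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.monitor import Monitor

env = gym.make('MountainCar-v0')
# TimeLimit first, Monitor last: Monitor now sees the done set by TimeLimit
# and records the 'episode' statistics that evaluate_policy relies on.
env = gym.wrappers.TimeLimit(env, max_episode_steps=10)
env = Monitor(env)

model = PPO(policy='MlpPolicy', env=env, verbose=1)
mean_reward, std_reward = evaluate_policy(model=model, env=env, n_eval_episodes=1)
print(mean_reward, std_reward)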

araffin commented 3 years ago

Turns out, wrapping it first into TimeLimit and then into Monitor works.

yes, in fact, the time limit is normally specified with the env definition; see how it is done in OpenAI Gym with the max_episode_steps parameter: https://github.com/openai/gym/blob/master/gym/envs/__init__.py#L56
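
For illustration, a minimal sketch of registering a variant with a custom limit yourself (the env id here is hypothetical; the entry point is the standard MountainCar one):

import gym
from gym.envs.registration import register

# Register a variant with a custom step limit; gym.make then applies
# the TimeLimit wrapper for you.
register(
    id='MountainCarShort-v0',
    entry_point='gym.envs.classic_control:MountainCarEnv',
    max_episode_steps=10,
)
env = gym.make('MountainCarShort-v0')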

Maybe this strategy should be adapted for make_vec_env then as well?

You can pass a callable to make_vec_env (cf. doc), so it should already be possible to wrap your env first if needed: https://github.com/DLR-RM/stable-baselines3/blob/18f4e3ace084a2fd3e0a3126613718945cf3e5b5/stable_baselines3/common/env_util.py#L82
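
e.g. something along these lines (the helper name is just an example):

import gym

from stable_baselines3.common.env_util import make_vec_env

def make_limited_env():
    # Apply TimeLimit here; make_vec_env wraps the returned env in Monitor.
    env = gym.make('MountainCar-v0')
    return gym.wrappers.TimeLimit(env, max_episode_steps=10)

vec_env = make_vec_env(make_limited_env, n_envs=4)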

juhannc commented 3 years ago

yes, in fact, the time limit is normally specified with the env definition; see how it is done in OpenAI Gym with the max_episode_steps parameter: https://github.com/openai/gym/blob/master/gym/envs/__init__.py#L56

Thank you, but that way I cannot easily change the number of steps per episode. That's why I wanted to use the TimeLimit wrapper manually instead.

You can pass a callable to make_vec_env (cf. doc), so it should already be possible to wrap your env first if needed:

https://github.com/DLR-RM/stable-baselines3/blob/18f4e3ace084a2fd3e0a3126613718945cf3e5b5/stable_baselines3/common/env_util.py#L82

I see, I think I'll do that for now. Thanks again!