Closed juhannc closed 3 years ago
Hello,

> When using the `Monitor` class from `stable_baselines3.common.monitor` and wrapping the environment again into `gym.wrappers.TimeLimit`

Why would you do that and not the other way around (time limit first and monitor afterward)?
> Hello,
> When using the `Monitor` class from `stable_baselines3.common.monitor` and wrapping the environment again into `gym.wrappers.TimeLimit`
> why would you do that and not the other way around? (time limit first and monitor afterward)

I did it this way around because `make_vec_env` did it as well.
I first noticed this bug when working with vectorized environments, but tried to simplify it as much as possible.
Turns out, wrapping it first into `TimeLimit` and then into `Monitor` works.
Maybe this strategy should be adopted for `make_vec_env` as well, then?
> Turns out, wrapping it first into `TimeLimit` and then into `Monitor` works.

Yes, in fact the time limit is normally specified with the env definition; see how it is done in OpenAI Gym with the `max_episode_steps` parameter: https://github.com/openai/gym/blob/master/gym/envs/__init__.py#L56

> Maybe this strategy should be adapted for `make_vec_env` then as well?

You can pass a callable to `make_vec_env` (cf. doc), so it should already be possible to wrap your env first if needed: https://github.com/DLR-RM/stable-baselines3/blob/18f4e3ace084a2fd3e0a3126613718945cf3e5b5/stable_baselines3/common/env_util.py#L82
> yes, in fact, the timelimit is normally specified with the env definition, see how it is done in open ai gym with the `max_episode_steps` parameter: https://github.com/openai/gym/blob/master/gym/envs/__init__.py#L56

Thank you, but that way I cannot easily change the number of steps per episode.
That's why I wanted to use the `TimeLimit` wrapper manually instead.

> You can pass a callable to `make_vec_env` (cf doc) so you should be already possible to wrap first your env if needed:

I see, I think I'll do that for now. Thanks again!
🐛 Bug
When using the `Monitor` class from `stable_baselines3.common.monitor` and wrapping the environment again into `gym.wrappers.TimeLimit`, the `done` signal is not respected.

What I mean by that is that when I create an environment, wrap it into a `Monitor`, and afterwards into a `gym.wrappers.TimeLimit` to limit the time steps, `evaluate_policy` never returns. Instead, it runs the environment for the limited number of steps, then resets it, and finally starts all over again.

As far as I can tell, it happens as follows:
Once the maximum number of steps for `TimeLimit` is reached, it writes `not done` into the `info` dict: `info['TimeLimit.truncated'] = not done`. Note that `done` should always be `False` in L19, otherwise the environment would have exited before, making the value in the dict always `True`. However, it doesn't really matter for us what's inside the dict. Afterwards it sets `done` to `True`. Then `evaluate_policy` checks if the environment is `done`, which it is. Next, it checks if the environment is wrapped in a `Monitor`; again, that's true for us. Now the problem is that, due to an issue with Atari, the key `episode` has to be present in `info`. However, it is not; the key `TimeLimit.truncated` is present instead. As the `episode` key is missing, `evaluate_policy` skips this `done` signal. Thus, we finish the loop and, due to the `TimeLimit`, reset the environment and start over.

See:
https://github.com/openai/gym/blob/0.18.3/gym/wrappers/time_limit.py#L18-L20
and
https://github.com/DLR-RM/stable-baselines3/blob/b52c6fc18fa4b48a259c839e8159b6c9f826e8ad/stable_baselines3/common/evaluation.py#L100-L105
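The mechanism described above can be illustrated without installing anything. The sketch below uses small stand-in classes (`CountingEnv`, `MonitorLike`, `TimeLimitLike` are hypothetical names, not the real `Monitor` or `TimeLimit`) that only mimic the behaviour discussed here: the monitor stand-in adds an `episode` key when the env it wraps reports `done`, and the time-limit stand-in follows the linked gym 0.18 lines. The `episode` dict is simplified to two fields.

```python
class CountingEnv:
    """Toy environment (hypothetical) that never terminates on its own."""

    def reset(self):
        self.t = 0
        return self.t

    def step(self, action):
        self.t += 1
        return self.t, 1.0, False, {}  # obs, reward, done, info


class MonitorLike:
    """Stand-in for Monitor: adds info["episode"] when the wrapped env is done."""

    def __init__(self, env):
        self.env = env
        self.rewards = []

    def reset(self):
        self.rewards = []
        return self.env.reset()

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self.rewards.append(reward)
        if done:
            # Simplified episode summary: total reward and length only.
            info["episode"] = {"r": sum(self.rewards), "l": len(self.rewards)}
        return obs, reward, done, info


class TimeLimitLike:
    """Stand-in for gym 0.18 TimeLimit: forces done after max_episode_steps."""

    def __init__(self, env, max_episode_steps):
        self.env = env
        self.max_episode_steps = max_episode_steps
        self.elapsed = 0

    def reset(self):
        self.elapsed = 0
        return self.env.reset()

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self.elapsed += 1
        if self.elapsed >= self.max_episode_steps:
            info["TimeLimit.truncated"] = not done
            done = True
        return obs, reward, done, info


def last_info(env):
    """Step until done and return the info dict of the final step."""
    env.reset()
    done = False
    while not done:
        _, _, done, info = env.step(0)
    return info


# Order from this report: Monitor first, TimeLimit around it.
info_monitor_inside = last_info(TimeLimitLike(MonitorLike(CountingEnv()), 5))
# Order that works: TimeLimit first, Monitor around it.
info_monitor_outside = last_info(MonitorLike(TimeLimitLike(CountingEnv(), 5)))

print(sorted(info_monitor_inside))
print(sorted(info_monitor_outside))
```

With the monitor on the inside, the `done` set by the time limit is never seen by the monitor, so no `episode` key is written. With the monitor on the outside, both keys end up in the final `info`.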
To Reproduce
To reproduce the issue, run the code as follows. In this case it will not exit but instead stay in a loop, see above. To successfully run the code, remove the `env = Monitor(env)` line.
Expected behavior
Wrapping the monitored environment into a `TimeLimit` should exit after the maximum number of steps defined in said `TimeLimit`.

One solution would be for the dictionary check to also allow `TimeLimit.truncated` as a valid key.

System Info
Describe the characteristic of your environment:
Installation: `pip3 install stable-baselines3[extra]==1.1.0a11`
Python: 3.8.5
PyTorch: 1.8.1+cu102
Gym: 0.18.0
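The relaxed dictionary check suggested under "Expected behavior" could look roughly like the sketch below. `is_episode_end` is a hypothetical helper, not part of stable-baselines3, and the example `info` dicts are illustrative. Note that an episode detected only through `TimeLimit.truncated` carries no `Monitor` statistics, so the caller would still need its own reward and length counters in that case.

```python
def is_episode_end(info: dict) -> bool:
    """Hypothetical check: accept the Monitor key or the TimeLimit key."""
    return "episode" in info or "TimeLimit.truncated" in info


# Illustrative info dicts for three kinds of `done` signals.
monitor_end = {"episode": {"r": 5.0, "l": 5}}  # Monitor saw the episode end
time_limit_end = {"TimeLimit.truncated": True}  # only TimeLimit ended it
life_lost = {"lives": 2}  # Atari-style done that is not a true episode end

for name, info in [
    ("monitor_end", monitor_end),
    ("time_limit_end", time_limit_end),
    ("life_lost", life_lost),
]:
    print(name, is_episode_end(info))
```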