araffin / rl-baselines-zoo

A collection of 100+ pre-trained RL agents using Stable Baselines, training and hyperparameter optimization included.
https://stable-baselines.readthedocs.io/
MIT License

[Question] When should TimeFeatureWrapper be used? #79

Closed · PierreExeter closed this issue 4 years ago

PierreExeter commented 4 years ago

Hello,

In the tuned hyperparameter yml files, I noticed that some environments are wrapped with TimeFeatureWrapper. This is the case for most environments trained with TD3 (and TRPO) but not for the other algorithms. How do you decide when the environment should be wrapped in a TimeFeatureWrapper?

I understand from this paper that this wrapper is necessary for environments with a fixed number of time steps so that they respect the Markov property.

To give more context, I would like to compare the performance of TD3 and A2C on the same environment over an equal number of time steps per episode. If I train with TimeFeatureWrapper, the episode lengths are not guaranteed to be equal, so comparing the mean reward per episode doesn't make sense anymore. If I train without the wrapper, I may violate the Markov property.

Thanks, Pierre

araffin commented 4 years ago

How do you decide when the environment should be wrapped in a TimeFeatureWrapper? I understand from this paper that this wrapper is necessary for environments with a fixed number of time steps so that they respect the Markov property.

As a rule of thumb, use it on every environment with a fixed episode length. The size of the impact depends on the algorithm (and some hyperparameters in the zoo are not completely up to date, hence the inconsistency).
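For example, wrapping an environment by hand before training looks roughly like this (a sketch, not the zoo's exact training script; the env name and timestep budget are just placeholders):

```python
import gym
import pybullet_envs  # noqa: F401, registers the Bullet envs

from stable_baselines import TD3
from utils.wrappers import TimeFeatureWrapper  # wrapper from rl-baselines-zoo

# Append the remaining-time feature before handing the env to the algorithm
env = TimeFeatureWrapper(gym.make("ReacherBulletEnv-v0"))
model = TD3("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=100000)
```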

If I train with TimeFeatureWrapper, the episode lengths are not guaranteed to be equal, so comparing the mean reward per episode doesn't make sense anymore.

The wrapper just adds a feature; it should not change the environment. Also, a termination condition can be satisfied before the max episode length.
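To make that concrete, here is a stripped-down sketch of such a wrapper (simplified for illustration, not the actual implementation in utils/wrappers.py): it only appends the remaining-time fraction to the observation and forwards `done` from the wrapped env unchanged.

```python
import gym
import numpy as np


class TimeFeatureSketch(gym.Wrapper):
    """Illustrative only: append the remaining time to the observation."""

    def __init__(self, env, max_steps=1000):
        super(TimeFeatureSketch, self).__init__(env)
        self._max_steps = max_steps
        self._current_step = 0
        # Extend the observation space with one extra dimension in [0, 1]
        low, high = env.observation_space.low, env.observation_space.high
        self.observation_space = gym.spaces.Box(
            low=np.append(low, 0.0), high=np.append(high, 1.0),
            dtype=np.float32)

    def reset(self):
        self._current_step = 0
        return self._get_obs(self.env.reset())

    def step(self, action):
        self._current_step += 1
        obs, reward, done, info = self.env.step(action)
        # done comes from the wrapped env only; max_steps never ends the episode
        return self._get_obs(obs), reward, done, info

    def _get_obs(self, obs):
        time_left = 1.0 - (self._current_step / self._max_steps)
        return np.append(obs, time_left).astype(np.float32)
```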

PierreExeter commented 4 years ago

Thanks!

Ok I understand that with this wrapper, it's no longer possible to have a fixed episode length.

However, this poses a problem during evaluation, when computing the mean cumulative return over X episodes. The longer episodes will run for more time steps and will receive a higher reward than those terminated earlier (this is the case in ReacherBulletEnv-v0, where a positive reward is given at each time step). In the enjoy.py script, it is only possible to evaluate for a fixed number of time steps. With this setting, some evaluation episodes will receive a higher reward than others, depending on when they are terminated.

How can I ensure a fair evaluation when using TimeFeatureWrapper?

(Note: in my environment, the only termination condition is max_episode_steps=150.)
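One option I'm considering is to evaluate over a fixed number of complete episodes instead of a fixed number of time steps, roughly like this (a sketch, not the zoo's enjoy.py; `model` and `env` are the trained agent and the wrapped environment):

```python
import numpy as np


def evaluate(model, env, n_episodes=100):
    """Mean and std of the episodic return over complete episodes."""
    episode_returns = []
    for _ in range(n_episodes):
        obs, done, episode_return = env.reset(), False, 0.0
        while not done:
            action, _ = model.predict(obs, deterministic=True)
            obs, reward, done, _ = env.step(action)
            episode_return += reward
        episode_returns.append(episode_return)
    return np.mean(episode_returns), np.std(episode_returns)
```

That way every episode is counted in full, whether it terminates early or runs to the time limit.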

araffin commented 4 years ago

Ok I understand that with this wrapper, it's no longer possible to have a fixed episode length.

Why?

PierreExeter commented 4 years ago

Because the environment's termination condition (max_episode_steps=150) is overridden by max_steps=1000 from the TimeFeatureWrapper class. Should I force done=True after 150 steps when evaluating? Sorry if I'm missing the point.

araffin commented 4 years ago

Because the environment's termination condition (max_episode_steps=150) is overridden by max_steps=1000

It is not overridden; max_steps is only used to compute the feature, not to compute the termination: https://github.com/araffin/rl-baselines-zoo/blob/master/utils/wrappers.py#L71

and it uses the one defined by the env if present: https://github.com/araffin/rl-baselines-zoo/blob/master/utils/wrappers.py#L48
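In other words, the initialisation roughly does the following (a paraphrase of the linked lines, not the exact code):

```python
# Prefer the time limit declared by the env itself, if any; the value is
# only used to scale the time feature, never to end the episode.
if env.spec is not None and env.spec.max_episode_steps is not None:
    max_steps = env.spec.max_episode_steps
```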

PierreExeter commented 4 years ago

Apologies, I forgot to reply. The issue I had with the episode length was due to an implementation problem in my custom environment, not the TimeFeatureWrapper. Thanks for the clarification above about when to use the wrapper. I'm happy to have this issue closed.