Closed roclark closed 3 years ago
BTW, love this repository!
Cheers :). Comments like this help us keep working on these things in our free time.
> Perhaps I am going about this the wrong way, but I was wondering if there is a reason that the Monitor wrapper in make_vec_env is before the other wrappers?
I understood this is the main question / issue you wanted to raise here? I believe `Monitor` is the innermost (first) wrapper so that it captures the original number of steps taken and reward gained as seen from the point of view of the environment, rather than some warped result (e.g. a fixed frameskip reducing the number of steps or, as here, reward shaping changing the episodic reward). This way `Monitor` provides true measurements of how well the agent is doing in the original task and how many steps it takes to learn it. I see that tracking other stats can be useful, as pointed out here, but for that you can change the order in which `Monitor` is included. There is also the `info_keywords` argument to `Monitor`, which tells it which items from the `info` dictionary should be stored in the CSV file at the end of each episode.
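To illustrate why the innermost `Monitor` reports the "true" episodic reward, here is a minimal, stdlib-only sketch with toy stand-ins (`ToyEnv`, `ShapedReward`, `MiniMonitor` are invented for this example, not stable-baselines3 classes):

```python
class ToyEnv:
    """Toy environment: 3 steps per episode, raw reward 1.0 per step."""
    def __init__(self):
        self.t = 0

    def reset(self):
        self.t = 0
        return 0

    def step(self, action):
        self.t += 1
        done = self.t >= 3
        return 0, 1.0, done, {}


class ShapedReward:
    """Stand-in for a custom reward-shaping wrapper (doubles the reward)."""
    def __init__(self, env):
        self.env = env

    def reset(self):
        return self.env.reset()

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        return obs, reward * 2.0, done, info


class MiniMonitor:
    """Records the total reward of the last finished episode."""
    def __init__(self, env):
        self.env = env
        self.episode_return = None
        self._running = 0.0

    def reset(self):
        self._running = 0.0
        return self.env.reset()

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self._running += reward
        if done:
            self.episode_return = self._running
        return obs, reward, done, info


def run_episode(env):
    env.reset()
    done = False
    while not done:
        _, _, done, _ = env.step(0)


# Monitor innermost (the make_vec_env ordering): logs the raw return.
inner = MiniMonitor(ToyEnv())
run_episode(ShapedReward(inner))
print(inner.episode_return)  # 3.0 (raw reward)

# Monitor outermost: logs the shaped return instead.
outer = MiniMonitor(ShapedReward(ToyEnv()))
run_episode(outer)
print(outer.episode_return)  # 6.0 (shaped reward)
```

With the monitor on the inside, reward shaping cannot distort the logged episodic return; moving it outside makes it report the shaped values instead.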
Thanks for the quick response! That's good to know about the ordering of the `Monitor` and the rationale. Changing the order in which `Monitor` is included on my end is probably what I want for my specific use-case, but I completely understand the current structure.
Is there a built-in way to use `make_vec_env` (or similar helper functions) while changing the order in which the wrappers (`Monitor` in particular) are called? I suppose I could replicate the functionality of `make_vec_env`, take only what I need, and call things in the same order. That'd be simple enough, but ideally I'd like to use as much built-in functionality from the library as possible to minimize application code on my end. Not a horrible problem if necessary, though.
Thanks again!
> Perhaps I am going about this the wrong way, but I was wondering if there is a reason that the Monitor wrapper in make_vec_env is before the other wrappers?
The main reason is that you are usually interested in the original reward, which has a meaning (e.g. for Atari games), and you don't want, for instance, the clipped/normalized reward to appear in the log.
However, you can use the `wrapper_class` argument of `make_vec_env` to wrap the environment with a second `Monitor` and therefore have access to the modified reward.
(I'm doing that here: https://github.com/DLR-RM/rl-baselines3-zoo/blob/feat/crr/hyperparams/sac.yml#L437 for instance.)
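The suggestion above (keep the built-in inner `Monitor`, add a second one via `wrapper_class`) can be sketched with stdlib-only stand-ins; `toy_make_env`, `MiniMonitor`, and `Shaped` are invented names for the sketch, not stable-baselines3 API:

```python
class ToyEnv:
    """Toy environment: 3 steps per episode, raw reward 1.0 per step."""
    def __init__(self):
        self.t = 0

    def reset(self):
        self.t = 0
        return 0

    def step(self, action):
        self.t += 1
        return 0, 1.0, self.t >= 3, {}


class Shaped:
    """Stand-in reward-shaping wrapper (doubles the reward)."""
    def __init__(self, env):
        self.env = env

    def reset(self):
        return self.env.reset()

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        return obs, reward * 2.0, done, info


class MiniMonitor:
    """Records the total reward of the last finished episode."""
    def __init__(self, env):
        self.env = env
        self.episode_return = None
        self._sum = 0.0

    def reset(self):
        self._sum = 0.0
        return self.env.reset()

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self._sum += reward
        if done:
            self.episode_return = self._sum
        return obs, reward, done, info


def toy_make_env(env_fn, wrapper_class=None):
    """Mirrors make_vec_env's ordering: Monitor first, then wrapper_class."""
    env = MiniMonitor(env_fn())
    if wrapper_class is not None:
        env = wrapper_class(env)
    return env


# wrapper_class adds the shaping wrapper plus a *second* monitor on top.
env = toy_make_env(ToyEnv, wrapper_class=lambda e: MiniMonitor(Shaped(e)))
env.reset()
done = False
while not done:
    _, _, done, _ = env.step(0)

print(env.episode_return)          # outer monitor: shaped return, 6.0
print(env.env.env.episode_return)  # inner monitor: raw return, 3.0
```

Both returns are then available: the inner monitor keeps logging the original task reward, while the outer one reflects the shaping.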
Describe the bug
I am working with the gym-super-mario-bros environments and created a special reward wrapper to better help the agent progress through the level while pursuing the objectives I desire. While using the included `Monitor` wrapper, I noticed the rewards listed in `ep_rew_mean` are not being modified by my custom reward wrapper. After stepping through the code, I found that the `Monitor` wrapper is applied before all of the other wrappers I provide, so it does not see any modifications to the rewards. I was able to work around this by putting the call to the `Monitor` wrapper after the custom wrappers in the `make_vec_env` function (i.e. moving line 62 to just before the `return env` line a few lines below). I also tried creating the gym environment manually and wrapping it with my custom rewards before passing it to `make_vec_env`, but although the proper rewards are then displayed in the `Monitor` results, the model doesn't appear to be training and is stuck in random states.
Code example
Here is an example of an application I wrote which is able to solve the Mario levels (note: requires installing `gym_super_mario_bros` from PyPI). Without making the change to the `make_vec_env` function, the incorrect rewards will be displayed in the `Monitor` output, but the model will successfully train.
System Info
Describe the characteristic of your environment:
Additional context
Perhaps I am going about this the wrong way, but I was wondering if there is a reason that the `Monitor` wrapper in `make_vec_env` is before the other wrappers? I'm sure there is a perfectly valid reason, but I am unable to get the proper rewards I expect as implemented. If it's easier, here is a diff I created in my fork of the project.
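For context, a reward-shaping wrapper of the kind described above typically follows Gym's `RewardWrapper` pattern (intercept `step` and adjust the reward). Here is a hypothetical, stdlib-only sketch; `ToyMarioEnv`, `ProgressReward`, the `x_pos` bonus, and its 0.1 coefficient are all invented for illustration, not the actual wrapper from the report:

```python
class ToyMarioEnv:
    """Stand-in env whose info dict reports an x_pos, like Mario."""
    def __init__(self):
        self.x = 0

    def reset(self):
        self.x = 0
        return 0

    def step(self, action):
        self.x += 10
        done = self.x >= 30
        return 0, 0.0, done, {"x_pos": self.x}


class ProgressReward:
    """Adds a bonus proportional to horizontal progress since last step."""
    def __init__(self, env):
        self.env = env
        self._last_x = 0

    def reset(self):
        self._last_x = 0
        return self.env.reset()

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        reward += 0.1 * (info["x_pos"] - self._last_x)
        self._last_x = info["x_pos"]
        return obs, reward, done, info


env = ProgressReward(ToyMarioEnv())
env.reset()
total = 0.0
done = False
while not done:
    _, r, done, _ = env.step(0)
    total += r
print(total)  # 3.0: a 0.1 bonus per unit of progress over 30 units
```

Because `Monitor` sits inside such a wrapper in `make_vec_env`, the shaped bonus never reaches the logged `ep_rew_mean`, which is exactly the behavior described in this report.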
BTW, love this repository! I've been hoping for something like this for a long time, and I enjoy that it's using PyTorch! Thanks for the great work! 😄