DLR-RM / stable-baselines3

PyTorch version of Stable Baselines, reliable implementations of reinforcement learning algorithms.
https://stable-baselines3.readthedocs.io
MIT License

Custom reward wrapping in monitored environments #146

Closed (roclark closed this issue 3 years ago)

roclark commented 4 years ago

Describe the bug
I am working with the gym-super-mario-bros environments and created a custom reward wrapper to help the agent progress through the level while pursuing the objectives I desire. While using the included Monitor wrapper, I noticed that the rewards listed in ep_rew_mean do not reflect my custom reward wrapper. After stepping through the code, I found that the Monitor wrapper is applied before all of the other wrappers I provide, so it never sees my reward modifications. I was able to work around this by moving the call to the Monitor wrapper after the custom wrappers in the make_vec_env function (i.e. putting line 62 just before the return env line a few lines below).
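Concretely, the reordering I used amounts to applying Monitor after wrapper_class in the per-environment factory. A rough sketch of that ordering (not the exact library source; make_env and its arguments are illustrative stand-ins for what make_vec_env already has in scope):

import gym

from stable_baselines3.common.monitor import Monitor

def make_env(env_id, rank, seed=0, wrapper_class=None, monitor_path=None):
    def _init():
        env = gym.make(env_id)
        env.seed(seed + rank)
        if wrapper_class is not None:
            env = wrapper_class(env)                # custom wrappers first ...
        return Monitor(env, filename=monitor_path)  # ... Monitor last, so it logs the shaped reward
    return _init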

I also tried creating the gym environment manually and wrapping it with my custom reward wrapper before passing it to make_vec_env; although the proper rewards are then displayed in the Monitor results, the model doesn't appear to be training and stays stuck in what look like random states.

Code example
Here is an example of an application I wrote which is able to solve the Mario levels (note: requires installing gym_super_mario_bros from PyPI). Without making the change to the make_vec_env function, the incorrect rewards are displayed in the Monitor output, but the model still trains successfully.

import gym_super_mario_bros
from gym import Wrapper
from gym_super_mario_bros.actions import SIMPLE_MOVEMENT
from nes_py.wrappers import JoypadSpace
from stable_baselines3 import A2C
from stable_baselines3.common.atari_wrappers import AtariWrapper
from stable_baselines3.common.sb2_compat.rmsprop_tf_like import RMSpropTFLike
from stable_baselines3.common.vec_env import VecFrameStack, VecTransposeImage
from stable_baselines3.common.cmd_util import make_vec_env

class CustomReward(Wrapper):
    def __init__(self, env):
        super(CustomReward, self).__init__(env)
        self._current_score = 0

    def step(self, action):
        state, reward, done, info = self.env.step(action)
        # Shape the reward with the change in the in-game score since the last step.
        reward += (info['score'] - self._current_score) / 40.0
        self._current_score = info['score']
        if done:
            if info['flag_get']:
                print('We got it!!!!!')
                reward += 350.0
            else:
                reward -= 50.0
        return state, reward / 10.0, done, info

    def reset(self):
        """Reset the environment and return the initial observation."""
        # Also reset the score tracker so the shaping doesn't leak across episodes.
        self._current_score = 0
        return self.env.reset()

def mario_wrapper(env):
    env = JoypadSpace(env, SIMPLE_MOVEMENT)  # limit to a small discrete set of button combos
    env = AtariWrapper(env, terminal_on_life_loss=False, clip_reward=False)  # frame skip + 84x84 grayscale, no reward clipping
    env = CustomReward(env)
    return env

env = make_vec_env('SuperMarioBros-1-4-v0', n_envs=16, seed=3994448089, wrapper_class=mario_wrapper)
env = VecFrameStack(env, n_stack=4)  # stack the last 4 frames
env = VecTransposeImage(env)  # HxWxC -> CxHxW for the CNN policy

model = A2C('CnnPolicy', env, verbose=1, vf_coef=0.25, ent_coef=0.01, policy_kwargs={'optimizer_class': RMSpropTFLike}, tensorboard_log='./mario')
model.learn(total_timesteps=2000000)

System Info
Describe the characteristic of your environment:

Additional context
Perhaps I am going about this the wrong way, but I was wondering if there is a reason that the Monitor wrapper in make_vec_env is before the other wrappers? I'm sure there is a perfectly valid reason, but as it stands I am unable to get the proper rewards I expect into the logs.

If it's easier, here is a diff I created in my fork of the project.

BTW, love this repository! I've been hoping for something like this for a long time, and I enjoy that it's using PyTorch! Thanks for the great work! 😄

Miffyli commented 4 years ago

BTW, love this repository!

Cheers :). Comments like this help us keep working on these things in our free time.

Perhaps I am going about this the wrong way, but I was wondering if there is a reason that the Monitor wrapper in make_vec_env is before the other wrappers?

I understand this is the main question you wanted to raise with this issue? I believe Monitor is the under-most (the first) wrapper so that it captures the original number of steps taken and reward gained as seen from the point of view of the environment, rather than some warped result (e.g. a fixed frameskip reducing the number of steps or, as here, reward shaping changing the episodic reward). This way Monitor provides true measurements of how well the agent is doing on the original task and how many steps it takes to learn it. I see that tracking other stats can be useful, as pointed out here, but for that you can change the order in which Monitor is included. There is also the info_keywords argument to Monitor, which tells it which items from the info dictionary should be stored in the csv file at the end of each episode.
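For example, a minimal sketch of using info_keywords with a single Mario environment (assuming 'score' and 'flag_get' are keys that gym_super_mario_bros puts in the info dict, which is what the wrapper above already relies on):

import os

import gym_super_mario_bros
from gym_super_mario_bros.actions import SIMPLE_MOVEMENT
from nes_py.wrappers import JoypadSpace
from stable_baselines3.common.monitor import Monitor

os.makedirs('./logs', exist_ok=True)
env = gym_super_mario_bros.make('SuperMarioBros-1-4-v0')
env = JoypadSpace(env, SIMPLE_MOVEMENT)
# At the end of each episode, store info['score'] and info['flag_get'] in the
# Monitor csv next to the usual episode reward and length.
env = Monitor(env, filename='./logs/mario', info_keywords=('score', 'flag_get'))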

roclark commented 4 years ago

Thanks for the quick response! That's good to know about the ordering of the Monitor wrapper and the rationale behind it. Changing the order in which Monitor is included on my end is probably what I want for my specific use-case, and I completely understand the current structure.

Is there a built-in way to use make_vec_env (or similar helper functions) while changing the order in which the wrappers (Monitor in particular) are applied? I suppose I could just replicate the functionality of make_vec_env, take only what I need, and call things in the order I want (roughly like the sketch below). That would be simple enough, but ideally I'd like to use as much built-in functionality from the library as possible to minimize application code on my end; it's not a horrible problem if that turns out to be necessary.
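For reference, a minimal sketch of replicating that by hand, without patching the library (it reuses the mario_wrapper and CustomReward definitions from the code above; DummyVecEnv just mirrors make_vec_env's default):

import gym_super_mario_bros
from stable_baselines3.common.monitor import Monitor
from stable_baselines3.common.vec_env import DummyVecEnv, VecFrameStack, VecTransposeImage

def make_mario_env(rank, seed=0):
    def _init():
        env = gym_super_mario_bros.make('SuperMarioBros-1-4-v0')
        env.seed(seed + rank)
        env = mario_wrapper(env)  # JoypadSpace + AtariWrapper + CustomReward, as defined above
        return Monitor(env)       # Monitor on top, so ep_rew_mean reflects the shaped reward
    return _init

env = DummyVecEnv([make_mario_env(i, seed=3994448089) for i in range(16)])
env = VecFrameStack(env, n_stack=4)
env = VecTransposeImage(env)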

Thanks again!

araffin commented 4 years ago

Perhaps I am going about this the wrong way, but I was wondering if there is a reason that the Monitor wrapper in make_vec_env is before the other wrappers?

The main reason is that you are usually interested in the original reward, which has a meaning (e.g. for Atari games), and don't want, for instance, the clipped/normalized reward to appear in the log.

However, you can use the wrapper_class argument of make_vec_env to wrap the environment with a second Monitor and therefore get access to the modified reward. (I'm doing that here, for instance: https://github.com/DLR-RM/rl-baselines3-zoo/blob/feat/crr/hyperparams/sac.yml#L437)
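A sketch of what that could look like for the Mario setup above (reusing the CustomReward wrapper from the issue; only the extra Monitor line is new):

from stable_baselines3.common.monitor import Monitor

def mario_wrapper(env):
    env = JoypadSpace(env, SIMPLE_MOVEMENT)
    env = AtariWrapper(env, terminal_on_life_loss=False, clip_reward=False)
    env = CustomReward(env)
    # Second Monitor on top of the custom wrappers: its episode statistics see
    # the shaped reward, while the Monitor that make_vec_env adds underneath
    # keeps recording the original, unshaped reward.
    return Monitor(env)

env = make_vec_env('SuperMarioBros-1-4-v0', n_envs=16, seed=3994448089, wrapper_class=mario_wrapper)

With two Monitors stacked like this, the episode statistics that reach the logger (ep_rew_mean) should come from the outermost one, i.e. the shaped reward.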