DLR-RM / rl-baselines3-zoo

A training framework for Stable Baselines3 reinforcement learning agents, with hyperparameter optimization and pre-trained agents included.
https://rl-baselines3-zoo.readthedocs.io
MIT License

[Question] Finding a way to train from environments wrapped with the atari feature extractor #334

Closed liruiluo closed 1 year ago

liruiluo commented 1 year ago

❓ Question

Some recent work in reinforcement learning shows that decoupling feature extraction from policy learning can improve data efficiency, i.e. downstream tasks can be trained directly on top of a pre-trained feature extractor. Inspired by this work, I tried to take the feature extractor of an SB3 agent trained on an Atari game (Pong) and apply it directly to downstream policy learning.

I tried wrapping the Atari environment with model.policy.features_extractor and training an MLP-based agent on top of it (this feature extractor comes with the pre-trained model in rl-baselines3-zoo and maps the high-dimensional four-frame stacked images to feature vectors), but I have been unsuccessful. Here is a log of my attempts:

First, I selected the Pong environment from Atari and defined an environment wrapper that extracts the stacked four-frame images into 512-dimensional features:

import gym
import numpy as np
import torch as th

from rl_zoo3.utils import ALGOS


class PongWrapperori(gym.ObservationWrapper):
    def __init__(self, env):
        super().__init__(env)
        # Pre-trained PPO agent whose feature extractor is re-used
        self.model = ALGOS["ppo"].load("/home/l/Downloads/rl-baselines3-zoo/rl-trained-agents/ppo/PongNoFrameskip-v4_1/PongNoFrameskip-v4.zip", device="cpu")
        self.observation_space = gym.spaces.Box(shape=(512,), low=-np.inf, high=np.inf)

    def observation(self, obs):
        # Extract a 512-dimensional feature vector from the stacked frames
        obs = self.model.policy.features_extractor(th.tensor(obs, dtype=th.float, requires_grad=False)).detach().numpy()
        return obs

Using the following code, the high-dimensional images can be successfully extracted into a feature array of shape (8, 512):

from stable_baselines3.common.env_util import make_atari_env
from stable_baselines3.common.vec_env import VecFrameStack, VecTransposeImage

env = make_atari_env("PongNoFrameskip-v4", n_envs=8, seed=0)
# Frame-stacking with 4 frames
env = VecFrameStack(env, n_stack=4)
env = VecTransposeImage(env)
env = PongWrapperori(env)
obs = env.reset()
print(obs)
print(obs.shape)  # (8, 512)


Then I added this wrapper to the env_wrapper list in the yaml file:

atari:
  env_wrapper:
    - stable_baselines3.common.atari_wrappers.AtariWrapper
    - stable_baselines3.common.vec_env.VecTransposeImage
    - rl_zoo3.wrappers.PongWrapperori

I ran the training script, and the first error occurred:

Default hyperparameters for environment (ones being tuned will be overridden):
OrderedDict([('batch_size', 256), ('clip_range', 'lin_0.1'), ('ent_coef', 0.01), ('env_wrapper', ['stable_baselines3.common.atari_wrappers.AtariWrapper', 'stable_baselines3.common.vec_env.VecTransposeImage', 'rl_zoo3.wrappers.PongWrapperori']), ('frame_stack', 4), ('learning_rate', 'lin_2.5e-4'), ('n_envs', 8), ('n_epochs', 4), ('n_steps', 128), ('n_timesteps', 20000000.0), ('policy', 'MlpPolicy'), ('vf_coef', 0.5)])
Using 8 environments
Creating test environment
A.L.E: Arcade Learning Environment (version 0.7.4+069f8bd) [Powered by Stella]
Traceback (most recent call last):
  File "train.py", line 4, in <module>
    train()
  File "/home/l/Downloads/rl-baselines3-zoo/rl_zoo3/train.py", line 259, in train
    results = exp_manager.setup_experiment()
  File "/home/l/Downloads/rl-baselines3-zoo/rl_zoo3/exp_manager.py", line 225, in setup_experiment
    self.create_callbacks()
  File "/home/l/Downloads/rl-baselines3-zoo/rl_zoo3/exp_manager.py", line 539, in create_callbacks
    self.create_envs(self.n_eval_envs, eval_env=True),
  File "/home/l/Downloads/rl-baselines3-zoo/rl_zoo3/exp_manager.py", line 649, in create_envs
    monitor_kwargs=self.monitor_kwargs,
  File "/home/l/miniconda3/envs/torchmydsoan/lib/python3.7/site-packages/stable_baselines3/common/env_util.py", line 108, in make_vec_env
    return vec_env_cls([make_env(i + start_index) for i in range(n_envs)], **vec_env_kwargs)
  File "/home/l/miniconda3/envs/torchmydsoan/lib/python3.7/site-packages/stable_baselines3/common/vec_env/dummy_vec_env.py", line 25, in __init__
    self.envs = [fn() for fn in env_fns]
  File "/home/l/miniconda3/envs/torchmydsoan/lib/python3.7/site-packages/stable_baselines3/common/vec_env/dummy_vec_env.py", line 25, in <listcomp>
    self.envs = [fn() for fn in env_fns]
  File "/home/l/miniconda3/envs/torchmydsoan/lib/python3.7/site-packages/stable_baselines3/common/env_util.py", line 98, in _init
    env = wrapper_class(env, **wrapper_kwargs)
  File "/home/l/Downloads/rl-baselines3-zoo/rl_zoo3/utils.py", line 113, in wrap_env
    env = wrapper_class(env, **kwargs)
  File "/home/l/miniconda3/envs/torchmydsoan/lib/python3.7/site-packages/stable_baselines3/common/vec_env/vec_transpose.py", line 42, in __init__
    super().__init__(venv, observation_space=observation_space)
  File "/home/l/miniconda3/envs/torchmydsoan/lib/python3.7/site-packages/stable_baselines3/common/vec_env/base_vec_env.py", line 257, in __init__
    num_envs=venv.num_envs,
  File "/home/l/miniconda3/envs/torchmydsoan/lib/python3.7/site-packages/gym/core.py", line 238, in __getattr__
    return getattr(self.env, name)
  File "/home/l/miniconda3/envs/torchmydsoan/lib/python3.7/site-packages/gym/core.py", line 238, in __getattr__
    return getattr(self.env, name)
  File "/home/l/miniconda3/envs/torchmydsoan/lib/python3.7/site-packages/gym/core.py", line 238, in __getattr__
    return getattr(self.env, name)
  [Previous line repeated 6 more times]
AttributeError: 'AtariEnv' object has no attribute 'num_envs'

I think the likely reason for this problem is that SB3 applies the env wrapper first and stacks the frames afterwards by default. So I changed the create_envs function in exp_manager to apply my wrapper after frame stacking, and removed the wrapper from the yaml file. Here is the function:

def create_envs(self, n_envs: int, eval_env: bool = False, no_log: bool = False) -> VecEnv:
    """
    Create the environment and wrap it if necessary.

    :param n_envs:
    :param eval_env: Whether is it an environment used for evaluation or not
    :param no_log: Do not log training when doing hyperparameter optim
        (issue with writing the same file)
    :return: the vectorized environment, with appropriate wrappers
    """
    # Do not log eval env (issue with writing the same file)
    log_dir = None if eval_env or no_log else self.save_path

    # Special case for GoalEnvs: log success rate too
    if (
        "Neck" in self.env_name.gym_id
        or self.is_robotics_env(self.env_name.gym_id)
        or "parking-v0" in self.env_name.gym_id
        and len(self.monitor_kwargs) == 0  # do not overwrite custom kwargs
    ):
        self.monitor_kwargs = dict(info_keywords=("is_success",))

    # Define make_env here so it works with subprocesses
    # when the registry was modified with `--gym-packages`
    # See https://github.com/HumanCompatibleAI/imitation/pull/160
    spec = gym.spec(self.env_name.gym_id)

    def make_env(**kwargs) -> gym.Env:
        env = spec.make(**kwargs)
        return env

    # On most env, SubprocVecEnv does not help and is quite memory hungry
    # therefore we use DummyVecEnv by default
    env = make_vec_env(
        make_env,
        n_envs=n_envs,
        seed=self.seed,
        env_kwargs=self.env_kwargs,
        monitor_dir=log_dir,
        wrapper_class=self.env_wrapper,
        vec_env_cls=self.vec_env_class,
        vec_env_kwargs=self.vec_env_kwargs,
        monitor_kwargs=self.monitor_kwargs,
    )

    if self.vec_env_wrapper is not None:
        env = self.vec_env_wrapper(env)

    # Wrap the env into a VecNormalize wrapper if needed
    # and load saved statistics when present
    env = self._maybe_normalize(env, eval_env)

    # Optional Frame-stacking
    if self.frame_stack is not None:
        n_stack = self.frame_stack
        env = VecFrameStack(env, n_stack)
        if self.verbose > 0:
            print(f"Stacking {n_stack} frames")
    env = VecTransposeImage(env)  # <-- my new addition
    env = PongWrapperori(env)     # <-- my new addition
    if not is_vecenv_wrapped(env, VecTransposeImage):
        wrap_with_vectranspose = False
        if isinstance(env.observation_space, gym.spaces.Dict):
            # If even one of the keys is a image-space in need of transpose, apply transpose
            # If the image spaces are not consistent (for instance one is channel first,
            # the other channel last), VecTransposeImage will throw an error
            for space in env.observation_space.spaces.values():
                wrap_with_vectranspose = wrap_with_vectranspose or (
                    is_image_space(space) and not is_image_space_channels_first(space)
                )
        else:
            wrap_with_vectranspose = is_image_space(env.observation_space) and not is_image_space_channels_first(
                env.observation_space
            )

        if wrap_with_vectranspose:
            if self.verbose >= 1:
                print("Wrapping the env in a VecTransposeImage.")
            env = VecTransposeImage(env)

    return env

However, it still gives an error:

Default hyperparameters for environment (ones being tuned will be overridden):
OrderedDict([('batch_size', 256), ('clip_range', 'lin_0.1'), ('ent_coef', 0.01), ('env_wrapper', ['stable_baselines3.common.atari_wrappers.AtariWrapper']), ('frame_stack', 4), ('learning_rate', 'lin_2.5e-4'), ('n_envs', 8), ('n_epochs', 4), ('n_steps', 128), ('n_timesteps', 20000000.0), ('policy', 'MlpPolicy'), ('vf_coef', 0.5)])
Using 8 environments
Creating test environment
A.L.E: Arcade Learning Environment (version 0.7.4+069f8bd) [Powered by Stella]
Stacking 4 frames
Stacking 4 frames
Using cuda device
Wrapping the env with a Monitor wrapper
Wrapping the env in a DummyVecEnv.
/home/l/miniconda3/envs/torchmydsoan/lib/python3.7/site-packages/stable_baselines3/ppo/ppo.py:152: UserWarning: You have specified a mini-batch size of 256, but because the RolloutBuffer is of size n_steps * n_envs = 128, after every 0 untruncated mini-batches, there will be a truncated mini-batch of size 128
We recommend using a batch_size that is a factor of n_steps * n_envs. Info: (n_steps=128 and n_envs=1)
  f"You have specified a mini-batch size of {batch_size},"
Log path: logs/ppo/PongNoFrameskip-v4_35
Traceback (most recent call last):
  File "train.py", line 4, in <module>
    train()
  File "/home/l/Downloads/rl-baselines3-zoo/rl_zoo3/train.py", line 269, in train
    exp_manager.learn(model)
  File "/home/l/Downloads/rl-baselines3-zoo/rl_zoo3/exp_manager.py", line 270, in learn
    model.learn(self.n_timesteps, **kwargs)
  File "/home/l/miniconda3/envs/torchmydsoan/lib/python3.7/site-packages/stable_baselines3/ppo/ppo.py", line 327, in learn
    progress_bar=progress_bar,
  File "/home/l/miniconda3/envs/torchmydsoan/lib/python3.7/site-packages/stable_baselines3/common/on_policy_algorithm.py", line 255, in learn
    progress_bar,
  File "/home/l/miniconda3/envs/torchmydsoan/lib/python3.7/site-packages/stable_baselines3/common/base_class.py", line 489, in _setup_learn
    self._last_obs = self.env.reset()  # pytype: disable=annotation-type-mismatch
  File "/home/l/miniconda3/envs/torchmydsoan/lib/python3.7/site-packages/stable_baselines3/common/vec_env/dummy_vec_env.py", line 64, in reset
    self._save_obs(env_idx, obs)
  File "/home/l/miniconda3/envs/torchmydsoan/lib/python3.7/site-packages/stable_baselines3/common/vec_env/dummy_vec_env.py", line 94, in _save_obs
    self.buf_obs[key][env_idx] = obs
ValueError: could not broadcast input array from shape (8,512) into shape (512)

This error seems to occur because there are 8 parallel environments, so the observations have shape (8, 512), while my declared observation space has shape (512,). I then changed the observation space of the wrapper to (8, 512) and ran it again, but it still reported an error:

Using 8 environments
Creating test environment
A.L.E: Arcade Learning Environment (version 0.7.4+069f8bd) [Powered by Stella]
Stacking 4 frames
Stacking 4 frames
Using cuda device
Wrapping the env with a Monitor wrapper
Wrapping the env in a DummyVecEnv.
/home/l/miniconda3/envs/torchmydsoan/lib/python3.7/site-packages/stable_baselines3/ppo/ppo.py:152: UserWarning: You have specified a mini-batch size of 256, but because the RolloutBuffer is of size n_steps * n_envs = 128, after every 0 untruncated mini-batches, there will be a truncated mini-batch of size 128
We recommend using a batch_size that is a factor of n_steps * n_envs. Info: (n_steps=128 and n_envs=1)
  f"You have specified a mini-batch size of {batch_size},"
Log path: logs/ppo/PongNoFrameskip-v4_36
Traceback (most recent call last):
  File "train.py", line 4, in <module>
    train()
  File "/home/l/Downloads/rl-baselines3-zoo/rl_zoo3/train.py", line 269, in train
    exp_manager.learn(model)
  File "/home/l/Downloads/rl-baselines3-zoo/rl_zoo3/exp_manager.py", line 270, in learn
    model.learn(self.n_timesteps, **kwargs)
  File "/home/l/miniconda3/envs/torchmydsoan/lib/python3.7/site-packages/stable_baselines3/ppo/ppo.py", line 327, in learn
    progress_bar=progress_bar,
  File "/home/l/miniconda3/envs/torchmydsoan/lib/python3.7/site-packages/stable_baselines3/common/on_policy_algorithm.py", line 262, in learn
    continue_training = self.collect_rollouts(self.env, callback, self.rollout_buffer, n_rollout_steps=self.n_steps)
  File "/home/l/miniconda3/envs/torchmydsoan/lib/python3.7/site-packages/stable_baselines3/common/on_policy_algorithm.py", line 181, in collect_rollouts
    new_obs, rewards, dones, infos = env.step(clipped_actions)
  File "/home/l/miniconda3/envs/torchmydsoan/lib/python3.7/site-packages/stable_baselines3/common/vec_env/base_vec_env.py", line 162, in step
    return self.step_wait()
  File "/home/l/miniconda3/envs/torchmydsoan/lib/python3.7/site-packages/stable_baselines3/common/vec_env/dummy_vec_env.py", line 44, in step_wait
    self.actions[env_idx]
  File "/home/l/miniconda3/envs/torchmydsoan/lib/python3.7/site-packages/stable_baselines3/common/monitor.py", line 94, in step
    observation, reward, done, info = self.env.step(action)
  File "/home/l/miniconda3/envs/torchmydsoan/lib/python3.7/site-packages/gym/core.py", line 323, in step
    observation, reward, done, info = self.env.step(action)
  File "/home/l/miniconda3/envs/torchmydsoan/lib/python3.7/site-packages/stable_baselines3/common/vec_env/base_vec_env.py", line 162, in step
    return self.step_wait()
  File "/home/l/miniconda3/envs/torchmydsoan/lib/python3.7/site-packages/stable_baselines3/common/vec_env/vec_transpose.py", line 95, in step_wait
    observations, rewards, dones, infos = self.venv.step_wait()
  File "/home/l/miniconda3/envs/torchmydsoan/lib/python3.7/site-packages/stable_baselines3/common/vec_env/vec_frame_stack.py", line 48, in step_wait
    observations, rewards, dones, infos = self.venv.step_wait()
  File "/home/l/miniconda3/envs/torchmydsoan/lib/python3.7/site-packages/stable_baselines3/common/vec_env/dummy_vec_env.py", line 44, in step_wait
    self.actions[env_idx]
IndexError: invalid index to scalar variable.

Now I have no idea how to find a way to train from environments wrapped with the Atari feature extractor. Maybe there is an easier way?


qgallouedec commented 1 year ago

> env = PongWrapperori(env)

You are trying to wrap a vectorized environment with a non-vectorized wrapper (PongWrapperori is a gym.ObservationWrapper, which expects a single environment). You could either vectorize your wrapper or wrap the environment before vectorizing it.
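
For illustration, here is a minimal sketch of the vectorized variant (the class name VecFeatureExtractor and its features_dim argument are made up for this sketch, they are not part of SB3), assuming the feature extractor of the pre-trained PPO model from above:

import gym
import numpy as np
import torch as th

from stable_baselines3.common.vec_env import VecEnvWrapper


class VecFeatureExtractor(VecEnvWrapper):
    """Replace batched image observations with features from a frozen SB3 extractor."""

    def __init__(self, venv, features_extractor, features_dim=512):
        obs_space = gym.spaces.Box(low=-np.inf, high=np.inf, shape=(features_dim,), dtype=np.float32)
        super().__init__(venv, observation_space=obs_space)
        self.features_extractor = features_extractor.eval()

    def _encode(self, obs):
        with th.no_grad():
            # Note: SB3 usually scales image observations to [0, 1] before the CNN,
            # so dividing obs by 255 here may be needed to match the pre-trained extractor.
            features = self.features_extractor(th.as_tensor(obs, dtype=th.float32))
        return features.cpu().numpy()

    def reset(self):
        return self._encode(self.venv.reset())

    def step_wait(self):
        obs, rewards, dones, infos = self.venv.step_wait()
        return self._encode(obs), rewards, dones, infos

It would then be applied after VecFrameStack and VecTransposeImage, e.g. env = VecFeatureExtractor(env, model.policy.features_extractor), so the pre-trained CNN still receives the channel-first, 4-frame input it was trained on.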

qgallouedec commented 1 year ago

That said, can you give some context to your question? Why do you need to use a wrapper that replaces the observation with a feature vector? Has this feature extractor been previously trained?

liruiluo commented 1 year ago

> env = PongWrapperori(env)
>
> You are trying to wrap a vectorized environment with a non-vectorized wrapper. You could either vectorize your wrapper or wrap the environment before vectorizing it.

I will try it soon

liruiluo commented 1 year ago

> That said, can you give some context to your question? Why do you need to use a wrapper that replaces the observation with a feature vector? Has this feature extractor been previously trained?

I have now added some background information about my problem in the first few lines of the issue.

araffin commented 1 year ago

Hello, I actually have a full video (and open-source code) about decoupling feature extraction and control (applied to RL racing): https://youtu.be/DUqssFvcSOY

The code is here: https://github.com/araffin/aae-train-donkeycar/blob/live-twitch-2/ae/wrapper.py

If you want to use multiple envs, you should indeed use a VecEnv wrapper instead (see the documentation and source code).
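
For illustration, a minimal sketch of how the pieces could be combined outside the zoo, re-using the VecFeatureExtractor sketch from the earlier comment (the model path and hyperparameters are illustrative; frame stacking and transposition stay before feature extraction so the pre-trained CNN sees the input format it was trained on):

from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_atari_env
from stable_baselines3.common.vec_env import VecFrameStack, VecTransposeImage

# Pre-trained agent whose CNN is used as a frozen feature extractor
pretrained = PPO.load("rl-trained-agents/ppo/PongNoFrameskip-v4_1/PongNoFrameskip-v4.zip", device="cpu")

env = make_atari_env("PongNoFrameskip-v4", n_envs=8, seed=0)
env = VecFrameStack(env, n_stack=4)   # the CNN expects 4 stacked frames
env = VecTransposeImage(env)          # channel-first, as during pre-training
env = VecFeatureExtractor(env, pretrained.policy.features_extractor)

# Train an MLP policy on the 512-dim feature observations
model = PPO("MlpPolicy", env, n_steps=128, batch_size=256, verbose=1)
model.learn(total_timesteps=1_000_000)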

liruiluo commented 1 year ago

> Hello, I actually have a full video (and open-source code) about decoupling feature extraction and control (applied to RL racing): https://youtu.be/DUqssFvcSOY
>
> The code is here: https://github.com/araffin/aae-train-donkeycar/blob/live-twitch-2/ae/wrapper.py
>
> If you want to use multiple envs, you should indeed use a VecEnv wrapper instead (see the documentation and source code).

I have now watched the YouTube tutorial, and it's a really good idea. But I'm a little curious: why can the features that the autoencoder extracts from the image be applied to downstream control tasks? My idea was to train the model end-to-end and use the rewards during training to shape the encoder, so that the encoder captures features that are actually useful for the downstream control task.

araffin commented 1 year ago

> But I'm a little curious: why can the features that the autoencoder extracts from the image be applied to downstream control tasks?

The idea is that by learning to de-noise the image, the autoencoder learns interesting features about the dataset, notably detecting the road, curves, and other features that can be re-used for control. It is true that it is not targeted at control, but it will also be more robust to changes in illumination, easier to debug (you can reconstruct what was learned and play with each dimension), and easier to transfer from one task to another (as the features can be shared between tasks).
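
For illustration, a minimal sketch of this pattern as a (non-vectorized) observation wrapper; the latent_dim attribute and encode() method are hypothetical placeholders for whatever the pre-trained autoencoder exposes, not the actual API of the linked repository:

import gym
import numpy as np


class AutoencoderFeatureWrapper(gym.ObservationWrapper):
    """Replace image observations with the latent vector of a frozen, pre-trained autoencoder."""

    def __init__(self, env, autoencoder):
        super().__init__(env)
        # The autoencoder is assumed to be already trained (e.g. on reconstruction)
        # and kept frozen: only the downstream policy is trained on its latents.
        self.autoencoder = autoencoder
        self.observation_space = gym.spaces.Box(
            low=-np.inf, high=np.inf, shape=(autoencoder.latent_dim,), dtype=np.float32
        )

    def observation(self, obs):
        return self.autoencoder.encode(obs).astype(np.float32)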

You can read more about self-supervised learning if you want even more examples ;)