> env = PongWrapperori(env)

You are trying to wrap a vectorized environment with an unvectorized wrapper. You could either vectorize your wrapper or wrap the environment before vectorizing it.
That said, can you give some context for your question? Why do you need to use a wrapper that replaces the observation with a feature vector? Has this feature extractor been previously trained?
> env = PongWrapperori(env)
> You are trying to wrap a vectorized environment with an unvectorized wrapper. You could either vectorize your wrapper or wrap the environment before vectorizing it.

I will try it soon.

> That said, can you give some context for your question? Why do you need to use a wrapper that replaces the observation with a feature vector? Has this feature extractor been previously trained?

I have now added some background information on my problem in the first few lines of the question below.
Hello, I actually have a full video (and open-source code) about decoupling feature extraction and control (applied to RL racing): https://youtu.be/DUqssFvcSOY
The code is here: https://github.com/araffin/aae-train-donkeycar/blob/live-twitch-2/ae/wrapper.py
If you want to use multiple envs, you should indeed use a VecEnv wrapper instead (see the documentation and source code).
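For reference, a minimal sketch of what such a VecEnv wrapper could look like (a hypothetical `FeatureExtractorVecWrapper`, not the code from the repo above; it assumes a frozen extractor that maps a channel-first uint8 batch of shape (n_envs, 4, 84, 84) to (n_envs, 512) features):

```python
import gym
import numpy as np
import torch as th

from stable_baselines3.common.vec_env import VecEnvWrapper


class FeatureExtractorVecWrapper(VecEnvWrapper):
    """Replace image observations of a VecEnv with features from a frozen extractor."""

    def __init__(self, venv, features_extractor, features_dim=512):
        # The wrapped VecEnv now emits per-env feature vectors instead of images
        observation_space = gym.spaces.Box(
            low=-np.inf, high=np.inf, shape=(features_dim,), dtype=np.float32
        )
        super().__init__(venv, observation_space=observation_space)
        self.features_extractor = features_extractor.eval()

    def _extract(self, obs):
        with th.no_grad():
            # Assumption: the extractor was trained on images normalized to [0, 1]
            obs_tensor = th.as_tensor(obs).float() / 255.0
            return self.features_extractor(obs_tensor).cpu().numpy()

    def reset(self):
        return self._extract(self.venv.reset())

    def step_wait(self):
        obs, rewards, dones, infos = self.venv.step_wait()
        return self._extract(obs), rewards, dones, infos
```

Because it operates on the whole batch, it stays consistent with the (n_envs, features_dim) shapes that the errors below complain about.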
> Hello, I actually have a full video (and open-source code) about decoupling feature extraction and control (applied to RL racing): https://youtu.be/DUqssFvcSOY
> The code is here: https://github.com/araffin/aae-train-donkeycar/blob/live-twitch-2/ae/wrapper.py
> If you want to use multiple envs, you should indeed use a VecEnv wrapper instead (see the documentation and source code).
Now that I've watched the YouTube tutorial, I think it's a really good idea. But I'm a little curious: why can the features obtained from the image by the autoencoder be applied to downstream control tasks? My idea would be to train a model end-to-end and use the rewards during training to shape the encoder, so that the encoder actually captures features that are useful for the downstream control task.
> But I'm a little curious: why can the features obtained from the image by the autoencoder be applied to downstream control tasks?
The idea is that by learning to de-noise the image, the autoencoder learns interesting features of the dataset, notably detecting the road, curves and other features that can be re-used for control. It is true that it is not targeted at control, but it will also be more robust to changes in illumination, it will be easier to debug (you can reconstruct what was learned and play with each dimension) and it will be easier to transfer from one task to another (as the features can be shared between tasks).
You can read more about self-supervised learning if you want even more examples ;)
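As a toy illustration of that idea (not the code from the video; the architecture and the 64x64 input size are made up), a denoising autoencoder trains the encoder by corrupting inputs and reconstructing the clean image; only the encoder is kept for control:

```python
import torch as th
import torch.nn as nn
import torch.nn.functional as F


class DenoisingAE(nn.Module):
    """Toy denoising autoencoder: the encoder is what gets reused for control."""

    def __init__(self, latent_dim=32):
        super().__init__()
        # Encoder: 3x64x64 image -> latent vector
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),   # -> 32x32x32
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),  # -> 64x16x16
            nn.Flatten(),
            nn.Linear(64 * 16 * 16, latent_dim),
        )
        # Decoder: latent vector -> reconstructed 3x64x64 image
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 64 * 16 * 16), nn.ReLU(),
            nn.Unflatten(1, (64, 16, 16)),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))


def train_step(model, optimizer, batch, noise_std=0.1):
    """One de-noising step: corrupt the input, reconstruct the clean target."""
    noisy = (batch + noise_std * th.randn_like(batch)).clamp(0.0, 1.0)
    loss = F.mse_loss(model(noisy), batch)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```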
❓ Question
Some recent work in reinforcement learning shows that decoupling feature extraction from reinforcement learning can improve data efficiency, i.e., downstream tasks can be trained directly on top of a pre-trained feature extractor. Inspired by these works, I tried to take the feature extractor of an SB3 model trained on the Atari game Pong and apply it directly to downstream policy learning.
I've tried wrapping the Atari environment with model.policy.features_extractor and training an MLP-based agent on top of it (this feature extractor comes from the pre-trained model in rl-baselines3-zoo and maps the high-dimensional four-frame stacked images to feature vectors), but I have been unsuccessful. Here is a log of my attempts.
First, I selected the Pong environment in Atari and defined an environment wrapper that extracts the stacked four-frame images into 512-dimensional features. (Using the following code, the high-dimensional images can be successfully extracted into an array of shape (8, 512).)
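The wrapper code itself did not survive in this post. A minimal sketch of what it plausibly looked like (hypothetical; the model path is a placeholder, and the extractor is taken from a zoo pre-trained PPO model as described above):

```python
import gym
import numpy as np
import torch as th

from stable_baselines3 import PPO


class PongWrapperori(gym.ObservationWrapper):
    """Replace the stacked-frame image observation with 512-d CNN features."""

    def __init__(self, env, model_path="path/to/ppo-PongNoFrameskip-v4.zip"):
        super().__init__(env)
        model = PPO.load(model_path)
        self.features_extractor = model.policy.features_extractor.eval()
        self.observation_space = gym.spaces.Box(
            low=-np.inf, high=np.inf, shape=(512,), dtype=np.float32
        )

    def observation(self, obs):
        with th.no_grad():
            # Assumption: NatureCNN expects a float, channel-first batch in [0, 1]
            obs_tensor = th.as_tensor(np.asarray(obs)).float() / 255.0
            if obs_tensor.dim() == 3:  # single observation -> add batch dim
                obs_tensor = obs_tensor.unsqueeze(0)
            return self.features_extractor(obs_tensor).squeeze(0).cpu().numpy()
```

Applied to the batched observations of 8 parallel environments, this yields the (8, 512) array mentioned above.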
Then, I added this wrapper to the env_wrapper list under the atari entry in the YAML file.
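(The YAML entry was flattened in this post; judging from the hyperparameter dump in the error log below, it was presumably:)

```yaml
atari:
  env_wrapper:
    - stable_baselines3.common.atari_wrappers.AtariWrapper
    - stable_baselines3.common.vec_env.VecTransposeImage
    - rl_zoo3.wrappers.PongWrapperori
```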
I ran it, and the first error occurred:
Default hyperparameters for environment (ones being tuned will be overridden):
OrderedDict([('batch_size', 256), ('clip_range', 'lin_0.1'), ('ent_coef', 0.01), ('env_wrapper', ['stable_baselines3.common.atari_wrappers.AtariWrapper', 'stable_baselines3.common.vec_env.VecTransposeImage', 'rl_zoo3.wrappers.PongWrapperori']), ('frame_stack', 4), ('learning_rate', 'lin_2.5e-4'), ('n_envs', 8), ('n_epochs', 4), ('n_steps', 128), ('n_timesteps', 20000000.0), ('policy', 'MlpPolicy'), ('vf_coef', 0.5)])
Using 8 environments
Creating test environment
A.L.E: Arcade Learning Environment (version 0.7.4+069f8bd) [Powered by Stella]
Traceback (most recent call last):
  File "train.py", line 4, in <module>
    train()
  File "/home/l/Downloads/rl-baselines3-zoo/rl_zoo3/train.py", line 259, in train
    results = exp_manager.setup_experiment()
  File "/home/l/Downloads/rl-baselines3-zoo/rl_zoo3/exp_manager.py", line 225, in setup_experiment
    self.create_callbacks()
  File "/home/l/Downloads/rl-baselines3-zoo/rl_zoo3/exp_manager.py", line 539, in create_callbacks
    self.create_envs(self.n_eval_envs, eval_env=True),
  File "/home/l/Downloads/rl-baselines3-zoo/rl_zoo3/exp_manager.py", line 649, in create_envs
    monitor_kwargs=self.monitor_kwargs,
  File "/home/l/miniconda3/envs/torchmydsoan/lib/python3.7/site-packages/stable_baselines3/common/env_util.py", line 108, in make_vec_env
    return vec_env_cls([make_env(i + start_index) for i in range(n_envs)], **vec_env_kwargs)
  File "/home/l/miniconda3/envs/torchmydsoan/lib/python3.7/site-packages/stable_baselines3/common/vec_env/dummy_vec_env.py", line 25, in __init__
    self.envs = [fn() for fn in env_fns]
  File "/home/l/miniconda3/envs/torchmydsoan/lib/python3.7/site-packages/stable_baselines3/common/vec_env/dummy_vec_env.py", line 25, in <listcomp>
    self.envs = [fn() for fn in env_fns]
  File "/home/l/miniconda3/envs/torchmydsoan/lib/python3.7/site-packages/stable_baselines3/common/env_util.py", line 98, in _init
    env = wrapper_class(env, **wrapper_kwargs)
  File "/home/l/Downloads/rl-baselines3-zoo/rl_zoo3/utils.py", line 113, in wrap_env
    env = wrapper_class(env, **kwargs)
  File "/home/l/miniconda3/envs/torchmydsoan/lib/python3.7/site-packages/stable_baselines3/common/vec_env/vec_transpose.py", line 42, in __init__
    super().__init__(venv, observation_space=observation_space)
  File "/home/l/miniconda3/envs/torchmydsoan/lib/python3.7/site-packages/stable_baselines3/common/vec_env/base_vec_env.py", line 257, in __init__
    num_envs=venv.num_envs,
  File "/home/l/miniconda3/envs/torchmydsoan/lib/python3.7/site-packages/gym/core.py", line 238, in __getattr__
    return getattr(self.env, name)
  File "/home/l/miniconda3/envs/torchmydsoan/lib/python3.7/site-packages/gym/core.py", line 238, in __getattr__
    return getattr(self.env, name)
  File "/home/l/miniconda3/envs/torchmydsoan/lib/python3.7/site-packages/gym/core.py", line 238, in __getattr__
    return getattr(self.env, name)
  [Previous line repeated 6 more times]
AttributeError: 'AtariEnv' object has no attribute 'num_envs'
I think the possible reason for this problem is that sb3 wraps the environment first and then stacks the frames by default. So I changed the create_envs function in exp_manager to apply the wrapper after frame stacking, and deleted the wrapper from the YAML file. Here is the function:
def create_envs(self, n_envs: int, eval_env: bool = False, no_log: bool = False) -> VecEnv:
    """
    Create the environment and wrap it if necessary.
    ...
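The rest of the function body is elided above; the relevant change was presumably of this form (a sketch, not the actual exp_manager code):

```python
from rl_zoo3.wrappers import PongWrapperori  # the custom wrapper sketched above
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.vec_env import VecFrameStack

# Sketch of the assumed change: apply the feature-extractor wrapper
# on top of the already-stacked VecEnv instead of on each sub-environment.
env = make_vec_env("PongNoFrameskip-v4", n_envs=8)  # DummyVecEnv with 8 envs
env = VecFrameStack(env, n_stack=4)                 # stack 4 frames per env
env = PongWrapperori(env)  # a gym.Wrapper around a VecEnv: per the reply above,
                           # wrapping a vectorized env with an unvectorized wrapper
                           # is what causes the shape and indexing errors below
```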
However, it still gives an error:

Default hyperparameters for environment (ones being tuned will be overridden):
OrderedDict([('batch_size', 256), ('clip_range', 'lin_0.1'), ('ent_coef', 0.01), ('env_wrapper', ['stable_baselines3.common.atari_wrappers.AtariWrapper']), ('frame_stack', 4), ('learning_rate', 'lin_2.5e-4'), ('n_envs', 8), ('n_epochs', 4), ('n_steps', 128), ('n_timesteps', 20000000.0), ('policy', 'MlpPolicy'), ('vf_coef', 0.5)])
Using 8 environments
Creating test environment
A.L.E: Arcade Learning Environment (version 0.7.4+069f8bd) [Powered by Stella]
Stacking 4 frames
Stacking 4 frames
Using cuda device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
/home/l/miniconda3/envs/torchmydsoan/lib/python3.7/site-packages/stable_baselines3/ppo/ppo.py:152: UserWarning: You have specified a mini-batch size of 256, but because the `RolloutBuffer` is of size `n_steps * n_envs = 128`, after every 0 untruncated mini-batches, there will be a truncated mini-batch of size 128
We recommend using a `batch_size` that is a factor of `n_steps * n_envs`.
Info: (n_steps=128 and n_envs=1)
  f"You have specified a mini-batch size of {batch_size},"
Log path: logs/ppo/PongNoFrameskip-v4_35
Traceback (most recent call last):
  File "train.py", line 4, in <module>
    train()
  File "/home/l/Downloads/rl-baselines3-zoo/rl_zoo3/train.py", line 269, in train
    exp_manager.learn(model)
  File "/home/l/Downloads/rl-baselines3-zoo/rl_zoo3/exp_manager.py", line 270, in learn
    model.learn(self.n_timesteps, **kwargs)
  File "/home/l/miniconda3/envs/torchmydsoan/lib/python3.7/site-packages/stable_baselines3/ppo/ppo.py", line 327, in learn
    progress_bar=progress_bar,
  File "/home/l/miniconda3/envs/torchmydsoan/lib/python3.7/site-packages/stable_baselines3/common/on_policy_algorithm.py", line 255, in learn
    progress_bar,
  File "/home/l/miniconda3/envs/torchmydsoan/lib/python3.7/site-packages/stable_baselines3/common/base_class.py", line 489, in _setup_learn
    self._last_obs = self.env.reset()  # pytype: disable=annotation-type-mismatch
  File "/home/l/miniconda3/envs/torchmydsoan/lib/python3.7/site-packages/stable_baselines3/common/vec_env/dummy_vec_env.py", line 64, in reset
    self._save_obs(env_idx, obs)
  File "/home/l/miniconda3/envs/torchmydsoan/lib/python3.7/site-packages/stable_baselines3/common/vec_env/dummy_vec_env.py", line 94, in _save_obs
    self.buf_obs[key][env_idx] = obs
ValueError: could not broadcast input array from shape (8,512) into shape (512)

This error seems to be because the number of parallel environments is 8, so the observations have shape (8, 512), while my wrapper declares shape (512,). Then I changed the shape in the wrapper to (8, 512) and ran it again, but it still reported an error.
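(Hypothetically, mirroring the wrapper sketch above, that change would look like this:)

```python
import gym
import numpy as np

# Hypothetical edit inside PongWrapperori.__init__: declare the batched
# (n_envs, features_dim) shape instead of the per-env (features_dim,) shape.
observation_space = gym.spaces.Box(
    low=-np.inf, high=np.inf, shape=(8, 512), dtype=np.float32
)
```

The new run then failed as follows: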
Using 8 environments
Creating test environment
A.L.E: Arcade Learning Environment (version 0.7.4+069f8bd) [Powered by Stella]
Stacking 4 frames
Stacking 4 frames
Using cuda device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
/home/l/miniconda3/envs/torchmydsoan/lib/python3.7/site-packages/stable_baselines3/ppo/ppo.py:152: UserWarning: You have specified a mini-batch size of 256, but because the `RolloutBuffer` is of size `n_steps * n_envs = 128`, after every 0 untruncated mini-batches, there will be a truncated mini-batch of size 128
We recommend using a `batch_size` that is a factor of `n_steps * n_envs`.
Info: (n_steps=128 and n_envs=1)
  f"You have specified a mini-batch size of {batch_size},"
Log path: logs/ppo/PongNoFrameskip-v4_36
Traceback (most recent call last):
  File "train.py", line 4, in <module>
    train()
  File "/home/l/Downloads/rl-baselines3-zoo/rl_zoo3/train.py", line 269, in train
    exp_manager.learn(model)
  File "/home/l/Downloads/rl-baselines3-zoo/rl_zoo3/exp_manager.py", line 270, in learn
    model.learn(self.n_timesteps, **kwargs)
  File "/home/l/miniconda3/envs/torchmydsoan/lib/python3.7/site-packages/stable_baselines3/ppo/ppo.py", line 327, in learn
    progress_bar=progress_bar,
  File "/home/l/miniconda3/envs/torchmydsoan/lib/python3.7/site-packages/stable_baselines3/common/on_policy_algorithm.py", line 262, in learn
    continue_training = self.collect_rollouts(self.env, callback, self.rollout_buffer, n_rollout_steps=self.n_steps)
  File "/home/l/miniconda3/envs/torchmydsoan/lib/python3.7/site-packages/stable_baselines3/common/on_policy_algorithm.py", line 181, in collect_rollouts
    new_obs, rewards, dones, infos = env.step(clipped_actions)
  File "/home/l/miniconda3/envs/torchmydsoan/lib/python3.7/site-packages/stable_baselines3/common/vec_env/base_vec_env.py", line 162, in step
    return self.step_wait()
  File "/home/l/miniconda3/envs/torchmydsoan/lib/python3.7/site-packages/stable_baselines3/common/vec_env/dummy_vec_env.py", line 44, in step_wait
    self.actions[env_idx]
  File "/home/l/miniconda3/envs/torchmydsoan/lib/python3.7/site-packages/stable_baselines3/common/monitor.py", line 94, in step
    observation, reward, done, info = self.env.step(action)
  File "/home/l/miniconda3/envs/torchmydsoan/lib/python3.7/site-packages/gym/core.py", line 323, in step
    observation, reward, done, info = self.env.step(action)
  File "/home/l/miniconda3/envs/torchmydsoan/lib/python3.7/site-packages/stable_baselines3/common/vec_env/base_vec_env.py", line 162, in step
    return self.step_wait()
  File "/home/l/miniconda3/envs/torchmydsoan/lib/python3.7/site-packages/stable_baselines3/common/vec_env/vec_transpose.py", line 95, in step_wait
    observations, rewards, dones, infos = self.venv.step_wait()
  File "/home/l/miniconda3/envs/torchmydsoan/lib/python3.7/site-packages/stable_baselines3/common/vec_env/vec_frame_stack.py", line 48, in step_wait
    observations, rewards, dones, infos = self.venv.step_wait()
  File "/home/l/miniconda3/envs/torchmydsoan/lib/python3.7/site-packages/stable_baselines3/common/vec_env/dummy_vec_env.py", line 44, in step_wait
    self.actions[env_idx]
IndexError: invalid index to scalar variable.

Now I have no idea how to train from environments wrapped with the Atari feature extractor. Maybe there is an easy way?