DLR-RM / stable-baselines3

PyTorch version of Stable Baselines, reliable implementations of reinforcement learning algorithms.
https://stable-baselines3.readthedocs.io
MIT License

HER with Dict Observation Space #1335

Closed tudorjnu closed 1 year ago

tudorjnu commented 1 year ago

❓ Question

Hello,

I am looking to use my custom environment with HER. Currently, it works perfectly as a regular environment and passes all the checks. The observation space, however, is a Dict, as it contains multiple images, joint positions, joint velocities, and so on. In order to wrap the environment, I added the required method (compute_reward) and modified the observation space by creating a new Dict observation space with the keys observation, achieved_goal and desired_goal. In doing so, my observation value is itself another Dict.

Is there any way to use this kind of MultiInput policy with a nested observation, or is it simply not supported at the moment? I was also considering writing my own custom policy to work around the issue. Thank you!


qgallouedec commented 1 year ago

Hi,

If I understand correctly, your observation space looks something like:

observation_space = Dict({
    "observation": Dict({
        "joint_pos": Box(...),
        "joint_vel": Box(...),
        "camera": Box(...), # image
    }),
    "desired_goal": Box(...),
    "achieved_goal": Box(...),
})

Is this correct? If so, I'm pretty sure that turning it to

observation_space = Dict({
    "joint_pos": Box(...),
    "joint_vel": Box(...),
    "camera": Box(...), # image
    "desired_goal": Box(...),
    "achieved_goal": Box(...),
})

and using the branch from #704 should work. Please keep me posted.
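
For reference, a rough, untested sketch of an ObservationWrapper that does this flattening (the wrapper name and keys are just illustrative, adapt them to your env):

import gym
from gym import spaces


class FlattenGoalObsWrapper(gym.ObservationWrapper):
    # Lift the keys of the nested "observation" Dict to the top level,
    # so the result is a single flat Dict of Boxes.
    def __init__(self, env):
        super().__init__(env)
        inner = env.observation_space.spaces["observation"].spaces
        self.observation_space = spaces.Dict(
            {
                **inner,
                "achieved_goal": env.observation_space.spaces["achieved_goal"],
                "desired_goal": env.observation_space.spaces["desired_goal"],
            }
        )

    def observation(self, obs):
        flat = dict(obs["observation"])
        flat["achieved_goal"] = obs["achieved_goal"]
        flat["desired_goal"] = obs["desired_goal"]
        return flat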

araffin commented 1 year ago

Is this correct? If so, I'm pretty sure that turning it to

For using it with HER, you might also need to merge all the observations into one Box (@qgallouedec I'm not sure if we are using the "observation" key explicitly or not now).

In case of doubt, please use the env checker (in that case, it should tell you that SB3 doesn't support nested dicts).
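
For reference, running the checker is just:

from stable_baselines3.common.env_checker import check_env

check_env(env, warn=True)  # should complain if the observation space is unsupported (e.g. a nested Dict)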

qgallouedec commented 1 year ago

to merge all the observations into one box

Caution, here you have an image, so merging could be detrimental, as the observation would no longer be preprocessed as an image.

I'm not sure if we are using the "observation" explicitly or not now

I'm pretty sure that we don't. If my solution works for @tudorjnu, I recommend removing "observation" here

https://github.com/DLR-RM/stable-baselines3/blob/12e9917c24dc23d7de7694a924f017c6a8e9a6ce/stable_baselines3/common/env_checker.py#L120

araffin commented 1 year ago

so merging could be detrimental as the observation won't be preprocessed as an image.

unless you merge along the channel axis, no?

If my solution works for @tudorjnu, I recommend removing "observation" here

true.

qgallouedec commented 1 year ago

unless you merge along the channel axis, no?

From what I understand, the observations are multimodal. How would you merge a Box with shape, let's say, (3, 84, 84) (image) with one with shape (7,) (joint positions)?

araffin commented 1 year ago

From what I understand, the observations are multimodal. How would you merge a Box with shape, let's say, (3, 84, 84) (image) with one with shape (7,) (joint positions)?

true, I misread. Yes, that's what the MultiInputPolicy is for.

tudorjnu commented 1 year ago

Hello all and thank you for the fast responses!

Yes, @qgallouedec, that is correct. My observation space is indeed

observation_space = Dict({
    "observation": Dict({
        "joint_pos": Box(...),
        "joint_vel": Box(...),
        "camera": Box(...), # image
    }),
    "desired_goal": Box(...),
    "achieved_goal": Box(...),
})

so merging as specified above is not a solution.

As a quick update, I have tried the solution

observation_space = Dict({
    "joint_pos": Box(...),
    "joint_vel": Box(...),
    "camera": Box(...), # image
    "desired_goal": Box(...),
    "achieved_goal": Box(...),
})

and I get the following error (I took the image out just to simplify the problem):

AssertionError: A goal conditioned env must contain 3 observation keys: `observation`, `desired_goal`, and `achieved_goal`.The current observation contains 4 keys: ['joint_pos', 'joint_vel', 'achieved_goal', 'desired_goal']

Running the environment yields a key error:

self.obs_shape = get_obs_shape(self.env.observation_space.spaces["observation"])
KeyError: 'observation'

Should I wait for the merge? Thank you again! :)

tudorjnu commented 1 year ago

I used the branch feat/multienv-her-alt from git@github.com:qgallouedec/stable-baselines3.git and I get the following error, although it does seem to create the environment:

/her_replay_buffer.py", line 174, in sample                                                                                                          
sampled_idx = np.random.choice(valid_indices, size=batch_size, replace=True)                                                                                                                                                    
File "mtrand.pyx", line 934, in numpy.random.mtrand.RandomState.choice                                                                                                                                                            
ValueError: 'a' cannot be empty unless no samples are taken

Sorry, I just realised that earlier I had cloned the main branch by mistake instead of this one.

qgallouedec commented 1 year ago

I think this error is actually not related to the subject of this issue. It is most likely that you are trying to train the model before the completion of the first episode. You should be able to solve it by increasing the learning_starts argument of the model. I thought I had already dealt with this in another issue, but I can't find it... If the error remains, please open another issue (in your case with the custom env template); it allows us to keep the topics well organized.
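
Something along these lines (illustrative values; env stands for your goal-conditioned environment):

from stable_baselines3 import SAC, HerReplayBuffer

model = SAC(
    "MultiInputPolicy",
    env,  # your goal-conditioned environment
    replay_buffer_class=HerReplayBuffer,
    learning_starts=600,  # larger than one full episode, so complete episodes are stored before training
    verbose=1,
)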

tudorjnu commented 1 year ago

Thanks! I set learning_starts to 600, as my environment has a time limit of 300. The error mentioned above is solved; however, I now get the following:

sac.py", line 245, in train
target_q_values = replay_data.rewards + (1 - replay_data.dones) * self.gamma * next_q_values
RuntimeError: The size of tensor a (53) must match the size of tensor b (256) at non-singleton dimension 0

qgallouedec commented 1 year ago

Can you provide a minimal code example to reproduce this? (I recommend using the custom env template)

tudorjnu commented 1 year ago

Sure, no problem. I created this super simple environment that gives me the same error:

import numpy as np

from stable_baselines3 import HerReplayBuffer, SAC

import gym
from gym import spaces

class TestEnv(gym.GoalEnv):
    def __init__(self):

        self.observation_space = spaces.Dict(
            dict(
                an_observation_1=spaces.Box(0, 1, (1,), dtype=np.float32),
                an_observation_2=spaces.Box(0, 1, (1,), dtype=np.float32),
                achieved_goal=spaces.Box(0, 1, (1,), dtype=np.float32),
                desired_goal=spaces.Box(0, 1, (1,), dtype=np.float32),
            )
        )

        self.action_space = spaces.Box(0, 1, (1,), dtype=np.float32)

        self.current_step = 0
        self.ep_length = 10

    def reset(self):
        self.current_step = 0
        state = self._generate_next_state()
        return state

    def step(self, action):
        obs = self._generate_next_state()
        self.current_step += 1
        done = self.current_step >= self.ep_length
        return obs, -1, done, {}

    def _generate_next_state(self):
        state = {}
        for k, space in self.observation_space.spaces.items():
            state[k] = space.sample()
        return state

    def render(self, mode: str = "human") -> None:
        pass

    def compute_reward(self, achieved_goal, desired_goal, info):
        # constant reward, returned as a single scalar array
        return np.array(-1, np.float32)

env = TestEnv()

model = SAC(
    "MultiInputPolicy",
    env,
    replay_buffer_class=HerReplayBuffer,
    learning_starts=1000,
    verbose=1,
)

model.learn(3000, progress_bar=True)

tudorjnu commented 1 year ago

Hello and thanks for the commit!

I just wanted to mention that the code above is still not working. Thanks!

araffin commented 1 year ago

I just wanted to mention that the code above is still not working. Thanks!

Which one? The one with the nested observation? If you use the env checker, it should explain why it is not working.

tudorjnu commented 1 year ago

This one:

sac.py", line 245, in train
target_q_values = replay_data.rewards + (1 - replay_data.dones) * self.gamma * next_q_values
RuntimeError: The size of tensor a (53) must match the size of tensor b (256) at non-singleton dimension 0 

Which makes me realise I should have written this in the other issue. Sorry for that.

Also, the env_checker will still raise an error because 'observation' is not among the keys (from the check linked above). If the flat dictionary solution above is to be supported, I reckon the verification could be >= 3 rather than ==.
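
Something like this rough sketch is what I mean (just illustrative, not the actual SB3 source):

# illustrative check only: require the goal keys rather than exactly three named keys
required_keys = {"achieved_goal", "desired_goal"}
assert required_keys.issubset(observation_space.spaces.keys()), (
    "A goal conditioned env must contain at least the "
    "`achieved_goal` and `desired_goal` observation keys"
)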