DLR-RM / stable-baselines3

PyTorch version of Stable Baselines, reliable implementations of reinforcement learning algorithms.
https://stable-baselines3.readthedocs.io
MIT License

`HERReplayBuffer`: Runtime error The size of tensor a (53) must match the size of tensor b (256) at non-singleton dimension 0 #1337

Closed by qgallouedec 1 year ago

qgallouedec commented 1 year ago

Thanks! I set learning_starts to 600, as my environment has a time limit of 300. The error mentioned is solved; however, I now get the following:

sac.py", line 245, in train
target_q_values = replay_data.rewards + (1 - replay_data.dones) * self.gamma * next_q_values
RuntimeError: The size of tensor a (53) must match the size of tensor b (256) at non-singleton dimension 0 

Originally posted by @tudorjnu in https://github.com/DLR-RM/stable-baselines3/issues/1335#issuecomment-1434713057

qgallouedec commented 1 year ago

From https://github.com/DLR-RM/stable-baselines3/issues/1335#issuecomment-1434818965

import numpy as np

from stable_baselines3 import HerReplayBuffer, SAC

import gym
from gym import spaces

class TestEnv(gym.GoalEnv):
    def __init__(self):

        self.observation_space = spaces.Dict(
            dict(
                an_observation_1=spaces.Box(0, 1, (1,), dtype=np.float32),
                an_observation_2=spaces.Box(0, 1, (1,), dtype=np.float32),
                achieved_goal=spaces.Box(0, 1, (1,), dtype=np.float32),
                desired_goal=spaces.Box(0, 1, (1,), dtype=np.float32),
            )
        )

        self.action_space = spaces.Box(0, 1, (1,), dtype=np.float32)

        self.current_step = 0
        self.ep_length = 10

    def reset(self):
        self.current_step = 0
        state = self._generate_next_state()
        return state

    def step(self, action):
        obs = self._generate_next_state()
        self.current_step += 1
        done = self.current_step >= self.ep_length
        return obs, -1, done, {}

    def _generate_next_state(self):
        state = {}
        for k, space in self.observation_space.spaces.items():
            state[k] = space.sample()
        return state

    def render(self, mode: str = "human") -> None:
        pass

    def compute_reward(self, achieved_goal, desired_goal, info):
        # Non-vectorized: returns a single value no matter how many goals are
        # passed in (this is what triggers the error, see the next comment)
        return np.array(-1, np.float32)

env = TestEnv()

model = SAC(
    "MultiInputPolicy",
    env,
    replay_buffer_class=HerReplayBuffer,
    learning_starts=1000,
    verbose=1,
)

model.learn(3000, progress_bar=True)

qgallouedec commented 1 year ago

The problem comes from the compute_reward method: it has to be vectorized.

def compute_reward(self, achieved_goal, desired_goal, info):
    return np.zeros(achieved_goal.shape[0], np.float32)
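
To see why it has to be vectorized: when sampling, HerReplayBuffer recomputes the reward for a whole batch of transitions with relabelled goals in a single call, so compute_reward receives arrays with a leading batch dimension. A simplified sketch of that call pattern (not the actual buffer code; the shapes assume the 1-dim goals of the script above):

batch_size = 256
achieved_goals = np.random.rand(batch_size, 1).astype(np.float32)  # batch of achieved goals
desired_goals = np.random.rand(batch_size, 1).astype(np.float32)   # batch of relabelled desired goals
infos = [{} for _ in range(batch_size)]

rewards = env.compute_reward(achieved_goals, desired_goals, infos)
assert np.asarray(rewards).shape == (batch_size,)  # one reward per sampled transition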

@araffin I'm pretty sure that this error will occur often. I would add a check in env checker, what do you think?

qgallouedec commented 1 year ago

@araffin I'm pretty sure that this error will occur often. I would add a check in env checker, what do you think?

Let me guess, you're going to tell me it's already done ;)

https://github.com/DLR-RM/stable-baselines3/blob/12e9917c24dc23d7de7694a924f017c6a8e9a6ce/stable_baselines3/common/env_checker.py#L129-L138
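
For completeness, the checker can be run directly on the reproduction env above; check_env is the public entry point to the linked env_checker code:

from stable_baselines3.common.env_checker import check_env

check_env(TestEnv(), warn=True)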

wilhem commented 5 days ago

The problem comes from the compute_reward method: it has to be vectorized.

def compute_reward(self, achieved_goal, desired_goal, info):
    return np.zeros(achieved_goal.shape[0], np.float32)

I got stuck at the same point. Moving to gymnasium, I had to vectorize the compute_reward method as expected by SB3.

The problem here is that, since the achieved_goal array contains 3 scalars, the resulting reward is the same value repeated three times, like:

[0, 0, 0]

But running the checker leads to the following error:

  File "/home/ubuntu/Downloads/deepbots-env/lib/python3.10/site-packages/stable_baselines3/common/env_checker.py", line 322, in _check_returned_values
    _check_goal_env_compute_reward(obs, env, float(reward), info)
TypeError: only length-1 arrays can be converted to Python scalars

What should the reward look like? The checker expects a float (the reward), but due to vectorization, compute_reward outputs an array.

qgallouedec commented 5 days ago

Yes, the reward should be an array. The function should take a batch of achieved/desired goals and return a 1-D array of rewards.
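
A common pattern that satisfies both callers, sketched here with a hypothetical goal of shape (3,) and a distance threshold of 0.05 (neither comes from this thread), is to reduce over the last axis, so a single goal yields a scalar and a batch of goals yields a 1-D array:

def compute_reward(self, achieved_goal, desired_goal, info):
    # (3,) inputs -> scalar; (batch_size, 3) inputs -> (batch_size,) array
    distance = np.linalg.norm(achieved_goal - desired_goal, axis=-1)
    return -(distance > 0.05).astype(np.float32)

With this, float(reward) inside step() and in the env checker keeps working for a single transition, while HerReplayBuffer gets one reward per sampled goal.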