DLR-RM / stable-baselines3

PyTorch version of Stable Baselines, reliable implementations of reinforcement learning algorithms.
https://stable-baselines3.readthedocs.io
MIT License

[Question] Creating a vectorized environment for SB3 #1959

Closed wilhem closed 2 days ago

wilhem commented 3 days ago

❓ Question

I'm having a hard time reimplementing my code while migrating it to Gymnasium. There are a few very interesting questions and answers on this topic, like here, here and here. As far as I understand, a vectorized environment (necessary for HER) is basically an environment where observations and rewards can be batched, which is more computationally efficient since a batch of training steps is executed in parallel.

So I have the following questions:

  1. The compute_reward method must be vectorized. But here the problem is the dimension of the reward (which is an array). Since a dictionary (obs['achieved_goal'] and obs['desired_goal']) is passed to that method, there is no batch size at all. So I would expect the reward to always be a scalar and not an array with dimension (batch_size, 1). Why? Because when I pass a batch size to the class (let's take PPO for example), it wouldn't pass batch_size x obs to a single compute_reward call, but would call compute_reward as many times as batch_size. Or am I wrong? If I'm wrong, then I do not understand why the step() or get_obs() methods do not output something with the dimension (batch_size, obs).
  2. Is there any example of a working compute_reward method which addresses the problem in point 1? Passing a single scalar or an array does not seem to work.

Checklist

wilhem commented 2 days ago

I'm still fighting with this problem and I do not understand how to get out of it. My code looks like the following:

    from stable_baselines3 import SAC, HerReplayBuffer
    from stable_baselines3.common.vec_env import VecNormalize

    env = MyEnv()  # custom goal env, defined elsewhere

    goal_selection_strategy = "future"
    model_class = SAC

    model = model_class(policy = 'MultiInputPolicy', 
                        env = env, 
                        learning_rate = 2e-4, 
                        buffer_size = 1_000_000,
                        learning_starts = 200,
                        batch_size = 1,
                        tau = 0.95,
                        gamma = 0.98,
                        replay_buffer_class = HerReplayBuffer,
                        replay_buffer_kwargs = dict(n_sampled_goal = 4, goal_selection_strategy = goal_selection_strategy,),
                        ent_coef = 0.3, 
                        device = "cuda", 
                        verbose = 2)

    model = model.learn(total_timesteps = 10_000, log_interval = 1)

    vec_env = model.get_env()
    vec_env = VecNormalize(venv = vec_env, training = True, norm_obs = False, norm_reward = True, clip_obs = 200.0, clip_reward = 1.)

    for i in range(2):

        model.learn(total_timesteps = 10_000, log_interval = 1, reset_num_timesteps = False, tb_log_name = "SAC", progress_bar = False)

The difference now is that I get a vectorized environment from the model itself:

    vec_env = model.get_env()
    vec_env = VecNormalize(venv = vec_env, training = True, norm_obs = False, norm_reward = True, clip_obs = 200.0, clip_reward = 1.)

in the hope that it takes care of the vectorization of all the methods.

Then I almost copied the compute_reward() method from the panda-gym environment:

def compute_reward(self, achieved_goal, desired_goal, info):
    """
    This is an abstract method from Gymnasium:
    https://github.com/Farama-Foundation/Gymnasium-Robotics/blob/a35b1c1fa669428bf640a2c7101e66eb1627ac3a/gym_robotics/core.py#L8
    """
    self.distance = np.linalg.norm(np.array(achieved_goal, dtype = np.float32) - np.array(desired_goal, dtype = np.float32))

    reward = -np.array(self.distance > 0.05, dtype = np.float32)

    return reward

but the result is still the following:

Traceback (most recent call last):
  File "/home/ubuntu/workspace/src/simulator/webots/controllers/supervisor_controller/supervisor_controller.py", line 356, in <module>
    main()
  File "/home/ubuntu/workspace/src/simulator/webots/controllers/supervisor_controller/supervisor_controller.py", line 342, in main
    model = model.learn(total_timesteps = 10_000, log_interval = 1)
  File "/home/ubuntu/Downloads/deepbots-env/lib/python3.10/site-packages/stable_baselines3/sac/sac.py", line 307, in learn
    return super().learn(
  File "/home/ubuntu/Downloads/deepbots-env/lib/python3.10/site-packages/stable_baselines3/common/off_policy_algorithm.py", line 347, in learn
    self.train(batch_size=self.batch_size, gradient_steps=gradient_steps)
  File "/home/ubuntu/Downloads/deepbots-env/lib/python3.10/site-packages/stable_baselines3/sac/sac.py", line 254, in train
    target_q_values = replay_data.rewards + (1 - replay_data.dones) * self.gamma * next_q_values
RuntimeError: The size of tensor a (14) must match the size of tensor b (64) at non-singleton dimension 0

Still, the meaning of the error does not make sense to me.

Since I passed a batch size of 64, one tensor will have a batch size of 64. That makes perfect sense to me: 64 "forward passes" are generated and the data is collected. But then there is a tensor with a batch size of 14, and I do not know where it comes from. It cannot be the rewards, because given a batch size of 64, compute_reward should run 64 times. It cannot be the next_q_values either, because of the batch_size passed to the class constructor; even in this case I would expect 64.

Any help? Thanks

wilhem commented 2 days ago

By the way, even my get_observations() method is almost the same as the one from panda-gym.

def get_observations(self):
    obs = {}

    x = self.x0 + self.arm1_sensor.getValue() + self.arm2_sensor.getValue() + self.arm3_sensor.getValue() + self.arm4_sensor.getValue()

    y = self.rotation_sensor.getValue()
    z = self.inclination_sensor.getValue()

    endpoint = self.endpoint.getPosition()
    target = self.target.getPosition()

    obs['observation'] = np.array([x, y, z], dtype = np.float32)
    obs['achieved_goal'] = np.array(endpoint, dtype = np.float32)
    obs['desired_goal'] = np.array(target, dtype = np.float32)

    return obs

qgallouedec commented 2 days ago

No, you don't need to implement a vectorized environment. The only requirement for your environment to be compatible with HER is to have a vectorized compute_reward method.

def _compute_reward_not_vectorized(self, achieved_goal, desired_goal, info={}):
    ...

def compute_reward(self, achieved_goal, desired_goal, info={}):
    if achieved_goal.ndim == 2:  # input is a batch of goals
        rewards = []
        for i in range(len(achieved_goal)):
            rewards.append(self._compute_reward_not_vectorized(achieved_goal[i], desired_goal[i]))
        return np.array(rewards)
    else:
        return self._compute_reward_not_vectorized(achieved_goal, desired_goal)

EDIT: updated the code to handle the non-batched case

qgallouedec commented 2 days ago

Also, since you're developing your own custom env, you need to use the env checker, and only once the checks pass can you try training.
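
For reference, a minimal sketch of running the checker (assuming the MyEnv class from the earlier snippet):

    from stable_baselines3.common.env_checker import check_env

    env = MyEnv()  # the custom goal env from the earlier snippet
    check_env(env, warn=True)  # raises an assertion error if the env violates the expected API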

wilhem commented 2 days ago

Thanks, but I tried everything (even the solution you posted). But now it says:

Traceback (most recent call last):
  File "/home/ubuntu/workspace/src/simulator/webots/controllers/supervisor_controller/supervisor_controller.py", line 372, in <module>
    main()
  File "/home/ubuntu/workspace/src/simulator/webots/controllers/supervisor_controller/supervisor_controller.py", line 338, in main
    print(check_env(env))
  File "/home/ubuntu/Downloads/deepbots-env/lib/python3.10/site-packages/stable_baselines3/common/env_checker.py", line 473, in check_env
    _check_returned_values(env, observation_space, action_space)
  File "/home/ubuntu/Downloads/deepbots-env/lib/python3.10/site-packages/stable_baselines3/common/env_checker.py", line 322, in _check_returned_values
    _check_goal_env_compute_reward(obs, env, float(reward), info)
TypeError: only length-1 arrays can be converted to Python scalars

Now, looking at the line: _check_goal_env_compute_reward(obs, env, float(reward), info)

it expects float(reward) as an argument and not an np.array. So I think there is a bug somewhere, or check_env does not match the latest SB3 version.

UPDATE: Here is the env_checker.py installed on my system:

def _check_goal_env_compute_reward(
    obs: Dict[str, Union[np.ndarray, int]],
    env: gym.Env,
    reward: float,
    info: Dict[str, Any],
) -> None:
    """
    Check that reward is computed with `compute_reward`
    and that the implementation is vectorized.
    """
    achieved_goal, desired_goal = obs["achieved_goal"], obs["desired_goal"]
    assert reward == env.compute_reward(  # type: ignore[attr-defined]
        achieved_goal, desired_goal, info
    ), "The reward was not computed with `compute_reward()`"

    achieved_goal, desired_goal = np.array(achieved_goal), np.array(desired_goal)
    batch_achieved_goals = np.array([achieved_goal, achieved_goal])
    batch_desired_goals = np.array([desired_goal, desired_goal])
    if isinstance(achieved_goal, int) or len(achieved_goal.shape) == 0:
        batch_achieved_goals = batch_achieved_goals.reshape(2, 1)
        batch_desired_goals = batch_desired_goals.reshape(2, 1)
    batch_infos = np.array([info, info])
    rewards = env.compute_reward(batch_achieved_goals, batch_desired_goals, batch_infos)  # type: ignore[attr-defined]
    assert rewards.shape == (2,), f"Unexpected shape for vectorized computation of reward: {rewards.shape} != (2,)"
    assert rewards[0] == reward, f"Vectorized computation of reward differs from single computation: {rewards[0]} != {reward}"
qgallouedec commented 2 days ago

You need to consider the non-batched case as well. I've updated the code example.

wilhem commented 2 days ago

This vectorization is not as easy as expected. I made every change you suggested, but now I get the following error:

assert rewards.shape == (2,), f"Unexpected shape for vectorized computation of reward: {rewards.shape} != (2,)"
AssertionError: Unexpected shape for vectorized computation of reward: () != (2,)

It is still not clear to me what triggers that problem. The reward will be a vector and no longer a scalar: ok. But my compute_reward method is no different from other implementations. What is the problem here?

qgallouedec commented 2 days ago

It's hard to say, because you're giving a traceback without any code. In general, we try to work from an MRE. In fact, it's quite simple: when the compute_reward method takes a batch of observations as input, it must return an array of rewards; when it receives a single observation, it must return the reward as a float.

wilhem commented 2 days ago

Thank you very much for your patience. I understand the concept, but if it outputs a float, then I get one error, and if it is an array, then another error. For instance, if I return a float, then I get:

  File "/home/ubuntu/Downloads/deepbots-env/lib/python3.10/site-packages/stable_baselines3/common/env_checker.py", line 189, in _check_goal_env_compute_reward
    assert rewards.shape == (2,), f"Unexpected shape for vectorized computation of reward: {rewards.shape} != (2,)"
AttributeError: 'int' object has no attribute 'shape'

My code is the following:

def _compute_reward_not_vectorized(self, achieved_goal, desired_goal, info = {}):
        """
        This is an abstract method from Gymnasium:
        https://github.com/Farama-Foundation/Gymnasium-Robotics/blob/a35b1c1fa669428bf640a2c7101e66eb1627ac3a/gym_robotics/core.py#L8
        """
        self.distance = np.linalg.norm(np.array(achieved_goal, dtype = np.float32) - np.array(desired_goal, dtype = np.float32))

        if abs(self.distance) < 0.05:
            reward = 1
        else:
            reward = 0

        return reward

def compute_reward(self, achieved_goal, desired_goal, info = {}):

        if any(isinstance(elem, list) for elem in achieved_goal):

            rewards = []

            for i in range(len(achieved_goal)):
                rewards.append(self._compute_reward_not_vectorized(achieved_goal[i], desired_goal[i]))

            rewards = np.array(rewards, dtype = np.float32).reshape(len(rewards), 1)

            return rewards

        else:

            rewards = self._compute_reward_not_vectorized(achieved_goal, desired_goal)

            return rewards

Now, changing the last line to:

        rewards = self._compute_reward_not_vectorized(achieved_goal, desired_goal)

then I get the following error:

assert rewards.shape == (2,), f"Unexpected shape for vectorized computation of reward: {rewards.shape} != (2,)"
AssertionError: Unexpected shape for vectorized computation of reward: (1, 1) != (2,)
qgallouedec commented 2 days ago

> I understood the concept, but if it outputs a float, then I get one error, if it is an array, then another error.

Ok, I think you don't get my point: the function should handle both cases. That's what we mean by vectorized.

qgallouedec commented 2 days ago

import numpy as np

def _compute_reward_not_vectorized(achieved_goal, desired_goal, info = {}):
    distance = np.linalg.norm(np.array(achieved_goal) - np.array(desired_goal))
    if abs(distance) < 0.05:
        reward = 1
    else:
        reward = 0
    return reward

def compute_reward(achieved_goal, desired_goal, info = {}):
    if len(achieved_goal.shape) == 2:
        rewards = []
        for i in range(len(achieved_goal)):
            rewards.append(_compute_reward_not_vectorized(achieved_goal[i], desired_goal[i]))
        rewards = np.array(rewards)
        return rewards
    else:
        rewards = _compute_reward_not_vectorized(achieved_goal, desired_goal)
        return rewards

achieved_goal = np.array([0.1, 0.1, 0.1])
desired_goal = np.array([0.1, 0.1, 0.1])
print(compute_reward(achieved_goal, desired_goal)) # 1

achieved_goal = np.array([[0.1, 0.1, 0.1], [0.2, 0.2, 0.2]])
desired_goal = np.array([[0.1, 0.1, 0.1], [0.3, 0.3, 0.3]])
print(compute_reward(achieved_goal, desired_goal)) # [1 0]

Edit: use arrays
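
For what it's worth, a sketch (not from the thread) of a fully vectorized variant that avoids the Python loop: with axis=-1, np.linalg.norm returns a scalar distance for a single goal and one distance per row for a batch, so the same expression covers both cases (the single-goal call returns a 0-d NumPy array, which float() also accepts):

    import numpy as np

    def compute_reward(achieved_goal, desired_goal, info=None):
        # scalar for a single goal of shape (3,), shape (batch_size,) for a batch of shape (batch_size, 3)
        distance = np.linalg.norm(np.asarray(achieved_goal) - np.asarray(desired_goal), axis=-1)
        # sparse reward, same convention as above: 1.0 within 0.05 of the target, 0.0 otherwise
        return np.array(distance < 0.05, dtype=np.float32)

    print(compute_reward(np.array([0.1, 0.1, 0.1]), np.array([0.1, 0.1, 0.1])))  # 1.0

    print(compute_reward(np.array([[0.1, 0.1, 0.1], [0.2, 0.2, 0.2]]),
                         np.array([[0.1, 0.1, 0.1], [0.3, 0.3, 0.3]])))  # [1. 0.]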

wilhem commented 2 days ago

Can you run check_env() on it? Your example works on my PC, but running check_env() outputs the following error:

assert rewards.shape == (2,), f"Unexpected shape for vectorized computation of reward: {rewards.shape} != (2,)"
AttributeError: 'int' object has no attribute 'shape'

This is the point: a float is expected when the reward is a single float, but at the same time it expects an object with shape (2,). This is the problem.

wilhem commented 2 days ago

I reopened the question, since the problem persists and my supposed solution led to more errors later. The question is: how is it possible to expect that a float has a shape of (2,)?

wilhem commented 2 days ago

Solved! For anyone who has the same issue: the problem is in this checking function:

def _check_goal_env_compute_reward(
    obs: Dict[str, Union[np.ndarray, int]],
    env: gym.Env,
    reward: float,
    info: Dict[str, Any],
) -> None:
    """
    Check that reward is computed with `compute_reward`
    and that the implementation is vectorized.
    """
    achieved_goal, desired_goal = obs["achieved_goal"], obs["desired_goal"]
    assert reward == env.compute_reward(  # type: ignore[attr-defined]
        achieved_goal, desired_goal, info
    ), "The reward was not computed with `compute_reward()`"

    achieved_goal, desired_goal = np.array(achieved_goal), np.array(desired_goal)
    batch_achieved_goals = np.array([achieved_goal, achieved_goal])
    batch_desired_goals = np.array([desired_goal, desired_goal])
    if isinstance(achieved_goal, int) or len(achieved_goal.shape) == 0:
        batch_achieved_goals = batch_achieved_goals.reshape(2, 1)
        batch_desired_goals = batch_desired_goals.reshape(2, 1)
    batch_infos = np.array([info, info])
    rewards = env.compute_reward(batch_achieved_goals, batch_desired_goals, batch_infos)  # type: ignore[attr-defined] 
    assert rewards.shape == (2,), f"Unexpected shape for vectorized computation of reward: {rewards.shape} != (2,)"
    assert rewards[0] == reward, f"Vectorized computation of reward differs from single computation: {rewards[0]} != {reward}"

In the first pass, only obs['achieved_goal'] and obs['desired_goal'] are passed, so they are lists. But in the second pass, they are cast into np.arrays. In short: in your compute_reward, you should check whether you are getting lists or numpy arrays.

qgallouedec commented 1 day ago

Thanks for sharing, but I would disagree: goals can't be lists, since the observation space (the goal keys) is a Box. They must be arrays in all cases.
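
To illustrate the point (a sketch with illustrative shapes, not code from the thread): in a goal-conditioned env the observation space is a Dict whose goal keys are Box spaces, so achieved_goal and desired_goal are always numpy arrays, never Python lists.

    import numpy as np
    from gymnasium import spaces

    # 3-dimensional goals are only an example; the real shapes depend on the env
    observation_space = spaces.Dict(
        {
            "observation": spaces.Box(-np.inf, np.inf, shape=(3,), dtype=np.float32),
            "achieved_goal": spaces.Box(-np.inf, np.inf, shape=(3,), dtype=np.float32),
            "desired_goal": spaces.Box(-np.inf, np.inf, shape=(3,), dtype=np.float32),
        }
    )

    sample = observation_space.sample()
    print(type(sample["achieved_goal"]))  # <class 'numpy.ndarray'>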