DLR-RM / stable-baselines3

PyTorch version of Stable Baselines, reliable implementations of reinforcement learning algorithms.
https://stable-baselines3.readthedocs.io
MIT License

Issue when using HER in a custom environment #1958

Closed (wilhem closed this issue 1 day ago)

wilhem commented 3 days ago

🐛 Bug

I'm trying to use SAC + HER (both from SB3) with a custom environment.

My environment has the following observation and space definitions:

self.observation_space = Dict({'observation': Box(low = np.array([-np.radians(MAX_ANGLE), 0.0, 2.63]),
                                                  high = np.array([np.radians(MAX_ANGLE), np.radians(70), 10.35]),
                                                  shape = (3,), dtype = np.float32),
                               'achieved_goal': Box(low = np.array([-np.inf, -np.inf, -np.inf]),
                                                    high = np.array([np.inf, np.inf, np.inf]),
                                                    shape = (3,), dtype = np.float32),
                               'desired_goal': Box(low = np.array([-np.inf, -np.inf, -np.inf]),
                                                   high = np.array([np.inf, np.inf, np.inf]),
                                                   shape = (3,), dtype = np.float32)
                               })

self.action_space = Box(low = np.array([-1.0, -1.0, -1.0]),
                        high = np.array([1.0, 1.0, 1.0]),
                        shape = (3,), dtype = np.float32)

and, as written here, I defined the following methods:

    def compute_reward(self, achieved_goal, desired_goal, info):
        """
        This is an abstract method from Gymnasium:
        https://github.com/Farama-Foundation/Gymnasium-Robotics/blob/a35b1c1fa669428bf640a2c7101e66eb1627ac3a/gym_robotics/core.py#L8
        """
        self.distance = np.linalg.norm(np.array(achieved_goal, dtype = np.float32) - np.array(desired_goal, dtype = np.float32))

        if abs(self.distance) < 0.05:
            reward = np.array(1, dtype = np.float32)
        else:
            reward = np.array(1, dtype = np.float32) 

        return reward

    def compute_terminated(self, achieved_goal, desired_goal, info):
        """
        This is an abstract method from Gymnasium:
        https://github.com/Farama-Foundation/Gymnasium-Robotics/blob/a35b1c1fa669428bf640a2c7101e66eb1627ac3a/gym_robotics/core.py#L8
        """
        boom_extension = self.boom_extension_start + self.arm1_sensor.getValue() + self.arm2_sensor.getValue() + self.arm3_sensor.getValue() + self.arm4_sensor.getValue()

        if boom_extension < self.boom_extension_start or boom_extension > self.boom_extension_max:
            return np.array(1, dtype = np.float32) 

        if abs(self.distance) < 0.05:
            return np.array(1, dtype = np.float32)

        return np.array(0, dtype = np.float32)

    def compute_truncated(self, achieved_goal, desired_goal, info):
        pass

The problem is that running the following code:

    env = MyEnv()

    goal_selection_strategy = "future"
    model_class = SAC

    model = model_class(policy = 'MultiInputPolicy', 
                        env = env, 
                        learning_rate = 2e-4, 
                        learning_starts = 1_000,
                        batch_size = 64,
                        tau = 0.001,
                        gamma = 0.98,
                        replay_buffer_class = HerReplayBuffer,
                        replay_buffer_kwargs = dict(n_sampled_goal = 4,
                                                    goal_selection_strategy = goal_selection_strategy,),
                        ent_coef = 0.4, 
                        device = "cuda", 
                        verbose = 0)

    model = model.learn(total_timesteps = 10_000, log_interval = 1)

gives me the following error:

/home/ubuntu/Downloads/deepbots-env/lib/python3.10/site-packages/gym/spaces/box.py:73: UserWarning: WARN: Box bound precision lowered by casting to float32
  logger.warn(
/home/ubuntu/Downloads/deepbots-env/lib/python3.10/site-packages/stable_baselines3/common/vec_env/patch_gym.py:49: UserWarning: You provided an OpenAI Gym environment. We strongly recommend transitioning to Gymnasium environments. Stable-Baselines3 is automatically wrapping your environments in a compatibility layer, which could potentially cause issues.
  warnings.warn(
/home/ubuntu/Downloads/deepbots-env/lib/python3.10/site-packages/gymnasium/core.py:311: UserWarning: WARN: env.compute_reward to get variables from other wrappers is deprecated and will be removed in v1.0, to get this variable you can do `env.unwrapped.compute_reward` for environment variables or `env.get_wrapper_attr('compute_reward')` that will search the reminding wrappers.
  logger.warn(
Traceback (most recent call last):
  File "/home/ubuntu/workspace/src/simulator/webots/controllers/supervisor_controller/supervisor_controller.py", line 380, in <module>
    main()
  File "/home/ubuntu/workspace/src/simulator/webots/controllers/supervisor_controller/supervisor_controller.py", line 336, in main
    model = model.learn(total_timesteps = 10_000, log_interval = 1)
  File "/home/ubuntu/Downloads/deepbots-env/lib/python3.10/site-packages/stable_baselines3/sac/sac.py", line 307, in learn
    return super().learn(
  File "/home/ubuntu/Downloads/deepbots-env/lib/python3.10/site-packages/stable_baselines3/common/off_policy_algorithm.py", line 347, in learn
    self.train(batch_size=self.batch_size, gradient_steps=gradient_steps)
  File "/home/ubuntu/Downloads/deepbots-env/lib/python3.10/site-packages/stable_baselines3/sac/sac.py", line 254, in train
    target_q_values = replay_data.rewards + (1 - replay_data.dones) * self.gamma * next_q_values
RuntimeError: The size of tensor a (14) must match the size of tensor b (64) at non-singleton dimension 0
WARNING: 'supervisor_controller' controller exited with status: 1.

Interestingly, I found another user with a similar problem here: link. Running his minimal code leads to the following error:

/home/ubuntu/Downloads/deepbots-env/lib/python3.10/site-packages/stable_baselines3/common/vec_env/patch_gym.py:49: UserWarning: You provided an OpenAI Gym environment. We strongly recommend transitioning to Gymnasium environments. Stable-Baselines3 is automatically wrapping your environments in a compatibility layer, which could potentially cause issues.
  warnings.warn(
/home/ubuntu/Downloads/deepbots-env/lib/python3.10/site-packages/gymnasium/core.py:311: UserWarning: WARN: env.compute_reward to get variables from other wrappers is deprecated and will be removed in v1.0, to get this variable you can do `env.unwrapped.compute_reward` for environment variables or `env.get_wrapper_attr('compute_reward')` that will search the reminding wrappers.
  logger.warn(
Traceback (most recent call last):
  File "/home/ubuntu/workspace/src/supervisor_controller.py", line 60, in <module>
    model.learn(3000, progress_bar=False)
  File "/home/ubuntu/Downloads/deepbots-env/lib/python3.10/site-packages/stable_baselines3/sac/sac.py", line 307, in learn
    return super().learn(
  File "/home/ubuntu/Downloads/deepbots-env/lib/python3.10/site-packages/stable_baselines3/common/off_policy_algorithm.py", line 347, in learn
    self.train(batch_size=self.batch_size, gradient_steps=gradient_steps)
  File "/home/ubuntu/Downloads/deepbots-env/lib/python3.10/site-packages/stable_baselines3/sac/sac.py", line 254, in train
    target_q_values = replay_data.rewards + (1 - replay_data.dones) * self.gamma * next_q_values
RuntimeError: The size of tensor a (53) must match the size of tensor b (256) at non-singleton dimension 0

What should I do? Where does the size of tensor a (14) come from? 64 is the batch size, but what is 14?
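One plausible source of the mismatch (an assumption, not confirmed in this thread): HerReplayBuffer relabels a fraction of each sampled batch and re-computes rewards for those transitions by calling compute_reward with batched achieved_goal and desired_goal arrays, expecting one reward per row. The implementation above takes a single np.linalg.norm over the whole batch and therefore returns a scalar, so the relabeled part of the reward array collapses to one value. With n_sampled_goal = 4 about 80% of a batch is relabeled, which would turn a batch of 64 rewards into roughly 13 real rewards plus 1 scalar, i.e. a tensor of size 14 (and 256 into 53 in the linked example). A minimal vectorized sketch, assuming goals can arrive either as a single goal of shape (3,) or as a batch of shape (batch_size, 3), and with an illustrative sparse reward rather than the original one:

    def compute_reward(self, achieved_goal, desired_goal, info):
        # Sketch only: the reward values and the 0.05 threshold are illustrative,
        # not taken from the original environment.
        achieved_goal = np.asarray(achieved_goal, dtype=np.float32)
        desired_goal = np.asarray(desired_goal, dtype=np.float32)
        # Norm over the last axis only, so batched goals yield one distance per row
        distance = np.linalg.norm(achieved_goal - desired_goal, axis=-1)
        # Sparse reward: 0.0 within the threshold, -1.0 otherwise
        return -(distance >= 0.05).astype(np.float32)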

Code example

import gymnasium as gym
import numpy as np
from gymnasium import spaces

from stable_baselines3 import A2C
from stable_baselines3.common.env_checker import check_env

class CustomEnv(gym.Env):

    def __init__(self):
        super().__init__()
        self.observation_space = spaces.Box(low=-np.inf, high=np.inf, shape=(14,))
        self.action_space = spaces.Box(low=-1, high=1, shape=(6,))

    def reset(self, seed=None, options=None):
        return self.observation_space.sample(), {}

    def step(self, action):
        obs = self.observation_space.sample()
        reward = 1.0
        terminated = False
        truncated = False
        info = {}
        return obs, reward, terminated, truncated, info

env = CustomEnv()
check_env(env)

model = A2C("MlpPolicy", env, verbose=1).learn(1000)

Relevant log output / Error message

/home/ubuntu/Downloads/deepbots-env/lib/python3.10/site-packages/gym/spaces/box.py:73: UserWarning: WARN: Box bound precision lowered by casting to float32
  logger.warn(
/home/ubuntu/Downloads/deepbots-env/lib/python3.10/site-packages/stable_baselines3/common/vec_env/patch_gym.py:49: UserWarning: You provided an OpenAI Gym environment. We strongly recommend transitioning to Gymnasium environments. Stable-Baselines3 is automatically wrapping your environments in a compatibility layer, which could potentially cause issues.
  warnings.warn(
/home/ubuntu/Downloads/deepbots-env/lib/python3.10/site-packages/gymnasium/core.py:311: UserWarning: WARN: env.compute_reward to get variables from other wrappers is deprecated and will be removed in v1.0, to get this variable you can do `env.unwrapped.compute_reward` for environment variables or `env.get_wrapper_attr('compute_reward')` that will search the reminding wrappers.
  logger.warn(
Traceback (most recent call last):
  File "/home/ubuntu/workspace/src/simulator/webots/controllers/supervisor_controller/supervisor_controller.py", line 380, in <module>
    main()
  File "/home/ubuntu/workspace/src/simulator/webots/controllers/supervisor_controller/supervisor_controller.py", line 336, in main
    model = model.learn(total_timesteps = 10_000, log_interval = 1)
  File "/home/ubuntu/Downloads/deepbots-env/lib/python3.10/site-packages/stable_baselines3/sac/sac.py", line 307, in learn
    return super().learn(
  File "/home/ubuntu/Downloads/deepbots-env/lib/python3.10/site-packages/stable_baselines3/common/off_policy_algorithm.py", line 347, in learn
    self.train(batch_size=self.batch_size, gradient_steps=gradient_steps)
  File "/home/ubuntu/Downloads/deepbots-env/lib/python3.10/site-packages/stable_baselines3/sac/sac.py", line 254, in train
    target_q_values = replay_data.rewards + (1 - replay_data.dones) * self.gamma * next_q_values
RuntimeError: The size of tensor a (14) must match the size of tensor b (64) at non-singleton dimension 0
WARNING: 'supervisor_controller' controller exited with status: 1.


### System Info

- OS: Linux-6.5.0-41-generic-x86_64-with-glibc2.35 # 41~22.04.2-Ubuntu SMP PREEMPT_DYNAMIC Mon Jun  3 11:32:55 UTC 2
- Python: 3.10.12
- Stable-Baselines3: 2.3.2
- PyTorch: 2.3.1+cu121
- GPU Enabled: True
- Numpy: 2.0.0
- Cloudpickle: 3.0.0
- Gymnasium: 0.29.1
- OpenAI Gym: 0.21.0

### Checklist

- [X] I have checked that there is no similar [issue](https://github.com/DLR-RM/stable-baselines3/issues) in the repo
- [X] I have read the [documentation](https://stable-baselines3.readthedocs.io/en/master/)
- [X] I have provided a [minimal and working](https://github.com/DLR-RM/stable-baselines3/issues/982#issuecomment-1197044014) example to reproduce the bug
- [X] I have checked my env using the env checker
- [X] I've used the [markdown code blocks](https://help.github.com/en/articles/creating-and-highlighting-code-blocks) for both code and stack traces.
wilhem commented 3 days ago

I forgot to add the output of check_env()

Traceback (most recent call last):
  File "/home/ubuntu/workspace/src/simulator/webots/controllers/supervisor_controller/supervisor_controller.py", line 383, in <module>
    main()
  File "/home/ubuntu/workspace/src/simulator/webots/controllers/supervisor_controller/supervisor_controller.py", line 319, in main
    print(check_env(env))
  File "/home/ubuntu/Downloads/deepbots-env/lib/python3.10/site-packages/stable_baselines3/common/env_checker.py", line 421, in check_env
    assert isinstance(
AssertionError: Your environment must inherit from the gymnasium.Env class cf. https://gymnasium.farama.org/api/env/

My custom environment has the following structure:

File 1:

import gym
from controller import Supervisor

class DeepbotsSupervisorEnv(Supervisor, gym.Env):

File 2:

from deepbots.supervisor.controllers.deepbots_supervisor_env import DeepbotsSupervisorEnv

class RobotSupervisorEnv(DeepbotsSupervisorEnv):

File 3:

from deepbots.supervisor.controllers.robot_supervisor_env import RobotSupervisorEnv

class MyEnv(RobotSupervisorEnv):

env = MyEnv()

So gym.Env should be seen by the checker. Or not?

(I removed all the constructor code from the examples.)
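A quick sketch of why the checker rejects this layout (assuming File 1 really imports the legacy gym package, as shown above): gymnasium.Env and gym.Env are unrelated classes, so an environment built on gym.Env never passes the isinstance assertion in check_env.

import gym
import gymnasium

class LegacyEnv(gym.Env):  # stands in for DeepbotsSupervisorEnv and its subclasses
    pass

env = LegacyEnv()
print(isinstance(env, gym.Env))        # True
print(isinstance(env, gymnasium.Env))  # False -> this is the assertion that fails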

qgallouedec commented 3 days ago

Hey, gym isn't gymnasium. First thing to do is the migration. Check https://gymnasium.farama.org/content/migration-guide/
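For reference, a minimal sketch of the gymnasium-style API the migration guide asks for (class name and bodies are illustrative, not the actual deepbots classes): the env inherits from gymnasium.Env, reset accepts seed and options and returns (observation, info), and step returns five values with separate terminated and truncated flags.

import gymnasium as gym
import numpy as np
from gymnasium import spaces

class MigratedEnv(gym.Env):  # gymnasium.Env, not the legacy gym.Env
    def __init__(self):
        super().__init__()
        self.observation_space = spaces.Box(low=-1.0, high=1.0, shape=(3,), dtype=np.float32)
        self.action_space = spaces.Box(low=-1.0, high=1.0, shape=(3,), dtype=np.float32)

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)  # seeds the internal RNG
        return self.observation_space.sample(), {}  # (observation, info)

    def step(self, action):
        obs = self.observation_space.sample()
        # (observation, reward, terminated, truncated, info)
        return obs, 0.0, False, False, {}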

wilhem commented 3 days ago

Thanks... due to the migration I'm stuck here

qgallouedec commented 3 days ago

What do you mean? Is your environment integrated with gymnasium now?

wilhem commented 3 days ago

Not yet

araffin commented 1 day ago

duplicate of https://github.com/DLR-RM/stable-baselines3/issues/1959 ?

qgallouedec commented 1 day ago

Yes