I'm still fighting with this problem and I do not understand how to get out of it. My code looks like the following:
env = MyEnv()
goal_selection_strategy = "future"
model_class = SAC
model = model_class(policy = 'MultiInputPolicy',
                    env = env,
                    learning_rate = 2e-4,
                    buffer_size = 1_000_000,
                    learning_starts = 200,
                    batch_size = 1,
                    tau = 0.95,
                    gamma = 0.98,
                    replay_buffer_class = HerReplayBuffer,
                    replay_buffer_kwargs = dict(n_sampled_goal = 4, goal_selection_strategy = goal_selection_strategy,),
                    ent_coef = 0.3,
                    device = "cuda",
                    verbose = 2)
model = model.learn(total_timesteps = 10_000, log_interval = 1)
vec_env = model.get_env()
vec_env = VecNormalize(venv = vec_env, training = True, norm_obs = False, norm_reward = True, clip_obs = 200.0, clip_reward = 1.)
for i in range(2):
    model.learn(total_timesteps = 10_000, log_interval = 1, reset_num_timesteps = False, tb_log_name = "SAC", progress_bar = False)
The difference now is that I get a vectorized environment from the model itself:
vec_env = model.get_env()
vec_env = VecNormalize(venv = vec_env, training = True, norm_obs = False, norm_reward = True, clip_obs = 200.0, clip_reward = 1.)
in the hope that it takes care of vectorizing all the methods.
Then I almost copied the compute_reward() method from the panda-gym environment:
def compute_reward(self, achieved_goal, desired_goal, info):
    """
    This is an abstract method from Gymnasium:
    https://github.com/Farama-Foundation/Gymnasium-Robotics/blob/a35b1c1fa669428bf640a2c7101e66eb1627ac3a/gym_robotics/core.py#L8
    """
    self.distance = np.linalg.norm(np.array(achieved_goal, dtype = np.float32) - np.array(desired_goal, dtype = np.float32))
    reward = -np.array(self.distance > 0.05, dtype = np.float32)
    return reward
but the result is still the following:
Traceback (most recent call last):
File "/home/ubuntu/workspace/src/simulator/webots/controllers/supervisor_controller/supervisor_controller.py", line 356, in <module>
main()
File "/home/ubuntu/workspace/src/simulator/webots/controllers/supervisor_controller/supervisor_controller.py", line 342, in main
model = model.learn(total_timesteps = 10_000, log_interval = 1)
File "/home/ubuntu/Downloads/deepbots-env/lib/python3.10/site-packages/stable_baselines3/sac/sac.py", line 307, in learn
return super().learn(
File "/home/ubuntu/Downloads/deepbots-env/lib/python3.10/site-packages/stable_baselines3/common/off_policy_algorithm.py", line 347, in learn
self.train(batch_size=self.batch_size, gradient_steps=gradient_steps)
File "/home/ubuntu/Downloads/deepbots-env/lib/python3.10/site-packages/stable_baselines3/sac/sac.py", line 254, in train
target_q_values = replay_data.rewards + (1 - replay_data.dones) * self.gamma * next_q_values
RuntimeError: The size of tensor a (14) must match the size of tensor b (64) at non-singleton dimension 0
The meaning of the error still does not make sense to me.
Since I passed a batch size of 64, one tensor will have a batch dimension of 64. That makes perfect sense: to me, 64 "forward passes" are generated and the data collected.
Then there is a tensor with a batch size of 14, and I do not know where it comes from. It cannot be the rewards because, given a batch size of 64, compute_reward should run 64 times. But it cannot be next_q_values either, because of the batch_size passed to the class constructor. Even in this case, I would expect 64.
Any help? Thanks
By the way, even my get_observations() method is almost the same as the method from panda-gym.
def get_observations(self):
    obs = {}
    x = self.x0 + self.arm1_sensor.getValue() + self.arm2_sensor.getValue() + self.arm3_sensor.getValue() + self.arm4_sensor.getValue()
    y = self.rotation_sensor.getValue()
    z = self.inclination_sensor.getValue()
    endpoint = self.endpoint.getPosition()
    target = self.target.getPosition()
    obs['observation'] = np.array([x, y, z], dtype = np.float32)
    obs['achieved_goal'] = np.array(endpoint, dtype = np.float32)
    obs['desired_goal'] = np.array(target, dtype = np.float32)
    return obs
No, you don't need to implement a vectorized environment. The only requirement for your environment to be compatible with HER is to have a vectorized compute_reward method.
def _compute_reward_not_vectorized(self, achieved_goal, desired_goal, info={}):
    ...

def compute_reward(self, achieved_goal, desired_goal, info={}):
    # Batched goals carry a leading batch dimension, e.g. shape (batch_size, goal_dim).
    # This check assumes that a single goal is a 1-D vector.
    input_batched = np.ndim(achieved_goal) == 2
    if input_batched:
        rewards = []
        for i in range(len(achieved_goal)):
            rewards.append(self._compute_reward_not_vectorized(achieved_goal[i], desired_goal[i]))
        return np.array(rewards)
    else:
        return self._compute_reward_not_vectorized(achieved_goal, desired_goal)
EDIT: updated the code to handle non-batched case
Also, since you're developing your custom env, you need to use the env checker, and only once the checks pass, you can try training.
Thanks, but I tried everything (even the solution you posted). But now it says:
Traceback (most recent call last):
File "/home/ubuntu/workspace/src/simulator/webots/controllers/supervisor_controller/supervisor_controller.py", line 372, in <module>
main()
File "/home/ubuntu/workspace/src/simulator/webots/controllers/supervisor_controller/supervisor_controller.py", line 338, in main
print(check_env(env))
File "/home/ubuntu/Downloads/deepbots-env/lib/python3.10/site-packages/stable_baselines3/common/env_checker.py", line 473, in check_env
_check_returned_values(env, observation_space, action_space)
File "/home/ubuntu/Downloads/deepbots-env/lib/python3.10/site-packages/stable_baselines3/common/env_checker.py", line 322, in _check_returned_values
_check_goal_env_compute_reward(obs, env, float(reward), info)
TypeError: only length-1 arrays can be converted to Python scalars
Now, looking at the line:
_check_goal_env_compute_reward(obs, env, float(reward), info)
it expects float(reward) as an argument and not an np.array.
So I think there is a bug somewhere, or check_env does not match the latest SB3 version.
UPDATE:
Here is the env_checker.py installed on my system:
def _check_goal_env_compute_reward(
    obs: Dict[str, Union[np.ndarray, int]],
    env: gym.Env,
    reward: float,
    info: Dict[str, Any],
) -> None:
    """
    Check that reward is computed with `compute_reward`
    and that the implementation is vectorized.
    """
    achieved_goal, desired_goal = obs["achieved_goal"], obs["desired_goal"]
    assert reward == env.compute_reward(  # type: ignore[attr-defined]
        achieved_goal, desired_goal, info
    ), "The reward was not computed with `compute_reward()`"
    achieved_goal, desired_goal = np.array(achieved_goal), np.array(desired_goal)
    batch_achieved_goals = np.array([achieved_goal, achieved_goal])
    batch_desired_goals = np.array([desired_goal, desired_goal])
    if isinstance(achieved_goal, int) or len(achieved_goal.shape) == 0:
        batch_achieved_goals = batch_achieved_goals.reshape(2, 1)
        batch_desired_goals = batch_desired_goals.reshape(2, 1)
    batch_infos = np.array([info, info])
    rewards = env.compute_reward(batch_achieved_goals, batch_desired_goals, batch_infos)  # type: ignore[attr-defined]
    assert rewards.shape == (2,), f"Unexpected shape for vectorized computation of reward: {rewards.shape} != (2,)"
    assert rewards[0] == reward, f"Vectorized computation of reward differs from single computation: {rewards[0]} != {reward}"
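The two checks in this function can be tried out without SB3 or Gymnasium installed. What follows is a simplified standalone sketch of the same logic, not the library's code: `check_compute_reward` and `dense_reward` are hypothetical names, the reward formula is only an example, and `np.shape(...)` replaces `.shape` so that a plain Python number fails the shape assertion instead of raising an AttributeError.

```python
import numpy as np


def check_compute_reward(compute_reward, achieved_goal, desired_goal, info=None):
    """Simplified re-implementation of the two checks (hypothetical helper)."""
    info = {} if info is None else info
    achieved_goal, desired_goal = np.array(achieved_goal), np.array(desired_goal)

    # Check 1: a single goal pair must yield something convertible to a float.
    reward = float(compute_reward(achieved_goal, desired_goal, info))

    # Check 2: a batch of two identical goal pairs must yield shape (2,).
    batch_achieved = np.array([achieved_goal, achieved_goal])
    batch_desired = np.array([desired_goal, desired_goal])
    batch_infos = np.array([info, info])
    rewards = compute_reward(batch_achieved, batch_desired, batch_infos)
    assert np.shape(rewards) == (2,), f"Unexpected shape: {np.shape(rewards)} != (2,)"
    assert rewards[0] == reward, f"Batched and single rewards differ: {rewards[0]} != {reward}"
    return reward, rewards


def dense_reward(achieved_goal, desired_goal, info=None):
    # Example reward only: negative Euclidean distance along the last axis.
    return -np.linalg.norm(achieved_goal - desired_goal, axis=-1)


single, batched = check_compute_reward(dense_reward, [0.0, 0.0, 0.0], [3.0, 4.0, 0.0])
print(single, batched)
```

A `compute_reward` that passes this helper returns one reward per row of the batch and the same value for the single-goal call.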
You need to consider the non-batched case as well. I've updated the code example.
This vectorization seems to be less easy than expected. I made every change you suggested, but now I get the following error:
assert rewards.shape == (2,), f"Unexpected shape for vectorized computation of reward: {rewards.shape} != (2,)"
AssertionError: Unexpected shape for vectorized computation of reward: () != (2,)
It is still not clear to me what triggers that problem. The reward will be a vector and no longer a scalar: OK. But my compute_reward method is no different from other implementations. What is the problem here?
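One way to end up with a shape of () for a batch of two goals is to call np.linalg.norm without an axis argument, as in the compute_reward posted earlier in this thread: the norm is then taken over every element of the batch at once. A minimal sketch, assuming that same reward formula (the name `reward_without_axis` is made up for illustration, and this may not be the only cause):

```python
import numpy as np


def reward_without_axis(achieved_goal, desired_goal):
    # No `axis` argument: the norm collapses the whole input to ONE number,
    # regardless of how many goals are stacked in the batch.
    distance = np.linalg.norm(
        np.asarray(achieved_goal, dtype=np.float32) - np.asarray(desired_goal, dtype=np.float32)
    )
    return -np.array(distance > 0.05, dtype=np.float32)


batch_achieved = np.array([[0.1, 0.1, 0.1], [0.2, 0.2, 0.2]])
batch_desired = np.array([[0.1, 0.1, 0.1], [0.3, 0.3, 0.3]])

rewards = reward_without_axis(batch_achieved, batch_desired)
print(rewards.shape)  # () instead of the expected (2,)
```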
It's hard to say because you're giving a traceback without any code. In general, we try to work from an MRE (minimal reproducible example). In fact, it's quite simple: when the compute_reward method takes a batch of observations as input, it must return an array of rewards; when it receives a single observation, it must return the reward as a float.
Thank you very much for your patience. I understood the concept, but if it outputs a float, I get one error, and if it outputs an array, I get another error. For instance, if I return a float, then I get:
File "/home/ubuntu/Downloads/deepbots-env/lib/python3.10/site-packages/stable_baselines3/common/env_checker.py", line 189, in _check_goal_env_compute_reward
assert rewards.shape == (2,), f"Unexpected shape for vectorized computation of reward: {rewards.shape} != (2,)"
AttributeError: 'int' object has no attribute 'shape'
My code is the following:
def _compute_reward_not_vectorized(self, achieved_goal, desired_goal, info = {}):
    """
    This is an abstract method from Gymnasium:
    https://github.com/Farama-Foundation/Gymnasium-Robotics/blob/a35b1c1fa669428bf640a2c7101e66eb1627ac3a/gym_robotics/core.py#L8
    """
    self.distance = np.linalg.norm(np.array(achieved_goal, dtype = np.float32) - np.array(desired_goal, dtype = np.float32))
    if abs(self.distance) < 0.05:
        reward = 1
    else:
        reward = 0
    return reward
def compute_reward(self, achieved_goal, desired_goal, info = {}):
    if any(isinstance(elem, list) for elem in achieved_goal):
        rewards = []
        for i in range(len(achieved_goal)):
            rewards.append(self._compute_reward_not_vectorized(achieved_goal[i], desired_goal[i]))
        rewards = np.array(rewards, dtype = np.float32).reshape(len(rewards), 1)
        return rewards
    else:
        rewards = self._compute_reward_not_vectorized(achieved_goal, desired_goal)
        return rewards
Now, changing the last line in:
rewards = self._compute_reward_not_vectorized(achieved_goal, desired_goal)
then I get the following error:
assert rewards.shape == (2,), f"Unexpected shape for vectorized computation of reward: {rewards.shape} != (2,)"
AssertionError: Unexpected shape for vectorized computation of reward: (1, 1) != (2,)
I understood the concept, but if it outputs a float, then I get one error, if it is an array, then an other error.
Ok, I think you don't get my point: the function should handle both cases. That's what we mean by vectorized.
import numpy as np
def _compute_reward_not_vectorized(achieved_goal, desired_goal, info = {}):
    distance = np.linalg.norm(np.array(achieved_goal) - np.array(desired_goal))
    if abs(distance) < 0.05:
        reward = 1
    else:
        reward = 0
    return reward
def compute_reward(achieved_goal, desired_goal, info = {}):
    if len(achieved_goal.shape) == 2:
        rewards = []
        for i in range(len(achieved_goal)):
            rewards.append(_compute_reward_not_vectorized(achieved_goal[i], desired_goal[i]))
        rewards = np.array(rewards)
        return rewards
    else:
        rewards = _compute_reward_not_vectorized(achieved_goal, desired_goal)
        return rewards
achieved_goal = np.array([0.1, 0.1, 0.1])
desired_goal = np.array([0.1, 0.1, 0.1])
print(compute_reward(achieved_goal, desired_goal)) # 1
achieved_goal = np.array([[0.1, 0.1, 0.1], [0.2, 0.2, 0.2]])
desired_goal = np.array([[0.1, 0.1, 0.1], [0.3, 0.3, 0.3]])
print(compute_reward(achieved_goal, desired_goal)) # [1 0]
Edit: use arrays
Can you run check_env() on it?
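The Python loop in the example above is not strictly required. As a sketch under the same assumptions (1-D goal vectors, a threshold of 0.05, reward 1 when close and 0 otherwise), passing axis=-1 to np.linalg.norm covers both the single and the batched case in one expression. The name `compute_reward_vectorized` is hypothetical, and the single-goal result is a NumPy float32 scalar rather than a Python int:

```python
import numpy as np


def compute_reward_vectorized(achieved_goal, desired_goal, info=None, threshold=0.05):
    # Coerce to arrays so lists and arrays are treated the same way.
    achieved_goal = np.asarray(achieved_goal, dtype=np.float32)
    desired_goal = np.asarray(desired_goal, dtype=np.float32)
    # axis=-1 reduces only over the goal dimension:
    # shape (goal_dim,) -> scalar, shape (batch, goal_dim) -> (batch,)
    distance = np.linalg.norm(achieved_goal - desired_goal, axis=-1)
    return (distance < threshold).astype(np.float32)


single = compute_reward_vectorized(np.array([0.1, 0.1, 0.1]), np.array([0.1, 0.1, 0.1]))
batch = compute_reward_vectorized(
    np.array([[0.1, 0.1, 0.1], [0.2, 0.2, 0.2]]),
    np.array([[0.1, 0.1, 0.1], [0.3, 0.3, 0.3]]),
)
print(float(single))  # 1.0
print(batch.shape)    # (2,)
```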
Your example works on my PC, but running check_env() outputs the following error:
assert rewards.shape == (2,), f"Unexpected shape for vectorized computation of reward: {rewards.shape} != (2,)"
AttributeError: 'int' object has no attribute 'shape'
This is the point. A float is expected when the reward is a single float. But at the same time it expects an object with shape (2,).
This is the problem.
I reopened the question, since the problem persists and my alleged solution led to more errors later. The question is: how is it possible to expect that a float has a shape of (2,)?
Solved! For anyone who has the same issue: the problem is in this checking function:
def _check_goal_env_compute_reward(
    obs: Dict[str, Union[np.ndarray, int]],
    env: gym.Env,
    reward: float,
    info: Dict[str, Any],
) -> None:
    """
    Check that reward is computed with `compute_reward`
    and that the implementation is vectorized.
    """
    achieved_goal, desired_goal = obs["achieved_goal"], obs["desired_goal"]
    assert reward == env.compute_reward(  # type: ignore[attr-defined]
        achieved_goal, desired_goal, info
    ), "The reward was not computed with `compute_reward()`"
    achieved_goal, desired_goal = np.array(achieved_goal), np.array(desired_goal)
    batch_achieved_goals = np.array([achieved_goal, achieved_goal])
    batch_desired_goals = np.array([desired_goal, desired_goal])
    if isinstance(achieved_goal, int) or len(achieved_goal.shape) == 0:
        batch_achieved_goals = batch_achieved_goals.reshape(2, 1)
        batch_desired_goals = batch_desired_goals.reshape(2, 1)
    batch_infos = np.array([info, info])
    rewards = env.compute_reward(batch_achieved_goals, batch_desired_goals, batch_infos)  # type: ignore[attr-defined]
    assert rewards.shape == (2,), f"Unexpected shape for vectorized computation of reward: {rewards.shape} != (2,)"
    assert rewards[0] == reward, f"Vectorized computation of reward differs from single computation: {rewards[0]} != {reward}"
In the first pass, only obs['achieved_goal'] and obs['desired_goal'] are passed. So they are lists.
But in the second pass, they are cast to np.arrays.
In short: in your compute_reward, you should check whether you are getting lists or numpy arrays.
Thanks for sharing, but I would disagree: goals can't be lists, since the observation space entries (the goal keys) are Box spaces. They must be arrays in all cases.
❓ Question
I'm having a hard time reimplementing my code while migrating it to Gymnasium. There are a few very interesting questions and answers on this topic, like here, here and here. As far as I understand, a vectorized environment (necessary for HER) is basically an environment where observations and rewards can be batched, which is more computationally efficient since a batch of training steps is executed in parallel.
So I have the following questions:
obs['achieved_goal'] and obs['desired_goal']) is passed to that method, there is no batch size at all. So I would expect that the reward is always a scalar number and not an array with dimension (batch_size, 1). Why? Because when I pass a batch size to the class (let's take PPO for example), then it wouldn't pass batch_size x obs to one compute_reward method, but it would call compute_reward as many times as batch_size. Or am I wrong? If I'm wrong, then I do not understand why the step() or get_obs() methods do not output something with the dimension (batch_size, obs).
Checklist
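For context on why compute_reward can receive a batch at all: with HER, the goals of already-stored transitions are replaced after the fact, and the rewards of the relabeled transitions have to be recomputed. The sketch below illustrates that idea with plain NumPy. It is a simplified illustration of "future" goal relabeling, not SB3's HerReplayBuffer code, and all names and sizes are made up.

```python
import numpy as np


def compute_reward(achieved_goal, desired_goal, info=None):
    # Sparse reward along the last axis: -1 when farther than 0.05, else 0.
    distance = np.linalg.norm(achieved_goal - desired_goal, axis=-1)
    return -(distance > 0.05).astype(np.float32)


rng = np.random.default_rng(0)
episode_len, goal_dim, batch_size = 20, 3, 8

# Achieved goals recorded along one stored episode (random stand-in data).
achieved = rng.uniform(-1.0, 1.0, size=(episode_len, goal_dim)).astype(np.float32)

# Sample a batch of transition indices from the stored episode.
t = rng.integers(0, episode_len - 1, size=batch_size)
# "future" strategy: for each sampled step, pick a later step of the same episode
# and use what was achieved there as the new desired goal.
future_t = rng.integers(t + 1, episode_len)
new_goals = achieved[future_t]
next_achieved = achieved[t + 1]

# One call for the whole relabeled batch: this is where a vectorized
# compute_reward is needed, independently of how step() is called.
rewards = compute_reward(next_achieved, new_goals)
print(rewards.shape)  # (8,)
```

In this picture step() and get_obs() still handle one transition at a time; only the reward recomputation over stored data is batched.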