Closed qgallouedec closed 1 year ago
From https://github.com/DLR-RM/stable-baselines3/issues/1335#issuecomment-1434818965
```python
import numpy as np
from stable_baselines3 import HerReplayBuffer, SAC
import gym
from gym import spaces


class TestEnv(gym.GoalEnv):
    def __init__(self):
        self.observation_space = spaces.Dict(
            dict(
                an_observation_1=spaces.Box(0, 1, (1,), dtype=np.float32),
                an_observation_2=spaces.Box(0, 1, (1,), dtype=np.float32),
                achieved_goal=spaces.Box(0, 1, (1,), dtype=np.float32),
                desired_goal=spaces.Box(0, 1, (1,), dtype=np.float32),
            )
        )
        self.action_space = spaces.Box(0, 1, (1,), dtype=np.float32)
        self.current_step = 0
        self.ep_length = 10

    def reset(self):
        self.current_step = 0
        state = self._generate_next_state()
        return state

    def step(self, action):
        obs = self._generate_next_state()
        self.current_step += 1
        done = self.current_step >= self.ep_length
        return obs, -1, done, {}

    def _generate_next_state(self):
        state = {}
        for k, space in self.observation_space.spaces.items():
            state[k] = space.sample()
        return state

    def render(self, mode: str = "human") -> None:
        pass

    def compute_reward(self, achieved_goal, desired_goal, info):
        return np.array(-1, np.float32)


env = TestEnv()
model = SAC(
    "MultiInputPolicy",
    env,
    replay_buffer_class=HerReplayBuffer,
    learning_starts=1000,
    verbose=1,
)
model.learn(3000, progress_bar=True)
```
The problem comes from the `compute_reward` method: it has to be vectorized.

```python
def compute_reward(self, achieved_goal, desired_goal, info):
    return np.zeros(achieved_goal.shape[0], np.float32)
```
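To see why the scalar version breaks, compare the shapes of the two return values. A quick sketch (the batch size of 4 is arbitrary, chosen only for illustration):

```python
import numpy as np

# HER replay calls compute_reward with batches of goals,
# e.g. arrays of shape (batch_size, goal_dim).
achieved = np.zeros((4, 1), np.float32)
desired = np.ones((4, 1), np.float32)

# Scalar version: one value, regardless of the batch size.
scalar_reward = np.array(-1, np.float32)
print(scalar_reward.shape)  # ()

# Vectorized version: one reward per transition in the batch.
batch_reward = np.zeros(achieved.shape[0], np.float32)
print(batch_reward.shape)  # (4,)
```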
@araffin I'm pretty sure that this error will occur often. I would add a check in env checker, what do you think?
Let me guess, you're going to tell me it's already done ;)
> The problem comes from the `compute_reward` method: it has to be vectorized.
>
> ```python
> def compute_reward(self, achieved_goal, desired_goal, info):
>     return np.zeros(achieved_goal.shape[0], np.float32)
> ```
I got stuck at the same point. Moving to Gymnasium, I had to vectorize the `compute_reward` method as expected by SB3.
The problem here is that, since the `achieved_goal` array contains 3 scalars, the resulting reward will be the same value repeated three times, like `[0, 0, 0]`.
But running the checker leads to the following error:

```
File "/home/ubuntu/Downloads/deepbots-env/lib/python3.10/site-packages/stable_baselines3/common/env_checker.py", line 322, in _check_returned_values
    _check_goal_env_compute_reward(obs, env, float(reward), info)
TypeError: only length-1 arrays can be converted to Python scalars
```
What should the reward look like? A float (the reward) is expected, but due to vectorization, `compute_reward` will output an array.
Yes, the reward should be an array. The function should take a batch of achieved/desired goals and return a 1-dimensional array of rewards.
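A minimal sketch of such a vectorized `compute_reward` for a distance-based goal, assuming a success threshold of 0.05 (the threshold and the sparse 0/-1 scheme are illustrative choices, not from this thread):

```python
import numpy as np

def compute_reward(achieved_goal, desired_goal, info):
    # achieved_goal and desired_goal arrive as batches of shape (batch_size, goal_dim).
    distances = np.linalg.norm(achieved_goal - desired_goal, axis=-1)
    # One reward per row: 0 if within tolerance of the goal, else -1.
    return np.where(distances < 0.05, 0.0, -1.0).astype(np.float32)

# A batch of 3 goals in -> an array of 3 rewards out.
rewards = compute_reward(
    np.zeros((3, 1), np.float32), np.ones((3, 1), np.float32), [{}] * 3
)
print(rewards.shape)  # (3,)
```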
Thanks! I set `learning_starts` at 600, as my environment has a time limit of 300. The error mentioned is solved; however, I get the following:

Originally posted by @tudorjnu in https://github.com/DLR-RM/stable-baselines3/issues/1335#issuecomment-1434713057