What environment is this? Can you share the code?
I am using the DST environment. I want to track test performance by applying the learnt policy in the environment and obtaining episodic returns. I see now that there's a function policy_evaluation_mo() that actually does this and returns average episodic returns; I tried it and it works. Anyway, my code was trying to get the returns for each episode and then work out the average return per episode. I'm doing a comparative analysis of selected MORL algorithms, and the MO Q-Learning function is one of the algorithms I'm using. I am still figuring out a way to compare the performance of the algorithms, since some are multi-policy and some are single-policy, but I thought that to evaluate a single policy it would be nice to obtain the returns for each episode, plot them on a learning curve, and thereafter work out the average episodic returns. Perhaps there could be an error on my side. I just started learning MORL :). Ignore the traceback I pasted above. Here is the correct one for my code.

Traceback:

```
TypeError                                 Traceback (most recent call last)
in <cell line: 20>()
     25     while not done:
     26         # Use the wrapper function to convert obs to tuple before calling eval
---> 27         next_obs, vector_reward, terminated, truncated, info = env.step(agent.eval(obs))
     28         # Take the action in the environment
     29         episode_reward += vector_reward

/usr/local/lib/python3.10/dist-packages/morl_baselines/single_policy/ser/mo_q_learning.py in eval(self, obs, w)
    155     """Greedily chooses best action using the scalarization method"""
    156     t_obs = tuple(obs)
--> 157     if t_obs not in self.q_table:
    158         return int(self.env.action_space.sample())
    159     scalarized = np.array(

TypeError: unhashable type: 'numpy.ndarray'
```
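For anyone hitting the same error, a minimal illustration of the underlying Python behavior (not from the thread, just a sketch): a NumPy array cannot be used as a dictionary key, while a tuple of scalars can.

```python
import numpy as np

obs = np.array([0, 0])
q_table = {}

# q_table[obs] = np.zeros(4)       # TypeError: unhashable type: 'numpy.ndarray'
q_table[tuple(obs)] = np.zeros(4)  # tuples of scalars are hashable, so this works
print((0, 0) in q_table)           # True
```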
Code:

```python
# Imports assumed from context; exact paths may differ between library versions.
import time

import numpy as np
import mo_gymnasium as mo_gym
from mo_gymnasium.utils import MORecordEpisodeStatistics
from morl_baselines.common.scalarization import tchebicheff
from morl_baselines.single_policy.ser.mo_q_learning import MOQLearning

env = MORecordEpisodeStatistics(mo_gym.make("deep-sea-treasure-v0"), gamma=0.9)
eval_env = mo_gym.make("deep-sea-treasure-v0")
scalarization = tchebicheff(tau=4.0, reward_dim=2)
weights = np.array([0.3, 0.7])
agent = MOQLearning(env, scalarization=scalarization, learning_rate=0.1, weights=weights, log=True)

agent.train(
    total_timesteps=1000,
    start_time=time.time(),
    eval_freq=100,
    eval_env=eval_env,
)

obs = eval_env.reset()

num_episodes = 100
total_rewards = []

for episode in range(num_episodes):
    obs = eval_env.reset()
    done = False
    episode_reward = 0

    while not done:
        next_obs, vector_reward, terminated, truncated, info = env.step(agent.eval(obs))
        # Take the action in the environment
        episode_reward += vector_reward
        # Move to the next state
        obs = next_obs

    total_rewards.append(episode_reward)
    print(f"Episode {episode+1}: Total Reward = {episode_reward}")

average_reward = np.mean(total_rewards)
print(f"Average Reward over {num_episodes} episodes: {average_reward}")
```
I also modified this line:

```python
next_obs, vector_reward, terminated, truncated, info = env.step(agent.eval(obs))
```

to:

```python
next_obs, vector_reward, terminated, truncated, info = env.step(agent.eval(tuple(obs)))
```

and it gives the same error.
Hi,
The problem is in the line:

```python
obs = eval_env.reset()
```

which should be:

```python
obs, info = eval_env.reset()
```

This should fix your problem ;)
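For context, Gymnasium's `reset()` returns an `(observation, info)` pair rather than the observation alone, so without unpacking, `obs` is the whole pair. A minimal check:

```python
import mo_gymnasium as mo_gym

eval_env = mo_gym.make("deep-sea-treasure-v0")

result = eval_env.reset()
print(type(result))   # <class 'tuple'>: (observation array, info dict)

obs, info = eval_env.reset()
print(type(obs))      # <class 'numpy.ndarray'>: the actual observation
```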
@Saood1810 is this fixed?
Yes, sorry for not replying. Appreciate your help! There was also a small issue on my side with the `while not done` loop: I didn't update the `done` value, which meant the loop never ended. The code for that segment should read:
```python
while not done:
    next_obs, vector_reward, terminated, truncated, info = env.step(agent.eval(obs))
    episode_reward += vector_reward
    done = terminated or truncated
    # Move to the next state
    obs = next_obs
```
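Putting both fixes together, the evaluation loop could look like the sketch below. Two further assumptions on my part: it steps `eval_env` (the original snippet stepped `env` while resetting `eval_env`), and it averages per objective with `axis=0`, since the rewards are vectors.

```python
import numpy as np

num_episodes = 100
total_rewards = []

for episode in range(num_episodes):
    obs, info = eval_env.reset()       # unpack (obs, info) per the fix above
    done = False
    episode_reward = np.zeros(2)       # DST returns a 2-objective vector reward

    while not done:
        action = agent.eval(obs)
        obs, vector_reward, terminated, truncated, info = eval_env.step(action)
        episode_reward += vector_reward
        done = terminated or truncated

    total_rewards.append(episode_reward)
    print(f"Episode {episode + 1}: Total Reward = {episode_reward}")

average_reward = np.mean(total_rewards, axis=0)  # mean return per objective
print(f"Average Reward over {num_episodes} episodes: {average_reward}")
```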
There is an issue with the implementation of the eval function. The issue appears to be that `obs` is a NumPy array, which is an unhashable type in Python and cannot be used as a key in a dictionary (`self.q_table`). Strangely, I think the way you converted it to a tuple and called it `t_obs` still doesn't solve the issue.
Traceback is as follows:

```
TypeError                                 Traceback (most recent call last)
in <cell line: 38>()
     40
     41 # Evaluate policy
---> 42 rewards = evaluate_policy(agent, eval_env)
     43
     44 # Calculate hypervolume

1 frames
in evaluate_policy(policy, eval_env, num_episodes)
     25     episode_rewards = []
     26     while not done:
---> 27         action = policy.eval(obs)
     28         obs, reward, done, = eval_env.step(action)
     29         episode_rewards.append(reward)

/usr/local/lib/python3.10/dist-packages/morl_baselines/single_policy/ser/mo_q_learning.py in eval(self, obs, w)
    155     """Greedily chooses best action using the scalarization method"""
    156     t_obs = tuple(obs)
--> 157     if t_obs not in self.q_table:
    158         return int(self.env.action_space.sample())
    159     scalarized = np.array(

TypeError: unhashable type: 'numpy.ndarray'
```
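The reason the `tuple(obs)` conversion inside `eval` can still fail is the same `reset()` issue described above: if `obs` is the full `(observation, info)` pair, converting it to a tuple still leaves an ndarray inside, and a tuple is only hashable when all of its elements are. A minimal demonstration, assuming DST's 2-element observation:

```python
import numpy as np

obs_pair = (np.array([0, 0]), {})  # what reset() returns when not unpacked
t_obs = tuple(obs_pair)            # still (ndarray, dict), not (0, 0)
{t_obs: 0}                         # TypeError: unhashable type: 'numpy.ndarray'
```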