LucasAlegre / morl-baselines

Multi-Objective Reinforcement Learning algorithms implementations.
https://lucasalegre.github.io/morl-baselines
MIT License
271 stars 44 forks

Issue with MO Q Learning Function eval function #105

Closed: Saood1810 closed this issue 1 month ago

Saood1810 commented 1 month ago

There is an issue with the implementation of the eval function. The problem appears to be that obs is a NumPy array, which is an unhashable type in Python and cannot be used as a key in a dictionary (self.q_table). Strangely, the way you converted it to a tuple (t_obs) still doesn't seem to solve the issue.

Traceback is as follows:

in <cell line: 38>()
     40
     41 # Evaluate policy
---> 42 rewards = evaluate_policy(agent, eval_env)
     43
     44 # Calculate hypervolume

1 frames

in evaluate_policy(policy, eval_env, num_episodes)
     25     episode_rewards = []
     26     while not done:
---> 27         action = policy.eval(obs)
     28         obs, reward, done, _ = eval_env.step(action)
     29         episode_rewards.append(reward)

/usr/local/lib/python3.10/dist-packages/morl_baselines/single_policy/ser/mo_q_learning.py in eval(self, obs, w)
    155     """Greedily chooses best action using the scalarization method"""
    156     t_obs = tuple(obs)
--> 157     if t_obs not in self.q_table:
    158         return int(self.env.action_space.sample())
    159     scalarized = np.array(

TypeError: unhashable type: 'numpy.ndarray'
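
(For readers landing here from the same traceback, a minimal sketch of the Python behaviour involved, not code from the original report: a NumPy array cannot be a dictionary key, a flat tuple of scalars can, and a tuple that still contains an array cannot.)

# Minimal sketch (not from the report) of the hashing behaviour behind the error.
import numpy as np

q_table = {}
obs = np.array([0, 0])                      # a raw ndarray observation

try:
    q_table[obs] = np.zeros(2)              # ndarrays are unhashable -> cannot be dict keys
except TypeError as e:
    print(e)                                # unhashable type: 'numpy.ndarray'

t_obs = tuple(obs)                          # a flat tuple of scalars IS hashable
q_table[t_obs] = np.zeros(2)                # this is what eval()'s tuple(obs) relies on
print(t_obs in q_table)                     # True

nested = (np.array([0, 0]), {"step": 0})    # a tuple that still contains an ndarray
try:
    print(nested in q_table)                # hashing the tuple hashes its elements
except TypeError as e:
    print(e)                                # unhashable type: 'numpy.ndarray' again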

LucasAlegre commented 1 month ago

What environment is this? Can you share the code?

Saood1810 commented 1 month ago

I am using the DST environment. I want to track testing performance by applying the learnt policy in the environment and obtaining episodic returns. I see now there's a function policy_evaluation_mo() that actually does this and returns average episodic returns. I tried it and it works (a short usage sketch is included after my code below). Anyway, my code was trying to get the return for each episode and then work out the average return per episode. I'm doing a comparative analysis of selected MORL algorithms, and MOQLearning is one of the algorithms I'm using. I am still figuring out a way to compare the performance of the algorithms, since some are multi-policy and some are single-policy, but I thought that to evaluate a single policy it would be nice to obtain the return for each episode, plot the returns on a learning curve, and thereafter work out the average episodic return. Perhaps there could be an error on my side; I just started learning MORL :). Ignore the traceback I pasted above. Here is the correct one for my code:

TypeError                                 Traceback (most recent call last)
in <cell line: 20>()
     25     while not done:
     26         # Use the wrapper function to convert obs to tuple before calling eval
---> 27         next_obs, vector_reward, terminated, truncated, info = env.step(agent.eval(obs))
     28         # Take the action in the environment
     29         episode_reward += vector_reward

/usr/local/lib/python3.10/dist-packages/morl_baselines/single_policy/ser/mo_q_learning.py in eval(self, obs, w)
    155     """Greedily chooses best action using the scalarization method"""
    156     t_obs = tuple(obs)
--> 157     if t_obs not in self.q_table:
    158         return int(self.env.action_space.sample())
    159     scalarized = np.array(

TypeError: unhashable type: 'numpy.ndarray'

Code:

env = MORecordEpisodeStatistics(mo_gym.make("deep-sea-treasure-v0"), gamma=0.9)
eval_env = mo_gym.make("deep-sea-treasure-v0")
scalarization = tchebicheff(tau=4.0, reward_dim=2)
weights = np.array([0.3, 0.7])
agent = MOQLearning(env, scalarization=scalarization, learning_rate=0.1, weights=weights, log=True)

agent = MOQLearning(env, scalarization=scalarization, weights=weights, learning_rate=0.1, gamma=0.99, initial_epsilon=1, final_epsilon=0.1, epsilon_decay_steps=1000, log=True)

agent.train(
    total_timesteps=1000,
    start_time=time.time(),
    eval_freq=100,
    eval_env=eval_env,
)

obs = eval_env.reset()

num_episodes = 100
total_rewards = []

for episode in range(num_episodes):
    obs = eval_env.reset()
    done = False
    episode_reward = 0

    while not done:
        next_obs, vector_reward, terminated, truncated, info = env.step(agent.eval(obs))
        # Take the action in the environment
        episode_reward += vector_reward
        # Move to the next state
        obs = next_obs

    total_rewards.append(episode_reward)
    print(f"Episode {episode+1}: Total Reward = {episode_reward}")

# Calculate average reward over all episodes
average_reward = np.mean(total_rewards)
print(f"Average Reward over {num_episodes} episodes: {average_reward}")

print(eval_mo(agent, env=eval_env, w=weights))

I also modified the line

next_obs, vector_reward, terminated, truncated, info = env.step(agent.eval(obs))

to:

next_obs, vector_reward, terminated, truncated, info = env.step(agent.eval(tuple(obs)))

and it gives the same error.
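
For reference, a rough usage sketch of the policy_evaluation_mo() and eval_mo() helpers mentioned above (the import path and the rep keyword below are assumptions; check them against your installed morl_baselines version):

# Hedged sketch of the evaluation helpers; verify names against your version.
from morl_baselines.common.evaluation import eval_mo, policy_evaluation_mo

# Single evaluation episode, scalarized with the training weights used above
print(eval_mo(agent, env=eval_env, w=weights))

# Average returns over several evaluation episodes; the rep keyword name is an
# assumption here -- check the function's docstring in your installed version
print(policy_evaluation_mo(agent, env=eval_env, w=weights, rep=100))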

LucasAlegre commented 1 month ago

Hi,

The problem is in the line:

obs = eval_env.reset()

which should be:

obs, info = eval_env.reset()

This should fix your problem ;)
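
For context on why this fixes it: in Gymnasium-style environments, reset() returns an (observation, info) pair, so without unpacking, obs is a tuple that still contains an ndarray, and the tuple(obs) conversion inside eval() remains unhashable. A minimal sketch of the difference (the prints are illustrative):

# Sketch only; mo_gym is assumed to be the mo_gymnasium alias from the code above.
import mo_gymnasium as mo_gym

eval_env = mo_gym.make("deep-sea-treasure-v0")

result = eval_env.reset()
print(type(result))              # <class 'tuple'> -- an (observation, info) pair

obs = result                     # the buggy binding: obs is (array, info_dict)
# tuple(obs) == (array, info_dict), and hashing it raises
# "unhashable type: 'numpy.ndarray'" inside eval()

obs, info = eval_env.reset()     # the fix: obs is now the observation itself
print(type(obs))                 # <class 'numpy.ndarray'>
# tuple(obs) is now a flat tuple of scalars, usable as a q_table key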

ffelten commented 1 month ago

@Saood1810 is this fixed?

Saood1810 commented 1 month ago

Yes, sorry for not replying. Appreciate your help! There was also a small issue on my side with the while not done loop: I didn't update the done value, which meant the loop never ended. The code for that segment should read:

while not done:
    next_obs, vector_reward, terminated, truncated, info = env.step(agent.eval(obs))
    # Take the action in the environment
    episode_reward += vector_reward
    done = terminated or truncated
    # Move to the next state
    obs = next_obs
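
Putting both fixes together (the reset() unpacking and the done update), the per-episode evaluation loop from this thread could look roughly like the sketch below; agent, eval_env, and num_episodes are the ones defined in the code earlier, and stepping eval_env rather than the training env is assumed.

# Sketch combining both fixes discussed in this thread.
import numpy as np

total_rewards = []
for episode in range(num_episodes):
    obs, info = eval_env.reset()            # fix 1: unpack (observation, info)
    done = False
    episode_reward = np.zeros(2)            # DST has 2 objectives (reward_dim=2 above)

    while not done:
        action = agent.eval(obs)            # greedy action from the learnt Q-table
        obs, vector_reward, terminated, truncated, info = eval_env.step(action)
        episode_reward += vector_reward
        done = terminated or truncated      # fix 2: actually end the episode

    total_rewards.append(episode_reward)

# Objective-wise average return over all evaluation episodes
print(np.mean(total_rewards, axis=0))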