Farama-Foundation / MO-Gymnasium

Multi-objective Gymnasium environments for reinforcement learning
http://mo-gymnasium.farama.org/
MIT License

Pareto Front generated by Resource Gathering is not optimal #104

Closed Saood1810 closed 2 months ago

Saood1810 commented 2 months ago

I have trained the PQL algorithm in the Resource Gathering environment, and it obtains a higher hypervolume score than the true front. It only finds one non-dominated solution, and even that is not one of the solutions of the true Pareto front.

```python
import random

import numpy as np
import mo_gymnasium as mo_gym
from mo_gymnasium.wrappers import MORecordEpisodeStatistics  # import location may differ across versions
from morl_baselines.multi_policy.pareto_q_learning.pql import PQL

SEEDS = [42]  # 10 seeds

env = MORecordEpisodeStatistics(mo_gym.make("resource-gathering-v0"), gamma=0.99)
ref_point = np.array([-1, -1, -2])

for seed in SEEDS:
    print(f"Running experiment with seed {seed}")
    random.seed(seed)
    np.random.seed(seed)
    env.reset(seed=seed)

    agent = PQL(
        env,
        ref_point,
        gamma=0.99,
        initial_epsilon=1,
        epsilon_decay_steps=100000,
        final_epsilon=0.1,
        seed=seed,
        project_name="Research Project Logs",
        experiment_name="Pareto Q-Learning in DST",
        log=True,
    )

    pf = agent.train(
        total_timesteps=100000,
        log_every=100,
        action_eval="hypervolume",
        known_pareto_front=env.pareto_front(gamma=0.99),
        ref_point=ref_point,
        eval_env=env,
    )
    print(pf)
```

```python
from pymoo.visualization.scatter import Scatter
from pymoo.problems import get_problem
from morl_baselines.common.pareto import filter_pareto_dominated  # assumed MORL-Baselines import paths
from morl_baselines.common.performance_indicators import hypervolume

eval_env = MORecordEpisodeStatistics(mo_gym.make("resource-gathering-v0"), gamma=0.99)
current_pf = np.array(filter_pareto_dominated(agent._eval_all_policies(eval_env)))

true_pf = eval_env.pareto_front(gamma=0.99)
print(f"The true Pareto front {true_pf}")
print(f"The current Pareto front {current_pf}")
ref_point = np.array([-1, -1, -2])
print(f"Hypervolume of True front {hypervolume(ref_point, true_pf)}")
print(f"Hypervolume of Approx front {hypervolume(ref_point, current_pf)}")

Scatter().add(np.array(true_pf)).add(np.array(current_pf)).show()
```

Output:

```
The true Pareto front [array([-0.09320653, 0.78187123, 0.78187123]), array([0. , 0.88638487, 0. ]), array([0. , 0. , 0.90438208])]
The current Pareto front [[0. 0.91351724 0. ]]
Hypervolume of True front 5.23149517778104
Hypervolume of Approx front 3.8270344734191895
```

[Scatter plot of the true vs. approximate Pareto front]

Additionally, when I visualise the Pareto front with a gamma of 0.9, I see more non-dominated solutions.

LucasAlegre commented 2 months ago

PQL assumes deterministic environments, while Resource Gathering is stochastic due to the random chance of being killed by the enemies. The PF output by PQL uses only one episode to evaluate each policy (since it assumes the environment is deterministic); therefore, it does not estimate the expected return, only the return of a single trajectory.

Regarding the fact that there are more points in the Pareto Front for gamma=0.9, this is correct. The shape of the Pareto Front can change depending on the discount factor used since the Pareto Front is computed w.r.t. the discounted return.
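For reference, a minimal sketch of what a multi-episode evaluation in a stochastic environment could look like, averaging the discounted vector return over many rollouts instead of relying on a single trajectory. Only the standard Gymnasium/MO-Gymnasium step API and the `reward_space` attribute are assumed; `policy` is a hypothetical placeholder for a trained agent's greedy policy:

```python
import numpy as np


def estimate_expected_return(env, policy, n_episodes=100, gamma=0.99):
    """Monte-Carlo estimate of the expected discounted vector return.

    `policy` is a hypothetical callable mapping an observation to an action.
    """
    returns = []
    for _ in range(n_episodes):
        obs, _ = env.reset()
        done = False
        disc_return = np.zeros(env.unwrapped.reward_space.shape[0])
        t = 0
        while not done:
            obs, vec_reward, terminated, truncated, _ = env.step(policy(obs))
            disc_return += (gamma**t) * np.asarray(vec_reward, dtype=float)
            done = terminated or truncated
            t += 1
        returns.append(disc_return)
    # Averaging over many episodes approximates the expected return,
    # which is what the true Pareto front is defined on.
    return np.mean(returns, axis=0)
```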

Saood1810 commented 2 months ago

Oh alright, that makes sense! Do you know which discrete-action and discrete-observation environments, besides DST, I can use to compare the performance of the single-policy MO Q-Learning algorithm and PQL? I was thinking of DST Mirrored and Four Room. Alternatively, do you know which other multi-policy algorithm I can use alongside MO Q-Learning to test in Resource Gathering and Four Room? I really appreciate your advice; you have been helping me a lot.

LucasAlegre commented 2 months ago

I suggest you look at the gridworld environments: https://mo-gymnasium.farama.org/environments/grid-world/ and check which are deterministic.
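As a rough way to check determinism programmatically, one could replay the same action sequence under different RNG seeds and compare the reward traces. This is a heuristic sketch, not part of the library, and it assumes a fixed initial state (which holds for these gridworlds):

```python
import numpy as np
import mo_gymnasium as mo_gym


def looks_deterministic(env_id, n_steps=200, n_seeds=5):
    """Heuristic: replay one fixed action sequence under different seeds;
    differing reward traces indicate stochastic dynamics."""
    probe = mo_gym.make(env_id)
    actions = [probe.action_space.sample() for _ in range(n_steps)]
    probe.close()

    traces = []
    for seed in range(n_seeds):
        env = mo_gym.make(env_id)
        env.reset(seed=seed)
        rewards = []
        for a in actions:
            _, vec_reward, terminated, truncated, _ = env.step(a)
            rewards.append(np.asarray(vec_reward, dtype=float))
            if terminated or truncated:
                env.reset()
        traces.append(np.array(rewards))
        env.close()
    return all(np.allclose(t, traces[0]) for t in traces)


# Example (the outcome depends on whether the random actions reach the stochastic parts of the map):
# print(looks_deterministic("resource-gathering-v0"))
```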

For tabular algorithms we only have MOQLearning (which can be turned into GPI-LS if you set use_gpi_policy=True and weight_selection_algo="gpi-ls"). You can also use one of the deep MORL algorithms.

Saood1810 commented 2 months ago

Hi again, I ran the MO Q-Learning algorithm in Resource Gathering with different weight tuples using a step of 0.1, e.g. (0.1, -0.3, 0.6), and here are the non-dominated solutions I got with a gamma of 0.99: [array([0. , 0. , 0.91351724]), array([0. , 0.89533824, 0. ]), array([0. , 0.87752104, 0.87752104])]

The true front consists of 3 solutions and is: [array([-0.09320653, 0.78187123, 0.78187123]), array([0. , 0.88638487, 0. ]), array([0. , 0. , 0.90438208])]

I think the 1st and 2nd non-dominated solutions match the PF, even though they are slightly off. I am sure my last one is valid, though, because once I get the discounted reward vector for a certain weight tuple, I apply the filter_pareto_dominated function, which would remove it if it were dominated.
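For concreteness, a sketch of the kind of weight sweep described above, with the weight grid and the dominance filter written in plain NumPy. It assumes non-negative weights that sum to one (which may differ from the exact grid used above), and `train_for_weights` is a hypothetical placeholder for training MOQLearning with one weight vector and returning its evaluated discounted return vector:

```python
import itertools

import numpy as np


def weight_grid(n_objectives=3, step=0.1):
    """All non-negative weight vectors on the simplex with the given step size."""
    ticks = np.linspace(0.0, 1.0, int(round(1.0 / step)) + 1)
    return [
        np.array(w)
        for w in itertools.product(ticks, repeat=n_objectives)
        if np.isclose(sum(w), 1.0)
    ]


def filter_dominated(points):
    """Keep only the non-dominated return vectors (maximization)."""
    points = [np.asarray(p, dtype=float) for p in points]
    keep = []
    for i, p in enumerate(points):
        dominated = any(
            np.all(q >= p) and np.any(q > p)
            for j, q in enumerate(points)
            if j != i
        )
        if not dominated:
            keep.append(p)
    return keep


# Hypothetical usage:
# returns = [train_for_weights(w) for w in weight_grid()]
# print(filter_dominated(returns))
```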

LucasAlegre commented 2 months ago

How many episodes are you using to evaluate each point? It is possible that you need more.

Saood1810 commented 2 months ago

I used the eval_mo() function, which only evaluates the policy for 1 episode. I will try the policy_evaluation_mo() function with ten episodes.

Saood1810 commented 2 months ago

Update: I used ten episodes to evaluate each point and took the average of the results, essentially using the policy_evaluation_mo() function with rep set to 10. Here are the results I got: [array([0. , 0.89533824, 0. ]), array([0. , 0. , 0.91351724]), array([0. , 0.87752104, 0.87752104])] These are the same results as the ones I had before. I also trained the algorithm ten times, with a different seed each time, and have been getting the same rewards. I wonder what could be the issue.

Saood1810 commented 1 month ago

Similarly, with a gamma of 0.9, the true front is [array([-0.08269753, 0.22876792, 0.22876792]), array([-0.1260441 , 0.34867844, 0. ]), array([0. , 0. , 0.34867844]), array([-0.04782969, 0.20589113, 0.20589113]), array([-0.04782969, 0.3138106 , 0. ]), array([0. , 0.28242954, 0. ])]

while MOQ with linear scalarisation gets [array([0. , 0. , 0.38742048]), array([-0.04304672, 0.28242953, 0.28242953]), array([0. , 0.25418657, 0.25418657]), array([0. , 0.38742048, 0. ]), array([-0.081 , 0.34867843, 0. ]), array([0. , 0.31381059, 0. ])]

and MOQ with Chebyshev scalarisation gets [array([0. , 0. , 0.38742048]), array([0. , 0.38742048, 0. ]), array([0. , 0.31381059, 0.31381059])]
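To compare these fronts quantitatively rather than by eye, one could reuse the same hypervolume indicator as in the first post. The import path below is from MORL-Baselines and is an assumption; the value vectors are copied from the results above:

```python
import numpy as np
from morl_baselines.common.performance_indicators import hypervolume

ref_point = np.array([-1, -1, -2])

fronts = {
    "true PF (gamma=0.9)": [
        [-0.08269753, 0.22876792, 0.22876792],
        [-0.1260441, 0.34867844, 0.0],
        [0.0, 0.0, 0.34867844],
        [-0.04782969, 0.20589113, 0.20589113],
        [-0.04782969, 0.3138106, 0.0],
        [0.0, 0.28242954, 0.0],
    ],
    "MOQ linear": [
        [0.0, 0.0, 0.38742048],
        [-0.04304672, 0.28242953, 0.28242953],
        [0.0, 0.25418657, 0.25418657],
        [0.0, 0.38742048, 0.0],
        [-0.081, 0.34867843, 0.0],
        [0.0, 0.31381059, 0.0],
    ],
    "MOQ Chebyshev": [
        [0.0, 0.0, 0.38742048],
        [0.0, 0.38742048, 0.0],
        [0.0, 0.31381059, 0.31381059],
    ],
}

for name, front in fronts.items():
    # hypervolume(ref_point, points) follows the MORL-Baselines signature used earlier.
    print(f"{name}: hypervolume = {hypervolume(ref_point, [np.array(p) for p in front]):.4f}")
```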

LucasAlegre commented 1 month ago

It seems you are actually learning all policies in the Pareto front.

The value vectors are clearly not the same as in the true PF because 10 samples are not enough to approximate the expected value of the return. Imagine you want to approximate the mean height of some population: randomly sampling 10 people and computing their average height would very likely not give the true mean of the population.
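To make the sampling argument concrete, a small sketch in pure NumPy of how the standard error of a Monte-Carlo estimate of the return shrinks with the number of evaluation episodes; the per-episode return spread below is an illustrative assumption, not a value measured in Resource Gathering:

```python
import numpy as np

# Illustrative assumption: the per-episode discounted return of one objective
# has standard deviation sigma (not measured from Resource Gathering).
sigma = 0.3

for n_episodes in [1, 10, 100, 1000]:
    stderr = sigma / np.sqrt(n_episodes)  # standard error of the mean: sigma / sqrt(n)
    print(f"{n_episodes:>5} episodes -> standard error of the estimate ~ {stderr:.3f}")
```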