LucasAlegre / morl-baselines

Multi-Objective Reinforcement Learning algorithms implementations.
https://lucasalegre.github.io/morl-baselines
MIT License

Help Regarding GPILSContinuousAction Model Convergence #71

Closed · arshad171 closed this issue 10 months ago

arshad171 commented 10 months ago

Hi,

I am fairly new to multi-objective RL, but I have quite a bit of experience working with single-objective tasks.

I was trying to train the GPILSContinuousAction model on one of my quadrotor tasks, but the model doesn't seem to fully converge.

The task has two objectives. The single-objective PPO (ESR approach) had no problems converging, while the multi-objective approaches seem to have a hard time. Initially, I tried PGMORL since I was familiar with the PPO algorithm, but the performance was lacklustre, with poor convergence. With the GPILSContinuousAction model, the reward seems to stagnate at $-39$, while the highest possible reward is 0.

I have also tried fixing the weight supports prior to training to debug the convergence, using the weights [[1.0, 0.0], [1.0, 0.25]]. The first weight essentially turns the problem into a single-objective one; however, neither of the trained agents converges!
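Concretely, I fixed the support like this (a sketch only; I am assuming set_weight_support accepts a plain list of numpy weight vectors, and algo is the GPILSContinuousAction instance whose construction I show further below):

import numpy as np

# restrict the weight support to two vectors; the first reduces the problem
# to the single-objective case
WEIGHT_SUPPORTS = [np.array([1.0, 0.0]), np.array([1.0, 0.25])]
algo.set_weight_support(WEIGHT_SUPPORTS)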

[plots attached]

I would really appreciate your help!

Regards, Arshad

LucasAlegre commented 10 months ago

Hi Arshad,

From your plots, it seems the algorithm converged. Do you mean that it converged to a suboptimal solution? To what solution did the single-objective PPO converge, and with which reward weights?

arshad171 commented 10 months ago

Hi @LucasAlegre,

I am not sure the solution is close to optimal, because reward_0 is the (negative) distance of the quadrotor in meters from the target point, so $-39$ seems quite far off. Also, I have the reference point set to $[-100, -100]$, and I am not sure whether that is added to the eval metrics displayed on W&B.

The single-objective PPO managed to converge in both cases, with reward weights [1, 0] and [1, 0.25], achieving a maximum reward of $-0.2$ at the end of training (graph below).

[training curve attached]
LucasAlegre commented 10 months ago

It may be that there was not enough exploration, and the policy converged to the $-39$ solution because it couldn't find better actions while exploring. A reward of $-39$ does not look so bad, considering that the PPO agent performs around $-400$ for a long time during training. I would need to check more metrics and look more closely at your environment to understand it better.

The reference point is only used to compute the hypervolume metric.
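In case it is useful, here is a rough sketch of what the hypervolume metric measures, using pymoo's HV indicator directly (the front values are made up, and this is only an illustration, not exactly how the library logs it):

import numpy as np
from pymoo.indicators.hv import HV

ref_point = np.array([-100.0, -100.0])            # in reward (maximization) space
front = np.array([[-39.0, -5.0], [-45.0, -2.0]])  # hypothetical evaluated returns

# pymoo assumes minimization, so both the front and the reference point are negated
hv = HV(ref_point=-ref_point)(-front)
print(hv)  # volume dominated by the front, bounded by the reference point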

arshad171 commented 10 months ago

I tried experimenting with the net_arch and buffer_size but to no avail.

Do you suggest tweaking the action noise parameters, policy_noise and noise_clip? Currently, I have left them as defaults.
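In case it helps, this is the kind of change I have in mind (assuming policy_noise and noise_clip can simply be passed to the constructor; I have not checked the exact argument names or defaults against the source):

# hypothetical tweak: more exploration/target-smoothing noise than the defaults
algo = GPILSContinuousAction(
    env=train_env,        # placeholder name; my actual setup is below
    policy_noise=0.4,     # a guess, larger than the usual TD3 default of 0.2
    noise_clip=0.5,
)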

This is how I train the model:

algo = GPILSContinuousAction(
    env=eval_env,
    project_name="mo-nav-err",
    log=True,
    seed=0,
    buffer_size=int(1e6),
)

algo.set_weight_support(WEIGHT_SUPPORTS)

pf = algo.train(
    total_timesteps=int(NUM_EPISODES),
    eval_env=eval_env,
    ref_point=ref_point,
    known_pareto_front=None,
)

And my multi-objective environment is just a wrapper around the single-objective env:

import gymnasium
import numpy as np
from gymnasium.spaces import Box
from gymnasium.utils import EzPickle

# NavigationAviaryErr is my custom single-objective quadrotor env (import omitted)

class MONavigationAviaryErr(NavigationAviaryErr, EzPickle):
    def __init__(self, **kwargs):
        # store render_mode in the metadata and discard it, so it is not
        # passed to the underlying env
        if kwargs.get("render_mode"):
            self.metadata["render_modes"] = [kwargs.pop("render_mode")]

        super().__init__(**kwargs)
        EzPickle.__init__(self, **kwargs)

        self.reward_space = Box(low=-np.inf, high=np.inf, shape=(2,))
        self.reward_dim = 2

        # redefine spaces using the "gymnasium" API instead of "gym"
        # (required for mo-gymnasium)
        self.observation_space = gymnasium.spaces.Box(
            low=self.observation_space.low,
            high=self.observation_space.high,
            shape=self.observation_space.shape,
        )
        self.action_space = gymnasium.spaces.Box(
            low=self.action_space.low,
            high=self.action_space.high,
            shape=self.action_space.shape,
        )

    def step(self, action):
        # the underlying env follows the old gym API with a single `done` flag
        observation, reward, done, info = super().step(action)
        vec_reward = np.array([info["nav_rew"], info["err_rew"]])

        # map `done` to both `terminated` and `truncated` for gymnasium
        return observation, vec_reward, done, done, info

    def reset(self, **kwargs):
        self._resetLastError()
        obs = super().reset()

        # old gym reset() returns only obs; gymnasium expects (obs, info)
        return obs, {}
LucasAlegre commented 10 months ago

Adding more policy_noise might help.

But I noticed something: are you passing the same instance of your environment to both env and eval_env? You must pass a separate copy of your env as eval_env, because the eval_env is reset in the middle of episodes to evaluate the agent, which might otherwise affect training. Something like:

env = make_your_env()
eval_env = make_your_env()
algo = GPILSContinuousAction(env, ...)
algo.train(eval_env=eval_env,...)
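If constructing the environment twice is expensive, a deep copy of a freshly created env should also work (assuming your env can be deep-copied; this is just a sketch, not something the library requires):

import copy

env = make_your_env()
eval_env = copy.deepcopy(env)  # independent instance, so evaluation resets
                               # do not interfere with the training episode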
arshad171 commented 10 months ago

> Are you passing the same instance of your environment to both env and eval_env? You must pass a separate copy of your env as eval_env, because the eval_env is reset in the middle of episodes to evaluate the agent, which might otherwise affect training.

Oh, I didn't realize that. Yes, I had passed the same env instance for both training and evaluation.

I tried creating two env instances as you suggested, but the results still look similar, albeit not as stagnant as before.

I will also try increasing policy_noise.

[plot attached]
ffelten commented 10 months ago

@arshad171 how is the progress on this?

arshad171 commented 10 months ago

@LucasAlegre, @ffelten apologies for the lack of response. I have had no luck with this. I have tried everything I could think of to get convergence, but nothing seems to work. I also tried to significantly simplify the task (for example, by initialising the agent close to its destination).

None of it seems to help. Here are the results: the ideal reward is 0 (in meters), so $-68$ is quite high even though the agent is initialised close to its destination.

[plots attached]

You can find the code here, if you would like to have a look.

I would really appreciate your help!

LucasAlegre commented 9 months ago

Hi @arshad171,

Just a small comment: "eval/discounted_vec_0" is the discounted sum of the episode rewards. Hence, it is impossible to reach a total reward of 0, since the agent receives a penalty until it reaches the goal. Can you render your environment to see if the agent is moving toward the goal?
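To illustrate the point about the discounted sum: with a constant $-1$ penalty per step (instead of your distance-based one) and, say, $\gamma = 0.99$, an episode that takes $T$ steps to reach the goal has discounted return $-\frac{1 - \gamma^T}{1 - \gamma}$, which is far from 0 even for moderate $T$:

gamma = 0.99
for T in (10, 100, 500):
    ret = sum(gamma**t * -1.0 for t in range(T))
    print(T, round(ret, 1))  # roughly: 10 -> -9.6, 100 -> -63.4, 500 -> -99.3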

arshad171 commented 8 months ago

Hi @LucasAlegre, thank you for your reply.

I agree that the discounted reward can never be exactly 0. But I would expect it to be close to 0, similar to the single-objective RL case (https://github.com/LucasAlegre/morl-baselines/issues/71#issuecomment-1766112075), where the reward converged to -0.2. In the context of the problem, the reward directly relates to the distance (in meters) from the goal, so a value such as 34 or 64 (what I observed after training) is absurd.

> Can you render your environment to see if the agent is moving toward the goal?

Unfortunately, I have a few compatibility issues between gym, gymnasium, and the custom env, so the graphics don't render. However, I did manage to log the coordinates of the quadrotor and plot the trajectory (figure below). The task for each of the agents is to start off at $[2, 0, 0]$ and hover at $[0, 0, 1]$, but neither of the agents seems to accomplish it. Agent 0 had the reward weights set to $[1, 0]$, essentially a single-objective RL task, and agent 1 had a tiny fraction of the second objective, $[0.75, 0.25]$.

[trajectory plot attached]