LucasAlegre / morl-baselines

Multi-Objective Reinforcement Learning algorithm implementations.
https://lucasalegre.github.io/morl-baselines
MIT License

Help Regarding Interpreting PGMORL Convergence #67

Closed arshad171 closed 1 year ago

arshad171 commented 1 year ago

Hi,

I am a novice in the multi-objective RL realm, although I have quite a bit of experience working with single-objective RL.

I started off with single-objective/regular RL on one of my projects involving a drone performing a specific task. I was using PPO (single-objective, from stable_baselines3), and after a few experiments the algorithm converged decently with the ESR (Expected Scalarized Return) approach, i.e. first scalarize the returns (weighted sum) and then take the expectation.
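For reference, here is a minimal sketch of what I mean by ESR; the function name and the numbers are purely illustrative, not my actual setup:

```python
import numpy as np

def estimate_esr(vector_returns, weights):
    """Estimate the Expected Scalarized Return (ESR): scalarize each
    episode's vector return first (here with a weighted sum), then
    take the expectation over episodes."""
    vector_returns = np.asarray(vector_returns, dtype=np.float64)
    weights = np.asarray(weights, dtype=np.float64)
    scalarized = vector_returns @ weights  # scalarize per episode
    return scalarized.mean()               # then the expectation

# Example: three episodes, two objectives, equal weights
print(estimate_esr([[10.0, -2.0], [12.0, -3.0], [9.0, -1.5]], [0.5, 0.5]))
```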

Then, I was curious to try out multi-objective RL, as it made sense to tune one of the objectives in my reward function. So, I converted my environment to a multi-objective one by simply extending the base environment and redefining the reward function. However, when training an array of agents using PGMORL, I observed that none of the agents managed to converge on either objective, even after training for a really long time (1e7 timesteps). The entropy graph looked startling to me: the policy entropy keeps going up and down, when ideally it should be decreasing towards some lower value. See the entropy graph below. This is just one of the individuals, though it was the same for all the learned policies.
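Roughly, the conversion looked like the sketch below; the objective names and info keys are placeholders rather than my actual drone environment:

```python
import numpy as np
import gymnasium as gym

class VectorRewardWrapper(gym.Wrapper):
    """Turn a single-objective env into a multi-objective one by
    returning a vector reward. The two components below (a task reward
    and a control penalty read from `info`) are placeholders for
    whatever the real environment computes."""

    def __init__(self, env, num_objectives=2):
        super().__init__(env)
        # MO algorithms typically look for a reward space / reward dim
        self.reward_space = gym.spaces.Box(
            low=-np.inf, high=np.inf, shape=(num_objectives,), dtype=np.float32
        )

    def step(self, action):
        obs, _, terminated, truncated, info = self.env.step(action)
        task_reward = info.get("task_reward", 0.0)           # placeholder objective 1
        control_penalty = info.get("control_penalty", 0.0)   # placeholder objective 2
        vec_reward = np.array([task_reward, control_penalty], dtype=np.float32)
        return obs, vec_reward, terminated, truncated, info
```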

I figured it could be an underfitting scenario and tried expanding the network architectures, but it did not help.

[Figure: policy entropy over training for one of the individuals]

I then reverted to testing one of the examples provided with PGMORL to observe the results. I ran the halfcheetah example as follows (default parameters listed here: #43 for PGMORL halfcheetah), except that I changed the origin to [-0, -0]:

python benchmark/launch_experiment.py --algo pgmorl --env-id mo-halfcheetah-v4 --num-timesteps 5000000 --gamma 0.99 --ref-point -0 -0 --auto-tag True --seed 0 --init-hyperparams "project_name:'mo-halfcheetah'"

And the results I observed are quite similar. The entropy loss keeps fluctuating back and forth.

[Figure: entropy loss for the mo-halfcheetah-v4 run]

[More results for halfcheetah (images omitted)]

Do these results imply that the algorithm is unable to converge? Or should I just run the training for even longer?

Thanks, Arshad

ffelten commented 1 year ago

Hello Arshad,

Pretty cool to see even more people trying out MORL :-).

Do these results imply that the algorithm is unable to converge? Or should I just run the training for even longer?

In MORL in general, I usually look first at the multi-objective metrics rather than the training metrics. These are contained in the eval/ panels, e.g. eval/hypervolume. If you see the hypervolume going up, it is a good sign that you are finding new policies on the Pareto front.
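To make that metric concrete, here is a tiny 2D sketch of what the hypervolume indicator measures; it is purely illustrative, morl-baselines computes it for you and logs it under eval/hypervolume:

```python
def hypervolume_2d(points, ref_point):
    """Hypervolume of a 2D front w.r.t. a reference point, assuming both
    objectives are maximized and every point dominates the reference."""
    pts = sorted(points, key=lambda p: p[0], reverse=True)  # sort by obj 1, descending
    hv, prev_y = 0.0, ref_point[1]
    for x, y in pts:
        if y > prev_y:                              # skip dominated points
            hv += (x - ref_point[0]) * (y - prev_y)
            prev_y = y
    return hv

# Two policies on the front, reference point at the origin
print(hypervolume_2d([(3.0, 1.0), (1.0, 3.0)], ref_point=(0.0, 0.0)))  # -> 5.0
```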

Regarding what you see in the entropy, it is kind of normal for PGMORL; at each iteration, the algorithm chooses a snapshot model and a new weight vector to train on for each worker (we train pop_size workers per iteration). Thus, you will not see a smooth entropy going down since the workers start from "new" snapshots every time.
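Very roughly, the outer loop has the shape of the toy sketch below. This is not the actual implementation (the real algorithm uses prediction-guided task selection and a Pareto archive, and trains real PPO policies); it only illustrates why per-worker training curves restart instead of decreasing smoothly:

```python
import random

def train_worker(theta, weights, steps=10):
    """Stand-in for a PPO training phase: nudge a 1-D 'policy parameter'
    toward an optimum that depends on the chosen weight vector."""
    target = 2.0 * weights[0] - 1.0 * weights[1]
    for _ in range(steps):
        theta += 0.1 * (target - theta)
    return theta

population = [0.0]   # snapshots; here each 'policy' is just a scalar parameter
pop_size = 3

for iteration in range(5):
    new_snapshots = []
    for _ in range(pop_size):
        snapshot = random.choice(population)   # pick a snapshot model...
        w = random.random()
        weights = (w, 1.0 - w)                 # ...and a new weight vector
        new_snapshots.append(train_worker(snapshot, weights))
    population.extend(new_snapshots)           # keep the trained workers
    print(f"iteration {iteration}: population size = {len(population)}")
```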

except that I changed the origin to [-0, -0]

I would be careful with this: halfcheetah contains a negative objective (the energy consumption, to be minimized), while PGMORL makes the assumption that both objectives live in the positive quadrant, i.e. all objectives are positive. This is why we have the origin in the first place.
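If you still want to use PGMORL on such an environment, one possible workaround is to shift the negative objective into the positive range with a small wrapper, e.g. something like the sketch below; the offsets are made up and would have to be chosen from the environment's actual reward bounds:

```python
import numpy as np
import gymnasium as gym

class ShiftRewardWrapper(gym.RewardWrapper):
    """Shift each reward component by a constant offset so that all
    objectives stay in the positive range PGMORL expects. The offsets
    are placeholders; note that shifting rewards also changes the scale
    of the returns, so the origin/reference point must be adjusted too."""

    def __init__(self, env, offsets):
        super().__init__(env)
        self.offsets = np.asarray(offsets, dtype=np.float32)

    def reward(self, reward):
        return np.asarray(reward, dtype=np.float32) + self.offsets

# Example usage (offsets are made up):
# import mo_gymnasium as mo_gym
# env = ShiftRewardWrapper(mo_gym.make("mo-halfcheetah-v4"), offsets=[0.0, 5.0])
```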

Now, for PGMORL itself: the algorithm is based on PPO, which is usually less sample efficient than newer kinds of algorithms such as SAC or TD3. This means that PGMORL can be quite sample inefficient. Moreover, PPO is known to be sensitive to hyperparameters, so you might need to fine-tune these as well. I would suggest relying on CAPQL or GPI-LS, which are generally more sample efficient and more robust in continuous settings.

arshad171 commented 1 year ago

@ffelten Thank you for the quick response!

Regarding what you see in the entropy, it is kind of normal for PGMORL; at each iteration, the algorithm chooses a snapshot model and a new weight vector to train on for each worker (we train pop_size workers per iteration). Thus, you will not see a smooth entropy going down since the workers start from "new" snapshots every time.

This makes sense!

except that I changed the origin to [-0, -0]

I would be careful with this: halfcheetah contains a negative objective (the energy consumption, to be minimized), while PGMORL makes the assumption that both objectives live in the positive quadrant, i.e. all objectives are positive. This is why we have the origin in the first place.

This was the underlying issue! One of the objectives had a negative reward, both in halfcheetah and in my drone environment.