LucasAlegre / morl-baselines

Implementations of Multi-Objective Reinforcement Learning algorithms.
https://lucasalegre.github.io/morl-baselines
MIT License

Why is PCN only working for deterministic envs? #92

Closed MikhailTerekhov closed 4 months ago

MikhailTerekhov commented 7 months ago

Hi,

I'm benchmarking some of the algorithms in this repository, and I noticed that the README mentions that the current PCN implementation only works for environments with deterministic transitions. However, I don't see an issue with the code that would make it unsuitable for stochastic envs. If there is still such an issue, what would be a fix for it?

The only thing that I can think of is evaluation: if the transitions are stochastic, then we have to sample multiple rollouts to find the average reward for each policy and better approximate metrics such as hypervolume. Was that what you had in mind?
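For concreteness, here is a minimal sketch of that evaluation idea (assuming a Gymnasium-style environment with a vector-valued reward; `policy_fn` is a placeholder, not a morl-baselines API):

```python
import numpy as np

def estimate_expected_return(env, policy_fn, n_episodes=32, gamma=1.0):
    """Average the vector return of `policy_fn` over several rollouts."""
    returns = []
    for _ in range(n_episodes):
        obs, _ = env.reset()
        done, discount, ep_return = False, 1.0, None
        while not done:
            action = policy_fn(obs)
            obs, reward, terminated, truncated, _ = env.step(action)
            reward = np.asarray(reward, dtype=np.float64)  # vector reward
            ep_return = discount * reward if ep_return is None else ep_return + discount * reward
            discount *= gamma
            done = terminated or truncated
        returns.append(ep_return)
    # In a stochastic env a single episode is only a noisy sample of the return
    # vector, so metrics like hypervolume computed from one rollout can be
    # misleading; averaging over rollouts reduces that noise.
    return np.mean(returns, axis=0)
```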

Thanks!

LucasAlegre commented 7 months ago

Hi @MikhailTerekhov !

You can run the PCN implementation in stochastic environments without errors, and it may even produce reasonable policies. However, the method is designed under the assumption of deterministic state transitions, which may prevent it from finding good policies in stochastic environments.

Intuitively, PCN is conditioned on a desired return vector and tries to predict the actions that lead to this desired return. However, if the environment is stochastic, the agent will obtain a different return in each episode, and PCN does not take this into account when trying to predict the action most likely to result in the desired return.
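To make the intuition concrete, here is a minimal sketch of a return-conditioned policy in PyTorch (the class name, layer sizes, and inputs are illustrative, not the actual morl-baselines PCN model):

```python
import torch
import torch.nn as nn

class ReturnConditionedPolicy(nn.Module):
    """Predicts action logits from (observation, desired return, desired horizon)."""

    def __init__(self, obs_dim, reward_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + reward_dim + 1, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, obs, desired_return, desired_horizon):
        x = torch.cat([obs, desired_return, desired_horizon], dim=-1)
        return self.net(x)

# Training is essentially supervised: for each stored transition, the network is
# asked to reproduce the action that was taken when the remaining return and
# horizon were those actually observed in that single trajectory. In a stochastic
# environment those targets come from one realization of the randomness, which is
# the issue described above.
# loss = torch.nn.functional.cross_entropy(logits, taken_actions)
```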

@mathieu-reymond (the author of PCN) may correct me if what I said is not precise :)

MikhailTerekhov commented 7 months ago

Thank you for the quick response!

In principle, one could argue that if the environment transitions are stochastic, reward conditioning simply learns to act based on the provided expected return, and the training data with a random return in each episode is just noisy, much like training data often is in supervised learning. I'd be curious to hear what @mathieu-reymond thinks about this too!

mathieu-reymond commented 4 months ago

Wow it seems I completely missed this, sorry about that.

PCN keeps a replay buffer based on trajectory performance. If you have a stochastic environment, you could get a lucky trajectory with a (rarely occurring) high return. This trajectory will be kept in the buffer, while other trajectories (more probable but with lower returns) will be removed.
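A simplified sketch of such a performance-ranked buffer (scalar-return ranking with made-up names; the actual PCN buffer ranks full multi-objective returns):

```python
import heapq

class ReturnRankedBuffer:
    """Keeps only the `capacity` trajectories with the highest episodic return."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._heap = []      # min-heap of (return, counter, trajectory)
        self._counter = 0    # tie-breaker so trajectories are never compared

    def add(self, trajectory, episodic_return):
        item = (episodic_return, self._counter, trajectory)
        self._counter += 1
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, item)
        else:
            # Evict the lowest-return trajectory currently stored. A rare lucky
            # high-return trajectory therefore survives, pushing out the more
            # typical (but in expectation better) ones.
            heapq.heappushpop(self._heap, item)

    def trajectories(self):
        return [t for _, _, t in self._heap]
```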

Here is a simple example:

(s1, a1) -> (s2, r=1)
(s1, a2) -> (s3, r=5) with probability 0.1; (s4, r=0) with probability 0.9

Transition (s1, a2, s3, r=5) will be kept in the buffer, and the conditional network will be trained to execute a2 in s1. But in expectation (0.1 × 5 = 0.5 < 1), executing a2 in s1 is actually worse than executing a1.
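A quick numerical check of this example (illustrative script only, not PCN code):

```python
import numpy as np

rng = np.random.default_rng(0)

def rollout(action):
    if action == "a1":
        return 1.0                      # (s1, a1) -> (s2, r=1)
    # (s1, a2) -> (s3, r=5) with prob 0.1; (s4, r=0) with prob 0.9
    return 5.0 if rng.random() < 0.1 else 0.0

returns = {a: [rollout(a) for _ in range(10_000)] for a in ("a1", "a2")}
print({a: round(float(np.mean(r)), 2) for a, r in returns.items()})
# expected returns are roughly {'a1': 1.0, 'a2': 0.5}

# A buffer that ranks trajectories by achieved return keeps the rare r=5
# outcome of a2, so the conditioned network is pushed toward a2 in s1 even
# though its expected return (0.5) is lower than a1's (1.0).
best_return, best_action = max((r, a) for a in ("a1", "a2") for r in returns[a])
print(best_return, best_action)  # 5.0 a2 (with overwhelming probability)
```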

So the main difference with supervised learning here is that your dataset changes over time.

In practice, we have experimented with stochastic MDPs, and if the stochasticity is limited (e.g., noise on the actuators), PCN performs fine. But in cases such as the one above (or, similarly, in MDPs with unlikely but catastrophic outcomes), PCN will not learn the optimal policy in terms of expected returns.

MikhailTerekhov commented 4 months ago

Thank you for the explanation, it's much clearer!