LucasAlegre / morl-baselines

Multi-Objective Reinforcement Learning algorithms implementations.
https://lucasalegre.github.io/morl-baselines
MIT License

Use deterministic policies when evaluating PCN #75

Closed · vaidas-sl closed this 10 months ago

vaidas-sl commented 10 months ago

According to the PCN paper, "at execution time — i.e., after the training process — we use a deterministic policy by systematically selecting the action with the highest confidence". However, the current implementation always uses sampled actions.
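
For context, here is a minimal sketch of the two action-selection modes being contrasted. This is illustrative PyTorch-style code, not the actual morl-baselines implementation; `logits` stands in for the output of a PCN-style network conditioned on the observation, desired return, and desired horizon:

```python
import torch

def choose_action(logits: torch.Tensor, deterministic: bool) -> int:
    """Pick an action from per-action logits (names illustrative)."""
    probs = torch.softmax(logits, dim=-1)
    if deterministic:
        # Execution-time policy from the paper: systematically take
        # the action with the highest confidence.
        return int(probs.argmax(dim=-1).item())
    # Training-time behavior: sample an action from the distribution.
    return int(torch.multinomial(probs, num_samples=1).item())
```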

ffelten commented 10 months ago

I'm letting Lucas review this since he's more knowledgeable about this algorithm. It makes sense to me, but it surely impacts the reported performance.

LucasAlegre commented 10 months ago

Hi @vaidas-sl, thanks for the PR!

Although in the original PCN code (https://github.com/mathieu-reymond/pareto-conditioned-networks/blob/main/pcn/pcn.py#L186C69-L186C69) the evaluation during training uses the stochastic policy, they evaluate using a deterministic policy after training.

I think it makes sense that the evaluation results we report are obtained with the deterministic policy, as that is the policy that matters at the end of training.
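
For illustration, a rough sketch of how a deterministic toggle could be threaded through an evaluation loop. This is a Gymnasium-style loop under assumed names; `agent.act` and all parameters here are hypothetical, not the actual morl-baselines API:

```python
def evaluate_episode(agent, env, desired_return, desired_horizon,
                     deterministic=True):
    """Roll out one episode; `deterministic=True` uses argmax actions."""
    obs, _ = env.reset()
    episode_return = 0.0  # a reward vector in MO envs; scalar for brevity
    done = False
    while not done:
        # Hypothetical agent interface conditioned on the desired
        # return and horizon, as in PCN.
        action = agent.act(obs, desired_return, desired_horizon,
                           deterministic=deterministic)
        obs, reward, terminated, truncated, _ = env.step(action)
        episode_return += reward
        done = terminated or truncated
    return episode_return
```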