DLR-RM / stable-baselines3

PyTorch version of Stable Baselines, reliable implementations of reinforcement learning algorithms.
https://stable-baselines3.readthedocs.io
MIT License

High reward in training and low reward in evaluation #1063

Closed Entongsu closed 2 years ago

Entongsu commented 2 years ago

Question

When I use Stable Baselines3 with my custom environment, I have found that even though the reward during training is pretty high, the reward during evaluation is low. I am not sure why this happens.

Additional context

For example, the mean reward during training is about 1500, but during evaluation the mean reward is only 400 or lower. I have tried random seeds 100 and 500, but I still get the same result. In my environment, there are ten workers and the bs is 500.

Checklist

I have checked the Stable Baselines documentation and done some hyperparameter fine-tuning. I have checked that there are no similar issues.

qgallouedec commented 2 years ago

What is your model? Have you tried with stochastic evaluation?

from stable_baselines3.common.evaluation import evaluate_policy

evaluate_policy(model, env, deterministic=False)
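
For reference, a minimal sketch comparing the two evaluation modes (assuming a trained model and a single evaluation env):

from stable_baselines3.common.evaluation import evaluate_policy

# Deterministic evaluation: the policy always returns the mode of its action distribution.
mean_det, std_det = evaluate_policy(model, env, n_eval_episodes=20, deterministic=True)

# Stochastic evaluation: actions are sampled from the distribution, as during training.
mean_sto, std_sto = evaluate_policy(model, env, n_eval_episodes=20, deterministic=False)

print(f"deterministic: {mean_det:.1f} +/- {std_det:.1f}")
print(f"stochastic:    {mean_sto:.1f} +/- {std_sto:.1f}")
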
Entongsu commented 2 years ago

Thank you for your reply. I used PPO and set deterministic=False while training; in my previous evaluation, I used deterministic evaluation. I have found that if I set deterministic=False, the problem is resolved, but I am still confused why the deterministic setting has such a big effect on the result. According to https://stable-baselines3.readthedocs.io/en/master/guide/rl_tips.html?highlight=deterministic%20evaluation#how-to-evaluate-an-rl-algorithm, setting deterministic=True should give better performance, yet with it the reward is pretty low and the performance is not good enough. Could you give me some hints on this? Thank you.

qgallouedec commented 2 years ago

It's hard to say without knowing your environment, but the most likely explanation is that your environment is partially observable. Under certain conditions in a POMDP, a stochastic agent can achieve better results than a deterministic one (I don't have a reference in mind; if someone has one, I'm willing to take it).

To give you an intuition, imagine an agent who has to get out of a maze. To get out, it has to go left at the first intersection, then right at the second. Suppose that it is getting dark and from the agent's point of view, it is not possible to distinguish these two intersections. So, a deterministic agent will choose either Left-Left or Right-Right, and will never get out of the maze. Conversely, a stochastic agent can choose Left-Right and get out.
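
A quick back-of-the-envelope sketch of this maze example (purely illustrative, not SB3 code):

# Two indistinguishable intersections; escaping requires Left at the first, then Right at the second.
# A memoryless policy must apply the same rule at both intersections.

# Deterministic: always Left or always Right -> never produces Left-then-Right.
p_escape_deterministic = 0.0

# Stochastic: go Left with probability p at each intersection.
# Escape probability is p * (1 - p), maximised at p = 0.5.
def p_escape_stochastic(p: float) -> float:
    return p * (1.0 - p)

print(p_escape_deterministic)    # 0.0
print(p_escape_stochastic(0.5))  # 0.25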

fmalato commented 1 year ago

Hi, I've got the same problem with a fully observable custom environment. As far as I can tell, this happens only with actor-critic algorithms, and it's due to the fact that when using a MultiInputActorCriticPolicy, if deterministic=False we sample from the action distribution, while if deterministic=True the mode of the distribution is returned.

For some reason, no matter what you do, if at test time deterministic=True, the distribution is always a Gaussian with mean ≈ 0 (say, in the [-0.001, 0.001] range) and it basically does not vary. Hence, the actions are inconsistent with those sampled with deterministic=False.

I'm still investigating this though, so as I find new hints I will edit this comment.
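
For anyone debugging this, here is a sketch of how one might inspect the learned Gaussian at a given observation (assuming a recent SB3 version, a continuous-action on-policy model such as PPO, and obs being a single observation from the environment):

import torch

# Convert the observation to a batched tensor on the policy's device.
obs_tensor, _ = model.policy.obs_to_tensor(obs)

with torch.no_grad():
    dist = model.policy.get_distribution(obs_tensor)
    # Mean of the Gaussian: this is what deterministic=True returns (the mode).
    print("mean:", dist.distribution.mean)
    # Standard deviation: the spread that deterministic=False samples from.
    print("std: ", dist.distribution.stddev)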

Entongsu commented 1 year ago

I wonder whether it is possible to add a penalty on the std?
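
As far as I know there is no built-in std penalty, but two related knobs exist; a sketch with illustrative values:

from stable_baselines3 import PPO

# ent_coef weights the entropy bonus in the PPO loss: a positive value rewards a wide
# action distribution, so keeping it at 0 removes that incentive.
# log_std_init (a policy kwarg) sets the initial log standard deviation of the Gaussian,
# so a negative value starts the policy with a narrower distribution.
model = PPO(
    "MlpPolicy",              # use the policy class matching your observation space
    env,                      # your custom environment
    ent_coef=0.0,             # illustrative value
    policy_kwargs=dict(log_std_init=-1.0),  # illustrative value
)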

ReHoss commented 1 year ago

> Hi, I've got the same problem with a fully observable custom environment. As far as I can tell, this happens only with actor-critic algorithms, and it's due to the fact that when using a MultiInputActorCriticPolicy, if deterministic=False we sample from the action distribution, while if deterministic=True the mode of the distribution is returned.
>
> For some reason, no matter what you do, if at test time deterministic=True, the distribution is always a Gaussian with mean ≈ 0 (say, in the [-0.001, 0.001] range) and it basically does not vary. Hence, the actions are inconsistent with those sampled with deterministic=False.
>
> I'm still investigating this though, so as I find new hints I will edit this comment.

The behavior policies you learn during training are stochastic (e.g. to explore the MDP). Setting deterministic=True at inference time acts as if your policy were a Dirac measure at the mode (hence the mode is the only value that can be sampled).

> It's hard to say without knowing your environment, but the most likely explanation is that your environment is partially observable. Under certain conditions in a POMDP, a stochastic agent can achieve better results than a deterministic one (I don't have a reference in mind; if someone has one, I'm willing to take it).
>
> To give you an intuition, imagine an agent who has to get out of a maze. To get out, it has to go left at the first intersection, then right at the second. Suppose that it is getting dark and from the agent's point of view, it is not possible to distinguish these two intersections. So, a deterministic agent will choose either Left-Left or Right-Right, and will never get out of the maze. Conversely, a stochastic agent can choose Left-Right and get out.

While browsing issues to find a discussion of this topic, I'd like to add a word to this accurate answer: there is a mathematical result which states the following (see, for instance, O. Sigaud, O. Buffet, Markov Decision Processes in Artificial Intelligence (2010)):

There exist POMDPs for which the best stochastic adapted policy can be arbitrarily better than the best deterministic adapted policy.
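
To connect this to the maze intuition above: the best memoryless deterministic policy escapes with probability 0, while the stochastic one that goes left with probability 1/2 escapes with probability 1/4, so scaling up the escape reward makes the gap between the two arbitrarily large.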