What is your model? Have you tried with stochastic evaluation?
```python
from stable_baselines3.common.evaluation import evaluate_policy

evaluate_policy(model, env, deterministic=False)
```
Thank you for your reply. I used PPO for these experiments and set `deterministic=False` while training. In the previous evaluation I only used deterministic evaluation, and I have found that if I set `deterministic=False` the problem is resolved. I am still confused why the `deterministic` setting has such a big effect on the result. According to https://stable-baselines3.readthedocs.io/en/master/guide/rl_tips.html?highlight=deterministic%20evaluation#how-to-evaluate-an-rl-algorithm, setting `deterministic=True` should obtain better performance, but the reward is pretty low and the performance is not good enough. Could you give me some hints on this? Thank you.
It's hard to say without knowing your environment, but most likely your environment is partially observable. Under certain conditions in a POMDP, a stochastic agent can achieve better results than a deterministic one (I don't have a reference in mind; if someone has one I'm willing to take it).
To give you an intuition, imagine an agent that has to get out of a maze. To get out, it has to go left at the first intersection, then right at the second. Suppose that it is getting dark and, from the agent's point of view, it is not possible to distinguish the two intersections. A deterministic agent will therefore choose either Left-Left or Right-Right and will never get out of the maze. Conversely, a stochastic agent can choose Left-Right and get out.
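To make that intuition concrete, here is a minimal, self-contained sketch. The tiny maze, the policies, and the numbers are purely illustrative and not part of the original discussion:

```python
import random

def run_episode(policy, max_steps=2):
    """Walk the aliased maze: escaping requires Left at intersection 1, then Right at 2.
    Both intersections produce the exact same observation ("dark intersection")."""
    needed = ["L", "R"]                        # the correct action sequence
    for step in range(max_steps):
        obs = "dark intersection"              # both intersections look identical
        action = policy(obs)
        if action != needed[step]:
            return 0.0                         # wrong turn: stuck in the maze
    return 1.0                                 # escaped

deterministic_policy = lambda obs: "L"                     # same answer to the same observation
stochastic_policy = lambda obs: random.choice(["L", "R"])  # samples an action each time

n = 10_000
print("deterministic success rate:",
      sum(run_episode(deterministic_policy) for _ in range(n)) / n)  # 0.0
print("stochastic success rate:",
      sum(run_episode(stochastic_policy) for _ in range(n)) / n)     # ~0.25
```

The memoryless deterministic policy gives the same answer at both identical-looking intersections and never escapes, while the uniform stochastic policy escapes roughly a quarter of the time.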
Hi, I've got the same problem with a completely observable custom environment. As far as I can tell, this happens only with actor-critic algorithms, and it comes down to the fact that when using a `MultiInputActorCriticPolicy`, if `deterministic=False` we sample from the action distribution, while if `deterministic=True` the mode of the distribution is returned.
For some reason, no matter what you do, at test time with `deterministic=True` the distribution is always a Gaussian with mean ≈ 0 (say in the [-0.001, 0.001] range), and it basically does not vary. Hence, the actions are inconsistent with those sampled with `deterministic=False`.
I'm still investigating this, so as I find new hints I will edit this comment.
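For anyone who wants to check this on their own model, here is a minimal sketch of how the predicted distribution could be inspected directly. It assumes an SB3 version where `BasePolicy.obs_to_tensor` and `ActorCriticPolicy.get_distribution` are available and a diagonal Gaussian action distribution; `model` and `env` are the trained model and environment from the discussion above:

```python
import torch

obs = env.reset()  # with gymnasium-based envs, reset() returns (obs, info) instead
obs_tensor, _ = model.policy.obs_to_tensor(obs)

with torch.no_grad():
    dist = model.policy.get_distribution(obs_tensor)
    # For a diagonal Gaussian policy, the underlying torch distribution exposes mean/std
    print("mean:", dist.distribution.mean)
    print("std: ", dist.distribution.stddev)
    print("mode   (deterministic=True): ", dist.mode())
    print("sample (deterministic=False):", dist.sample())
```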
I wonder whether it is possible to give a penalty on the std?
> [...] if `deterministic=True` then the mode of the distribution is returned. [...] the distribution will always be a Gaussian with mean ≈ 0 [...]
The behavior policies you learn during training are stochastic (e.g. to explore the MDP). Setting `deterministic=True` when doing inference acts as if your policy were a Dirac measure at the mode (hence the mode is the only value sampled).
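In other words, for the diagonal Gaussian policies PPO uses for continuous actions, the mode is simply the mean, and `deterministic=True` skips the sampling step. A tiny illustration with plain `torch.distributions` (the numbers are made up):

```python
import torch
from torch.distributions import Normal

mean = torch.tensor([0.2, -0.5])   # stand-in for the policy network's mean output
std = torch.tensor([0.7, 0.7])     # stand-in for the learned (exponentiated) log-std

pi = Normal(mean, std)
print(pi.sample())   # deterministic=False: a different action on every call
print(pi.mean)       # deterministic=True: always the mode (= mean for a Gaussian)
```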
> [...] Under certain conditions in a POMDP, a stochastic agent can achieve better results than a deterministic one (I don't have a reference in mind; if someone has one I'm willing to take it). [...]
While I was browsing issues to find discussion about this topic, I'd like to add a word on the accurate answer quoted above: there is a mathematical theorem which states the following (see O. Sigaud and O. Buffet, Markov Decision Processes in Artificial Intelligence (2010), for instance):
There exist POMDPs for which the best stochastic adapted policy can be arbitrarily better than the best deterministic adapted policy.
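Stated a bit more formally (notation mine, restricted to memoryless policies that map the current observation to an action or to a distribution over actions), the claim is roughly:

```latex
\forall c > 0,\ \exists \text{ a POMDP } M_c \text{ such that }
\max_{\pi:\, O \to \Delta(A)} V^{\pi}_{M_c}
\;-\;
\max_{\mu:\, O \to A} V^{\mu}_{M_c}
\;>\; c ,
```

i.e. the value gap between the best stochastic and the best deterministic observation-based policy can be made as large as desired.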
Question
When I use Stable Baselines3 with my custom environment, I have found that even though the reward during training is pretty high, the reward during evaluation is low. I am not sure why this happens.
Additional context
For example, the mean reward during training is about 1500, but during evaluation the mean reward is only 400 or lower. I have tried random seeds 100 and 500, but I still get the same result. In my environment, the number of workers is ten and the bs is 500.
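Not part of the original report, but one way to monitor both evaluation modes during training is SB3's `EvalCallback`. The sketch below is only illustrative; the environment id and hyperparameters are placeholders, not the custom environment from this issue:

```python
import gym  # placeholder environment; substitute your custom env
from stable_baselines3 import PPO
from stable_baselines3.common.callbacks import EvalCallback
from stable_baselines3.common.monitor import Monitor

train_env = Monitor(gym.make("Pendulum-v1"))
eval_env = Monitor(gym.make("Pendulum-v1"))

# Two callbacks: one evaluates the mode of the policy, the other samples from it.
det_eval = EvalCallback(eval_env, n_eval_episodes=10, eval_freq=5_000,
                        deterministic=True, log_path="./eval_det")
stoch_eval = EvalCallback(eval_env, n_eval_episodes=10, eval_freq=5_000,
                          deterministic=False, log_path="./eval_stoch")

model = PPO("MlpPolicy", train_env, verbose=1)
model.learn(total_timesteps=100_000, callback=[det_eval, stoch_eval])
```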
Checklist
I have checked the Stable Baselines documentation and done some hyperparameter fine-tuning. I have checked that there are no similar issues.