Closed xlnwel closed 4 years ago
It's just a different name for discount, the probability of the episode continuing. For episodic tasks, the agent needs to learn this to take it into account during imagination training.
Hi,
Thanks for answering, but I still want to know if pcont is useful in practice? I noticed you did not use it by default.
Best,
Sherwin
Some environments always have a discount factor of 1, others have a discount factor of 0 at the terminal step. In the first case, the agent learns as if the episode goes on indefinitely. In the latter case, the agent tries to achieve goals before the discount factor goes to zero.
The DeepMind Control Suite environments end due to time limit, so their discount factor is always 1. Thus, an agent can learn as if it is looking at a smaller part of an infinitely long episode. For example, a Q-learning agent would always bootstrap against the future Q-value.
Some other environments have episodes that terminate before the time limit. For example, the agent might die. This means the discount factor is 0 at the end of the episode, and the agent should treat it as a true termination. For example, a Q-learning agent would bootstrap against zero at those steps.
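The distinction above can be sketched with a one-step Q-learning target. This is a minimal illustration, not code from the Dreamer repository; the function name is made up for the example:

```python
def q_target(reward, next_q, discount):
    """One-step Q-learning target.

    `discount` is 1.0 for a normal step (bootstrap against the next
    state's value) and 0.0 at a true terminal step (bootstrap against
    zero, since there is no future).
    """
    return reward + discount * next_q

# Mid-episode step: the future value contributes to the target.
print(q_target(1.0, 10.0, 1.0))  # 11.0
# Terminal step (e.g. the agent died): only the reward remains.
print(q_target(1.0, 10.0, 0.0))  # 1.0
```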
The Dreamer code base supports episodes with such early termination by predicting the discount factor, which is called pcont for "probability of continuing". It is disabled by default because the defaults target the DeepMind Control Suite environments, but it can be useful for other tasks.
I see. Thank you for the detailed explanation:-)
Hi @danijar,
Why do you multiply the pcont by the discount factor at this line?
This is to translate from the discount factor of the environment to the discount factor of the agent. The environment simply reports the probability of the episode continuing, which is typically one for normal time steps and zero for episode ends. However, many RL algorithms including Dreamer use additional discounting to reduce the variance of value estimates. Multiplying by the agent's discount factor of 0.99 turns the ones and zeros from the environment into 0.99s and zeros for learning by the agent.
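A tiny sketch of that translation, with a hypothetical helper name chosen for illustration:

```python
def pcont_target(env_discount, agent_discount=0.99):
    """Target for the discount predictor: the environment's continuation
    signal (1.0 mid-episode, 0.0 at a true terminal step) scaled by the
    agent's discount factor."""
    return env_discount * agent_discount

print(pcont_target(1.0))  # 0.99 for a normal time step
print(pcont_target(0.0))  # 0.0 at an episode end
```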
Hi @danijar
Sorry, I still don't quite get it. Shouldn't the discount factor be applied to the next time step? As far as I understand, pcont_pred predicts the probability of continuing at the current step, so why should it be multiplied by the discount factor?
I don't know what you mean by applied. The line you're pointing to computes the target for the discount predictor. Moreover, config.discount is a scalar, so it's the same at every time step.
Another implementation would have been to train a binary classifier to predict the episode ends with hard labels and later multiply its predictions by 0.99 when computing returns.
I chose to train the classifier with soft labels of 0 and 0.99 so they already include the agent's discount factor. Thus, there is no further discounting when using the predictions to compute returns:
https://github.com/danijar/dreamer/blob/1e38b1d80e11335883108720e671d35cc34de96e/dreamer.py#L182-L185.
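As a simplified sketch (not the lambda-return code from the linked lines), a bootstrapped return can use the predicted soft-label pcont directly as the per-step discount, with no additional 0.99 factor:

```python
def discounted_return(rewards, pconts, bootstrap):
    """Compute bootstrapped returns backwards in time.

    `pconts` are the discount predictor's outputs, which already
    include the agent's discount factor (soft labels of 0 and 0.99),
    so no further discounting is applied here.
    """
    ret = bootstrap
    returns = [0.0] * len(rewards)
    for t in reversed(range(len(rewards))):
        ret = rewards[t] + pconts[t] * ret
        returns[t] = ret
    return returns

# Two normal steps followed by no bootstrap value.
print(discounted_return([1.0, 1.0], [0.99, 0.99], 0.0))  # [1.99, 1.0]
```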
Hi
I see. That makes sense. I previously thought disc_pred predicted whether the episode would continue at the same time step as the obs, which is why the discount factor confused me. With your explanation and the linked code, everything makes sense. Thank you so much!
Hi,
Thanks for your great work. May I ask what pcont is in your code? I cannot relate it to anything in the paper.
Thanks again.
Sherwin