danijar / dreamer

Dream to Control: Learning Behaviors by Latent Imagination
https://danijar.com/dreamer
MIT License
507 stars 109 forks

What is the meaning of pcont? #2

Closed xlnwel closed 4 years ago

xlnwel commented 4 years ago

Hi,

Thanks for your great work. May I ask what is pcont in your code? I cannot relate it to anything in the paper.

Thanks again.

Sherwin

danijar commented 4 years ago

It's just a different name for discount, the probability of the episode continuing. For episodic tasks, the agent needs to learn this to take it into account during imagination training.

xlnwel commented 4 years ago

Hi,

Thanks for answering but I still want to know if pcont is useful in practice? I noticed you did not use it by default.

Best,

Sherwin

danijar commented 4 years ago

Some environments always have a discount factor of 1, others have a discount factor of 0 at the terminal step. In the first case, the agent learns as if the episode goes on indefinitely. In the latter case, the agent tries to achieve goals before the discount factor goes to zero.

The DeepMind Control Suite environments end due to time limit, so their discount factor is always 1. Thus, an agent can learn as if it is looking at a smaller part of an infinitely long episode. For example, a Q-learning agent would always bootstrap against the future Q-value.

Some other environments have episodes that terminate before the time limit. For example, the agent might die. This means the discount factor is 0 at the episode end, and the agent should treat it as a termination. For example, a Q-learning agent would bootstrap against zero at those steps.
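A minimal sketch of the bootstrapping difference described above (the function and variable names are illustrative, not from the repo):

```python
def q_target(reward, discount, next_q):
    """One-step Q-learning target: bootstrap against the next state's
    value, scaled by the discount reported for this transition."""
    return reward + discount * next_q

# Normal step: discount 1 keeps the bootstrap term.
assert q_target(1.0, 1.0, 5.0) == 6.0

# Terminal step (e.g. the agent died): discount 0 removes the
# bootstrap term, so the target is just the reward.
assert q_target(1.0, 0.0, 5.0) == 1.0
```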

The Dreamer code base supports episodes with such early termination by predicting the discount factor, which is called pcont for "probability of continuing". It is disabled by default because the defaults target the DeepMind Control Suite environments. However, it might be useful for other tasks.

xlnwel commented 4 years ago

I see. Thank you for the detailed explanation:-)

xlnwel commented 4 years ago

Hi @danijar,

Why do you multiply the pcont by the discount factor at this line?

danijar commented 4 years ago

This is to translate from the discount factor of the environment to the discount factor of the agent. The environment simply reports the probability of the episode continuing, which is typically one for normal time steps and zero for episode ends. However, many RL algorithms including Dreamer use additional discounting to reduce the variance of value estimates. Multiplying by the agent's discount factor of 0.99 turns the ones and zeros from the environment into 0.99s and zeros for learning by the agent.
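Roughly, that line computes the training target for the discount predictor like this (a sketch with illustrative names, not the actual repo code):

```python
# The agent's discount factor used for variance reduction.
agent_discount = 0.99

# The environment reports the probability of continuing:
# 1 for normal steps, 0 at the episode end.
env_pcont = [1.0, 1.0, 1.0, 0.0]

# Multiplying folds the agent's discounting into the labels,
# turning the ones and zeros into 0.99s and zeros.
targets = [agent_discount * p for p in env_pcont]
```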

xlnwel commented 4 years ago

Hi @danijar

Sorry, I still haven't got the idea. Shouldn't the discount factor be applied to the next time step? As far as I understand, pcont_pred predicts the probability of continuing at the current step, so why should it be multiplied by the discount factor?

danijar commented 4 years ago

I don't know what you mean by applied. The line you're pointing to computes the target for the discount predictor. Moreover, config.discount is a scalar so it's the same at every time step.

Another implementation would have been to train a binary classifier to predict the episode ends with hard labels and later multiply its predictions by 0.99 when computing returns.

I chose to train the classifier with soft labels of 0 and 0.99 so they include the agent's discount factor. Thus, there is no further discounting when using the predictions to compute returns: https://github.com/danijar/dreamer/blob/1e38b1d80e11335883108720e671d35cc34de96e/dreamer.py#L182-L185.
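As a rough sketch of why no further discounting is needed (illustrative names, simple discounted return rather than the repo's lambda-return machinery):

```python
def discounted_return(rewards, disc_preds, bootstrap):
    """Accumulate a return backwards in time. Because the discount
    predictions were trained on soft labels of 0 and 0.99, they
    already include the agent's discount factor, so no extra 0.99
    multiplication is applied here."""
    ret = bootstrap
    for r, d in zip(reversed(rewards), reversed(disc_preds)):
        ret = r + d * ret
    return ret

# Two mid-episode steps with predictions near 0.99:
# return = 1.0 + 0.99 * 1.0 = 1.99
assert abs(discounted_return([1.0, 1.0], [0.99, 0.99], 0.0) - 1.99) < 1e-9
```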

xlnwel commented 4 years ago

Hi

I see. That makes sense. I previously thought disc_pred predicted whether the episode would continue at the same time step as the obs, which is why the discount factor confused me. As you explained, once I connect it with the following code, everything makes sense. Thank you so much!

https://github.com/danijar/dreamer/blob/1e38b1d80e11335883108720e671d35cc34de96e/dreamer.py#L182-L185