danijar / dreamerv2

Mastering Atari with Discrete World Models
https://danijar.com/dreamerv2
MIT License
886 stars · 195 forks

Does the actor-critic train using only the stochastic state? #20

Closed lewisboyd closed 2 years ago

lewisboyd commented 2 years ago

Hi,

I'm very interested in your work, but I'm unclear whether the actor-critic is trained using only the stochastic state as its observation, or whether it also uses the recurrent state. What's the reasoning behind this choice?

Thanks for all your work and for putting it on Github!

danijar commented 2 years ago

Hey, it gets both as input. In a POMDP, it definitely needs the GRU state, because that summarizes the history of observations. Empirically, it does not seem to matter much whether it also receives the stochastic sample or not.
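
Roughly, the actor and critic both see the concatenation of the deterministic GRU state and the (flattened) stochastic sample. A minimal sketch (the names `policy_features`, `deter`, and `stoch` here are just illustrative, not necessarily the exact ones in this repo):

```python
import tensorflow as tf

def policy_features(state):
    # state['deter']: deterministic GRU state h_t, shape (batch, deter_dim).
    # state['stoch']: stochastic sample z_t; with categorical latents this is a
    # (batch, groups, classes) one-hot sample, flattened before concatenation.
    stoch = tf.reshape(state['stoch'], [tf.shape(state['stoch'])[0], -1])
    # Both the actor and the critic condition on this concatenated feature.
    return tf.concat([state['deter'], stoch], axis=-1)
```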

GoingMyWay commented 2 years ago

Hi @danijar, after reading the code and the paper, I am confused. In the paper, Fig. 2 shows that the learned prior $\hat{z}_t$ is used for imagination, and in Equation (3) the actor takes $\hat{z}_t$ as input. However, in the code, I found that the actor uses the posterior $z_t$ as input together with $h_t$.

It seems they are different. Could you please help me to understand it?

danijar commented 2 years ago

During imagination training, the actor takes both the GRU state and a sample from the prior as input. During environment interaction, the actor takes both the GRU state and a sample from the posterior as input.

We use the prior during imagination because we don't know the corresponding observations. We use the posterior during environment interaction because we know the current observation.
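
In rough pseudocode (the `img_step`/`obs_step` names are only meant to suggest an RSSM-style interface, not necessarily the exact functions in this repo):

```python
import tensorflow as tf

def features(state):
    # Concatenate the deterministic GRU state with the flattened stochastic sample.
    stoch = tf.reshape(state['stoch'], [tf.shape(state['stoch'])[0], -1])
    return tf.concat([state['deter'], stoch], axis=-1)

def imagine_step(rssm, actor, state):
    # Imagination: no observation is available, so the next stochastic state
    # is sampled from the prior p(z_t | h_t) inside the model's img_step.
    action = actor(features(state)).sample()
    return rssm.img_step(state, action)

def env_step(rssm, actor, state, prev_action, obs_embed):
    # Environment interaction: the current observation is available, so the
    # stochastic state is sampled from the posterior q(z_t | h_t, x_t).
    posterior, prior = rssm.obs_step(state, prev_action, obs_embed)
    action = actor(features(posterior)).sample()
    return posterior, action
```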

The prior and posterior are trained to be close to each other using the KL loss.
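
Concretely, the KL term uses stop-gradients so the prior is pulled toward the posterior more strongly than the other way around (KL balancing, with the 0.8 balance factor from the paper). A sketch with TensorFlow Probability, assuming categorical latents; the function name and logits layout are just for illustration:

```python
import tensorflow as tf
import tensorflow_probability as tfp
tfd = tfp.distributions

def kl_loss(post_logits, prior_logits, balance=0.8):
    # Posterior q(z_t | h_t, x_t) and prior p(z_t | h_t) over the categorical
    # latent groups (logits shape: batch x groups x classes).
    def dist(logits):
        return tfd.Independent(tfd.OneHotCategorical(logits=logits), 1)
    sg = tf.stop_gradient
    # KL balancing: the first term trains the prior toward a frozen posterior,
    # the second trains the posterior toward a frozen prior.
    value = balance * tfd.kl_divergence(dist(sg(post_logits)), dist(prior_logits))
    value += (1 - balance) * tfd.kl_divergence(dist(post_logits), dist(sg(prior_logits)))
    return tf.reduce_mean(value)
```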

GoingMyWay commented 2 years ago

> During imagination training, the actor takes both the GRU state and a sample from the prior as input. During environment interaction, the actor takes both the GRU state and a sample from the posterior as input.
>
> We use the prior during imagination because we don't know the corresponding observations. We use the posterior during environment interaction because we know the current observation.
>
> The prior and posterior are trained to be close to each other using the KL loss.

@danijar I see. Thanks for the clarification.