Open junmokane opened 3 months ago
Hi Junmo, thanks for your interest in the paper.
The reward distribution is a distribution over latent vectors (the discrete code sequences), not images, and the reward for the latent $z$ is $R(z)=p(z)*p(x|z)$, where $p(z)$ is the prior and $p(x\mid z)$ is the decoder's likelihood of generating $x$ from the latent $z$. The term you are asking about computes $\log p(x\mid x)$: the negative binary cross-entropy between the decoder's output (pred_img
) and $x$ (gt_img
) is the same as the log-likelihood of observing the target image $x$ when all the pixels are sampled from independent Bernoullis whose parameters' binary logits are given by the decoder's output.
This is just a trick for numerical stability. The TB loss for a trajectory $s_0\rightarrow s_1\rightarrow\dots\rightarrow s_n$ is $(\log Z+\log\prod P_F(si\mid s{i-1})-\log R(s_n))^2$, and we simply divided all terms inside the square by $n$ (and absorbed the $\frac1n$ constant into $\log Z$). The optimisation problem is equivalent with and without such division by $n$.
The generation of the latent is autoregressive (we sample the entries of the latent code one by one, in order). Therefore, each noninitial state has only one parent: the parent of a state (partial code) where the first $k$ entries have been chosen is the state where the first $k-1$ entries have been chosen and the $k$-th one is undetermined. This makes $PB(s{k-1}\mid sk)$ automatically 1 for all transitions $s{k-1}\rightarrow s_k$. It would be possible to use a different generation scheme where the policy can set the value of any undetermined entry in the latent code, not necessarily following a fixed order. In that case, we would indeed have a nontrivial backward policy. If the backward policy were fixed to uniform, it would be possible, as you noticed, to omit it from the loss and merge it into $\log Z$, since the product of backward transition likelihoods would be the same for all trajectories.
Hi, I'm Junmo Cho.
I've read the paper which was pretty interesting. Sorry for taking your time, but while running the code, I've got some questions.
It would be grateful if I can have some answers! Thanks.