[Some questions about implementation]

Hi, I'm Junmo Cho.

I've read the paper which was pretty interesting. Sorry for taking your time, but while running the code, I've got some questions.

Is minus of binary_cross_entropy between img, and pred_img coming from assuming the reward distribution as Bernoulli distribution? I thought that for each pixel in img (which is gt target, and value is 1 or 0) is used as Ber(y|pi) = pi^y * (1-pi)^(1-y) where y is pixel and from for each pixel dist in pred_img, we input it for pi. Please correct me if my understanding is wrong.
Another thing is why do we divide steps (which is length of generation sequence of GFN - 16 here) for logprobs, and reward when calculating TB loss? I thought that logprobs is itself log of production of P_F(si | s{i-1}) from i=1 to n as in the paper.
Also, why there is no backward policy term in the TB loss? Are we assuming backward policy as uniform and involve it in logZ?

It would be grateful if I can have some answers! Thanks.

Hi Junmo, thanks for your interest in the paper.

The reward distribution is a distribution over latent vectors (the discrete code sequences), not images, and the reward for the latent $z$ is $R(z)=p(z)*p(x|z)$, where $p(z)$ is the prior and $p(x\mid z)$ is the decoder's likelihood of generating $x$ from the latent $z$. The term you are asking about computes $\log p(x\mid x)$: the negative binary cross-entropy between the decoder's output (pred_img) and $x$ (gt_img) is the same as the log-likelihood of observing the target image $x$ when all the pixels are sampled from independent Bernoullis whose parameters' binary logits are given by the decoder's output.
This is just a trick for numerical stability. The TB loss for a trajectory $s_0\rightarrow s_1\rightarrow\dots\rightarrow s_n$ is $(\log Z+\log\prod P_F(si\mid s{i-1})-\log R(s_n))^2$, and we simply divided all terms inside the square by $n$ (and absorbed the $\frac1n$ constant into $\log Z$). The optimisation problem is equivalent with and without such division by $n$.
The generation of the latent is autoregressive (we sample the entries of the latent code one by one, in order). Therefore, each noninitial state has only one parent: the parent of a state (partial code) where the first $k$ entries have been chosen is the state where the first $k-1$ entries have been chosen and the $k$-th one is undetermined. This makes $PB(s{k-1}\mid sk)$ automatically 1 for all transitions $s{k-1}\rightarrow s_k$. It would be possible to use a different generation scheme where the policy can set the value of any undetermined entry in the latent code, not necessarily following a fixed order. In that case, we would indeed have a nontrivial backward policy. If the backward policy were fixed to uniform, it would be possible, as you noticed, to omit it from the loss and merge it into $\log Z$, since the product of backward transition likelihoods would be the same for all trajectories.

GFNOrg / GFlowNet-EM

[Some questions about implementation] #4