leekum2018 opened 1 year ago
Hi,
Yes, we generate all tokens in one diffusion step. We use DDIM sampling to predict $x_0$ and get $x_{t-1}$ from the forward process. The demonstration in Table 1 shows the input of BERT at time step $t-1$.
Besides, the corresponding predicted $x_0$ is composed of less informative tokens when $t$ is large and gradually shows semantic meaning as $t$ goes to 0. That is also the motivation of our spindle noise schedule.
Hope this helps. If you have more questions, please feel free to contact me.
Thank you for your reply! I have a further question. According to your reply, does it mean you model $p_{\theta}(x_{t-1}|x_t)$ as $p_{\theta}(x_{t-1}|x_t) = \sum_{\widetilde{x}_{0}} q(x_{t-1}|x_t, \widetilde{x}_{0})\,\widetilde{p}(\widetilde{x}_{0}|x_t)$? And is the term $\widetilde{p}(\widetilde{x}_{0}|x_t)$ the output of BERT? Thank you!
Yes, that's right. DDIM sampling helps to trade off speed and generation quality. And predicting $x_0$ directly is closer to the MLM training objective.
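For future readers, here is a minimal sketch (my own illustration, not the authors' code) of one $x_0$-parameterized reverse step for absorbing-state (mask) diffusion. `MASK_ID`, the function name, and the scalar schedule values `alpha_bar_*` are all hypothetical; the model's $\widetilde{p}(\widetilde{x}_{0}|x_t)$ is assumed to be given as per-position logits:

```python
import math
import random

MASK_ID = 103  # hypothetical [MASK] token id


def reverse_step(x_t, x0_logits, alpha_bar_t, alpha_bar_tm1, rng):
    """One x0-parameterized reverse step for absorbing-state diffusion.

    x_t:       list of token ids (length L)
    x0_logits: per-position logit lists over the vocab, i.e. the
               model's prediction of x_0 (e.g. BERT's output)

    An unmasked token is kept as-is; each masked position is revealed
    with probability (abar_{t-1} - abar_t) / (1 - abar_t), sampling the
    revealed token from the predicted p(x_0 | x_t).
    """
    p_reveal = (alpha_bar_tm1 - alpha_bar_t) / (1.0 - alpha_bar_t)
    x_prev = []
    for tok, logits in zip(x_t, x0_logits):
        if tok != MASK_ID or rng.random() >= p_reveal:
            x_prev.append(tok)  # keep unmasked / still-masked tokens
            continue
        m = max(logits)  # stable softmax over the predicted x_0
        weights = [math.exp(l - m) for l in logits]
        x_prev.append(rng.choices(range(len(weights)), weights=weights)[0])
    return x_prev
```

With $\overline{\alpha}_{t-1} = 1$ every remaining mask is revealed in a single step, which matches "generate all tokens in one diffusion step" above; intermediate schedules reveal tokens gradually.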
Hi, I have another question. In Eq. 9, how do you compute $H(x_{0}^{i})$? In other words, what is the distribution of $x_{0}^{i}$ used to calculate $H(x_{0}^{i})$? I have a hard time understanding why the following equation holds. Thank you!
Hi,
In fact, $H(x_0^i)$ can be calculated in many ways. We calculate the entropy of each token by the negative logarithm of its frequency in the tokenized training corpus.
Since a masked token loses all its information, the expected information loss of the i-th token at $t$ is $\overline{\alpha}_t^iH(\textbf{x}_0^i)$. We get Eq. 9 by taking the sum over the sequence.
Hope this helps.
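A minimal sketch of the entropy estimate described above (my own illustration; the function names are made up): each token's entropy is the negative log of its relative frequency in the tokenized training corpus, and the sequence-level quantity follows the reply above by weighting each $H(x_0^i)$ with $\overline{\alpha}_t^i$ and summing:

```python
import math
from collections import Counter


def token_entropies(tokenized_corpus):
    """H(w) = -log f(w), with f(w) the token's relative frequency
    in the tokenized training corpus."""
    counts = Counter(tok for sent in tokenized_corpus for tok in sent)
    total = sum(counts.values())
    return {tok: -math.log(c / total) for tok, c in counts.items()}


def expected_info(x0, abar_t, H):
    """Sum over the sequence of abar_t^i * H(x_0^i),
    as in the reply's reading of Eq. 9."""
    return sum(a * H[tok] for tok, a in zip(x0, abar_t))
```

Note that rarer tokens get larger entropy, which is what drives the spindle schedule's ordering of which tokens to mask first.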
For the following formula from *Structured Denoising Diffusion Models in Discrete State-Spaces* (D3PM), why is the LHS proportional to the RHS? Could you please give me some hints? I have a hard time deriving it. Thank you!
Hi @leekum2018,
You can refer to this discussion: https://openreview.net/forum?id=h7-XixPCAL&noteId=xm7onR_Sg0L
Hope it helps!
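For future readers: assuming the formula in question is the D3PM posterior $q(x_{t-1}|x_t, x_0)$, the proportionality is just Bayes' rule plus the Markov property of the forward process:

$$
q(x_{t-1}|x_t, x_0) = \frac{q(x_t|x_{t-1}, x_0)\,q(x_{t-1}|x_0)}{q(x_t|x_0)} \propto q(x_t|x_{t-1})\,q(x_{t-1}|x_0),
$$

since $q(x_t|x_{t-1}, x_0) = q(x_t|x_{t-1})$ by the Markov property, and the denominator $q(x_t|x_0)$ does not depend on $x_{t-1}$ and is absorbed into the normalizing constant.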
As said in the second paragraph of Section 4.3, "We attribute the superior performance of DiffusionBERT to its onetime sampling of all tokens". I wonder what "onetime sampling of all tokens" means. Does it mean generating all the tokens in a sentence at once? If it does, that seems to conflict with the demonstration in Table 1. Thank you!