Hzfinfdu / Diffusion-BERT

ACL'2023: DiffusionBERT: Improving Generative Masked Language Models with Diffusion Models
Apache License 2.0
286 stars 24 forks

Inquiry on some details of the method. #8

Open leekum2018 opened 1 year ago

leekum2018 commented 1 year ago

As said in the second paragraph of Section 4.3, "We attribute the superior performance of DiffusionBERT to its onetime sampling of all tokens". I wonder about the meaning of "onetime sampling of all tokens": does it mean generating all the tokens in a sentence at once? If so, this seems to conflict with the demonstration in Table 1. Thank you!

Hzfinfdu commented 1 year ago

Hi,

Yes, we generate all tokens in one diffusion step. We use DDIM sampling to predict $x_0$ and then obtain $x_{t-1}$ from the forward process. The demonstration in Table 1 shows the input of BERT at time step $t-1$.

Besides, the corresponding predicted $x_0$ is composed of less informative tokens when $t$ is large and gradually gains semantic meaning as $t$ goes to 0. That is also the motivation for our spindle noise schedule.
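
In pseudocode, one such decoding step could look roughly like the sketch below. This is only an illustration of the idea (predict $x_0$ for all positions at once, then re-noise to $t-1$ with the forward masking process); the function and variable names are made up and are not the actual API of this repo.

```python
import torch

MASK_ID = 103  # illustrative: BERT's default [MASK] id; check the actual tokenizer

def ddim_like_step(model, x_t, alpha_bar_prev, mask_id=MASK_ID):
    """One reverse step: predict x_0 for all positions in a single forward pass,
    then draw x_{t-1} from the forward (masking) process applied to that
    prediction. A sketch, not the exact implementation in this repo."""
    logits = model(x_t).logits                  # (batch, seq_len, vocab); HF-style output assumed
    x0_pred = logits.argmax(dim=-1)             # greedy choice of the predicted x_0, for illustration

    # Forward process to step t-1: each position of the predicted x_0 stays
    # visible with probability alpha_bar_{t-1}, otherwise it is re-masked.
    keep_prob = torch.full(x_t.shape, alpha_bar_prev, dtype=torch.float, device=x_t.device)
    keep = torch.bernoulli(keep_prob).bool()
    return torch.where(keep, x0_pred, torch.full_like(x_t, mask_id))
```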

Hope this helps. If you have more questions, please feel free to contact me.

leekum2018 commented 1 year ago

Thank you for your reply! I have a further question. According to your reply, does it mean you model $p_{\theta}(x_{t-1}|x_t)$ as follows? (screenshot of the equation) And is the term $\widetilde{p}(\widetilde{x}_{0}|x_t)$ the output of BERT? Thank you!

Hzfinfdu commented 1 year ago

Yes, that's right. DDIM sampling helps to trade off speed and generation quality. And predicting $x_0$ directly is closer to the MLM training objective.
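
To spell out the modelling choice being confirmed here (assuming the screenshot shows the usual D3PM $\widetilde{x}_0$-parameterization, which is what DiffusionBERT builds on):

$$
p_{\theta}(x_{t-1}\mid x_t)\;=\;\sum_{\widetilde{x}_0} q\!\left(x_{t-1}\mid x_t,\widetilde{x}_0\right)\,\widetilde{p}_{\theta}\!\left(\widetilde{x}_0\mid x_t\right),
$$

where $\widetilde{p}_{\theta}(\widetilde{x}_0\mid x_t)$ is the per-position distribution over the vocabulary output by BERT.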

leekum2018 commented 1 year ago

Hi, I have another question. In Eq. 9, how do you compute $H(x_{0}^{i})$? In other words, what is the distribution of $x_{0}^{i}$ used to calculate $H(x_{0}^{i})$? I ask because I have a hard time understanding why the following equation holds. (screenshot of the equation) Thank you!

Hzfinfdu commented 1 year ago

Hi,

In fact, $H(x_0^i)$ can be calculated in many ways. We calculate the entropy of each token by the negative logarithm of its frequency in the tokenized training corpus.

Since a masked token loses all its information, the expected information loss of the $i$-th token at step $t$ is $\overline{\alpha}_t^i H(\mathbf{x}_0^i)$. We get Eq. 9 by taking the sum over the sequence.
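
If it helps, here is a minimal sketch of this computation; the helper names and data layout are illustrative, not the code in this repo.

```python
import math
from collections import Counter

def token_entropies(tokenized_corpus):
    """H(x_0^i) per token id: the negative log of the token's relative
    frequency in the tokenized training corpus, as described above."""
    counts = Counter(tok for sent in tokenized_corpus for tok in sent)
    total = sum(counts.values())
    return {tok: -math.log(c / total) for tok, c in counts.items()}

def expected_information_loss(sentence, alpha_bar_t, entropy):
    """Sum over the sequence of alpha_bar_t^i * H(x_0^i);
    `alpha_bar_t` holds the per-position coefficients at step t."""
    return sum(a * entropy[tok] for a, tok in zip(alpha_bar_t, sentence))
```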

Hope this helps.

leekum2018 commented 1 year ago

For the following formula from Structured Denoising Diffusion Models in Discrete State-Spaces, why is the LHS proportional to the RHS? Could you please give me some hints? I have a hard time deriving this. (screenshot of the equation)

Siddharth-Shrivastava7 commented 1 year ago

Hi @leekum2018,

you can refer to this: https://openreview.net/forum?id=h7-XixPCAL&noteId=xm7onR_Sg0L
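
In case the link goes stale: assuming the formula in question involves the forward posterior $q(x_{t-1}\mid x_t,\widetilde{x}_0)$ as in D3PM, the proportionality usually comes from Bayes' rule plus the Markov property of the forward process:

$$
q(x_{t-1}\mid x_t,\widetilde{x}_0)=\frac{q(x_t\mid x_{t-1},\widetilde{x}_0)\,q(x_{t-1}\mid \widetilde{x}_0)}{q(x_t\mid \widetilde{x}_0)}\;\propto\;q(x_t\mid x_{t-1})\,q(x_{t-1}\mid \widetilde{x}_0),
$$

since $q(x_t\mid x_{t-1},\widetilde{x}_0)=q(x_t\mid x_{t-1})$ and the denominator does not depend on $x_{t-1}$.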

Hope it helps!