Closed BaohaoLiao closed 1 year ago
In my opinion, the original objective of Diffusion-LM (as well as ours) is to use the noisy embedding for this loss, which is consistent with the theoretically derived loss function. However, you have a really good insight in switching to the model outputs for this part of the loss. In fact, this trick was also proposed in a recent work from MSRA named Difformer, where it is called the anchor loss. We tried this trick with our model and observed a performance improvement on IWSLT.
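For concreteness, here is a minimal sketch of the difference between the two objectives. This is not the exact code from either repo (Difformer's is unreleased); `token_discrete_loss` and `get_logits` are simplified stand-ins for the functions of the same names in this codebase, and the shapes are illustrative:

```python
import torch
import torch.nn.functional as F

def token_discrete_loss(x, get_logits, input_ids, mask=None):
    """Cross-entropy between the rounded embeddings x and the target token ids."""
    logits = get_logits(x)  # (batch, seq_len, vocab)
    loss = F.cross_entropy(
        logits.view(-1, logits.size(-1)), input_ids.view(-1), reduction="none"
    ).view(input_ids.shape)
    if mask is not None:
        return (loss * mask).sum(-1) / mask.sum(-1).clamp(min=1)
    return loss.mean(-1)

# Original Diffusion-LM rounding term: anchored on the (lightly noised) x_start,
# so it mainly trains the embedding / rounding head.
#   decoder_nll = token_discrete_loss(x_start, get_logits, input_ids)

# Anchor-loss variant (as described for Difformer): anchored on the denoiser's
# prediction of x_0, so the gradient also constrains the model output directly.
#   anchor_loss = token_discrete_loss(model_output, get_logits, input_ids)
```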
I see. Thank you!!!
Hi @Yuanhy1997, I would like to ask if the anchor loss
you mentioned is calculated by
anchor_loss = self.token_discrete_loss(model_output, get_logits, input_ids, mask=loss_mask)
I am not sure if I am missing anything, but this results in very bad generation on at least the QQP dataset (only a 10 BLEU score). You mentioned that you observed an improvement on IWSLT; does that mean you did not see improvements on other English-only datasets (and what score do you get on, e.g., QQP)? If so, do you know the reason? I really appreciate your input here!
Thank you!
Some optimization tricks are required for the anchor loss to work in my case. There are improvements on translation tasks, but I cannot recall whether the same holds on QQP. Since Difformer did not release their code, I cannot tell how exactly they implement the anchor loss.
Thanks! Yes, I was referring to your implementation (not Difformer's). Could you please share your anchor-loss implementation on IWSLT that shows an improvement?
Of course, I will check my codes soon.
I was wondering if the code is ready to share? Thank you, I really appreciate it!
Hi,
thank you again for your clean code. I have a question about the decoder_nll loss.
According to your code, you calculate the decoder_nll loss in this way:
decoder_nll = self.token_discrete_loss(x_start, get_logits, input_ids, mask=loss_mask)
which means that x_start is the noisy word embedding, i.e., you calculate the cross-entropy between the noisy word embedding and input_ids. However, Diffusion-LM states: "We now describe the inverse process of rounding a predicted x0 back to discrete text" (second sentence in Section 4.2). It seems they use the predicted x_start (model_output) rather than the noisy x_start.
I know the original code also implements it this way, but it confuses me. Why don't we replace x_start in self.token_discrete_loss with model_output? The noisy word embedding x_start should be very close to the original word embedding, since at the beginning we only add a little noise, so calculating its loss tells the model almost nothing. Instead, we should make sure the predicted x_start (model_output) is close to the word embedding.
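To illustrate why the x_start term is nearly free: here is a toy sketch (not the repo's code; `rounding_loss`, the random `word_emb`, and the random stand-in for an untrained denoiser's output are all hypothetical) comparing the cross-entropy of a lightly noised target embedding against that of an arbitrary model output:

```python
import torch
import torch.nn.functional as F

def rounding_loss(x, word_emb, input_ids):
    # p(w | x) = softmax(x . E^T): round continuous x back to tokens (cf. Sec. 4.2)
    logits = x @ word_emb.t()
    return F.cross_entropy(logits.view(-1, logits.size(-1)), input_ids.view(-1))

torch.manual_seed(0)
word_emb = torch.randn(100, 16)            # toy vocab of 100, embedding dim 16
input_ids = torch.randint(0, 100, (2, 8))
x_start = word_emb[input_ids] + 0.01 * torch.randn(2, 8, 16)  # lightly noised target
model_output = torch.randn(2, 8, 16)       # stand-in for an untrained denoiser's x0

# The x_start term is already near zero before any denoiser training,
# while the model_output term actually constrains the denoiser.
print(rounding_loss(x_start, word_emb, input_ids))
print(rounding_loss(model_output, word_emb, input_ids))
```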