Closed BaohaoLiao closed 1 year ago
In my opinion, the original objective of Diffusion-LM (as well as ours) is to use the noisy embedding for this loss, which is consistent with the theoretically derived loss function. However, you have a really good insight in switching to the model outputs for this part of the loss. In fact, this trick was also proposed in a recent work from MSRA named Difformer, where it is called the anchor loss. We tried this trick with our model and observed a performance improvement on IWSLT.
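For concreteness, here is a minimal sketch of the difference between the two objectives. This is not the exact code from either repo (Difformer's is unreleased); `token_discrete_loss` and `get_logits` are simplified stand-ins for the functions of the same names in this codebase, and the shapes are illustrative:

```python
import torch
import torch.nn.functional as F

def token_discrete_loss(x, get_logits, input_ids, mask=None):
    """Cross-entropy between the rounded embeddings x and the target token ids."""
    logits = get_logits(x)  # (batch, seq_len, vocab)
    loss = F.cross_entropy(
        logits.view(-1, logits.size(-1)), input_ids.view(-1), reduction="none"
    ).view(input_ids.shape)
    if mask is not None:
        return (loss * mask).sum(-1) / mask.sum(-1).clamp(min=1)
    return loss.mean(-1)

# Original Diffusion-LM rounding term: anchored on the (lightly noised) x_start,
# so it mainly trains the embedding / rounding head.
#   decoder_nll = token_discrete_loss(x_start, get_logits, input_ids)

# Anchor-loss variant (as described for Difformer): anchored on the denoiser's
# prediction of x_0, so the gradient also constrains the model output directly.
#   anchor_loss = token_discrete_loss(model_output, get_logits, input_ids)
```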
I see. Thank you!!!
Hi @Yuanhy1997, I would like to ask if the anchor loss
you mentioned is calculated by
anchor_loss = self.token_discrete_loss(model_output, get_logits, input_ids, mask=loss_mask)
I am not sure if I am missing anything, but this results in very bad generation on at least the QQP dataset (only a 10 BLEU score). You mentioned that you observed an improvement on IWSLT; does that mean you did not see improvements on other English-only datasets (and what score do you get on, e.g., QQP)? If so, do you know the reason? I really appreciate your input here!
Thank you!
Some optimization tricks are required for the anchor loss to work in my case. There are improvements on translation tasks, but I cannot recall whether the same holds on QQP. Since Difformer did not release their code, I cannot tell how exactly they implement the anchor loss.
Thanks! Yes, I was referring to your implementation (not Difformer's). Could you please share your anchor-loss implementation on IWSLT that shows an improvement?
Of course, I will check my codes soon.
I was wondering if the code is ready to share? Thank you, I really appreciate it!
Hi,
thank you again for your clean code. I have a question about the decoder_nll loss.
According to your code, you calculate the decoder_nll loss in this way:
decoder_nll = self.token_discrete_loss(x_start, get_logits, input_ids, mask=loss_mask)
which means that x_start is the noisy word embedding, i.e., you calculate the cross-entropy between the noisy word embedding and input_ids. However, Diffusion-LM states: "We now describe the inverse process of rounding a predicted x0 back to discrete text" (second sentence in Section 4.2). It seems they use the predicted x_start (model_output) rather than the noisy x_start.
I know the original code also implements it this way, but it confuses me. Why don't we replace x_start in self.token_discrete_loss with model_output? The noisy word embedding x_start should be very close to the original word embedding, since at the beginning we only add a little noise, so calculating its loss tells the model almost nothing. Instead, we should make sure the predicted x_start (model_output) is close to the word embedding.
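To illustrate why the x_start term is nearly free: here is a toy sketch (not the repo's code; `rounding_loss`, the random `word_emb`, and the random stand-in for an untrained denoiser's output are all hypothetical) comparing the cross-entropy of a lightly noised target embedding against that of an arbitrary model output:

```python
import torch
import torch.nn.functional as F

def rounding_loss(x, word_emb, input_ids):
    # p(w | x) = softmax(x . E^T): round continuous x back to tokens (cf. Sec. 4.2)
    logits = x @ word_emb.t()
    return F.cross_entropy(logits.view(-1, logits.size(-1)), input_ids.view(-1))

torch.manual_seed(0)
word_emb = torch.randn(100, 16)            # toy vocab of 100, embedding dim 16
input_ids = torch.randint(0, 100, (2, 8))
x_start = word_emb[input_ids] + 0.01 * torch.randn(2, 8, 16)  # lightly noised target
model_output = torch.randn(2, 8, 16)       # stand-in for an untrained denoiser's x0

# The x_start term is already near zero before any denoiser training,
# while the model_output term actually constrains the denoiser.
print(rounding_loss(x_start, word_emb, input_ids))
print(rounding_loss(model_output, word_emb, input_ids))
```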