wasiahmad opened 3 years ago
Hello, the code assumes that the padding index = 1. Please see the related issue here as well: https://github.com/facebookresearch/unlikelihood_training/issues/3
The idea here is to make ctx_cands of size [num_target_tokens, num_target_tokens] such that row i contains the indices of the previous context tokens, with the remaining entries set to the pad token index.
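For concreteness, here is a minimal toy sketch of that construction (the target values and sizes are made up for illustration; it assumes target is the flattened gold-token vector and padding_idx = 1, as in fairseq):

```python
import torch

padding_idx = 1                          # fairseq convention the code relies on
target = torch.tensor([4, 7, 4, 2, 9])   # toy flattened gold tokens, shape [T]
T = target.size(0)

# Row i repeats the full target sequence: shape [T, T].
ctx_cands = target.unsqueeze(0).expand(T, T)

# Keep only tokens strictly before position i ("the triangle") and fill
# everything on/above the diagonal with padding_idx.
ctx_cands_ = ctx_cands.tril(-1) + padding_idx  # below diag: token + 1; elsewhere: 1
ctx_cands_ = ctx_cands_ * ctx_cands_.triu()    # zero below diag, keep the 1s above
ctx_cands = ctx_cands.tril(-1) + ctx_cands_    # tokens below diag, pad elsewhere
print(ctx_cands)
# tensor([[1, 1, 1, 1, 1],
#         [4, 1, 1, 1, 1],
#         [4, 7, 1, 1, 1],
#         [4, 7, 4, 1, 1],
#         [4, 7, 4, 2, 1]])
```

Note that the multiply trick leaves exactly padding_idx above the diagonal only because 1 * 1 = 1, which is why the code assumes the padding index is 1.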
Then the lprobs mask of size [num_predicted_tokens, vocab_size] is created using a scatter operation:
negative_targets = torch.zeros_like(lprobs).scatter_(1, ctx_cands, 1)
Here, for every time-step position we have a full-vocabulary vector in which we assign 1 to every token index that needs to be penalized (i.e., every negative candidate).
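Continuing the toy sketch above, the linked snippet also masks out each step's own gold token before scattering, so a token is not penalized at the step where it is actually correct:

```python
V = 12                                            # toy vocab size
lprobs = torch.log_softmax(torch.randn(T, V), 1)  # stand-in for the model's log-probs, [T, V]

# Don't treat the current step's gold token as a negative candidate.
ctx_cands = ctx_cands.masked_fill(ctx_cands == target.unsqueeze(1), padding_idx)

# [T, V] mask with 1 at every negative-candidate column (the pad column also gets a 1).
negative_targets = torch.zeros_like(lprobs).scatter_(1, ctx_cands, 1)
print(negative_targets[2])  # step 2: gold token 4 was removed; only context token 7 (and pad) remain
# tensor([0., 1., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.])
```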
Later this mask is used to obtain the final UL loss here:
custom_loss = -torch.log(one_minus_probs)*negative_targets
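A sketch of that last step, continuing from above (the clamp floor is illustrative, just to keep the log away from 0):

```python
# Probability mass NOT assigned to each token, clamped for numerical stability.
one_minus_probs = torch.clamp(1.0 - lprobs.exp(), min=1e-5)

# Only the entries flagged in negative_targets contribute to the UL term.
custom_loss = (-torch.log(one_minus_probs) * negative_targets).sum()
```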
Please feel free to ask further questions if you have any!
Hi, I am trying to understand the following code snippet.
https://github.com/facebookresearch/unlikelihood_training/blob/723747171a3fa909cda68df399e39f0a3e5067d9/custom/candidate_penalty_ce_loss.py#L50-L59
If my understanding is correct, ctx_cands is a square matrix where each dimension is of size batch_size x sequence_len after the following statement. If I assume self.padding_idx = 0, what is the point of the following two statements? Because after the above two statements, ctx_cands_ will be a zero tensor, isn't it? Can you please explain how the lines of code pick the previous context tokens as negative candidates?