AgentDS opened this issue 2 weeks ago
A modified version could be:
```python
def __get_ppl(self, input_texts: List[str], mask_length=None):
    # (Assumes the surrounding module already imports torch, numpy as np, and typing.List.)
    if self.call_api:
        return api_get_ppl(self.api_name, input_texts)
    self.tokenizer.padding_side = "right"
    inputs = self.tokenizer(input_texts, padding=True, return_tensors='pt', truncation=True)
    inputs = {k: v.to(self.model.device) for k, v in inputs.items()}
    outputs = self.model(**inputs)

    # Shift so that token i is predicted from tokens < i; the first token of
    # every sequence therefore never receives a loss term.
    shift_logits = outputs.logits[..., :-1, :].contiguous()
    shift_labels = inputs["input_ids"][..., 1:].contiguous()
    shift_attention_mask_batch = inputs["attention_mask"][..., 1:].contiguous()

    loss_fct = torch.nn.CrossEntropyLoss(reduction='none', ignore_index=self.tokenizer.pad_token_id)
    loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)),
                    shift_labels.view(-1)).view(shift_labels.size())

    if mask_length is not None:
        # Zero out the loss of the first (mask_length[i] - 1) shifted positions,
        # i.e. only score the continuation part of each sample.
        mask = torch.zeros_like(shift_labels)  # [batch, seqlen]
        for i in range(len(mask)):
            for j in range(mask_length[i] - 1, len(mask[i])):
                mask[i][j] = 1
        loss = loss * mask

    # Count only positions that actually contribute a loss term: sum the
    # *shifted* attention mask instead of counting non-pad token IDs.
    lens = shift_attention_mask_batch.sum(1).cpu().numpy()
    if mask_length is not None:
        lens -= np.array(mask_length)
    ce_loss = loss.sum(-1).cpu().detach().numpy() / lens
    return ce_loss
```
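To see the effect of the changed length computation concretely, here is a small standalone check with made-up values (a toy sketch, not code from the repository): counting non-pad token IDs gives one more token per sample than the number of positions that actually receive a loss term after shifting.

```python
import torch

# Toy batch of two right-padded sequences; 0 is the pad token id (made-up values).
pad_token_id = 0
input_ids = torch.tensor([[11, 12, 13, 14],
                          [21, 22,  0,  0]])
attention_mask = (input_ids != pad_token_id).long()

orig_lens = (input_ids != pad_token_id).sum(-1)  # tensor([4, 2]) -- every non-pad token
new_lens = attention_mask[..., 1:].sum(-1)       # tensor([3, 1]) -- only shifted positions

print(orig_lens, new_lens)  # new_lens == orig_lens - 1 for every sample
```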
In __get_ppl() of PPLInferencer, line 186 computes the token count of each text sample in input_texts by counting the token IDs that are not equal to tokenizer.pad_token_id. However, when the loss is calculated, the tokens that actually contribute start from the second token of each input rather than from the beginning, as shown in line 173. Thus, I think the correct way to compute the token count at line 186 is to sum the shifted attention mask instead, as in the modified version above.
The new version differs from the original only slightly, namely new_lens = orig_lens - 1 for every sample. For reference: