SkyworkAI / Skywork

Skywork series models are pre-trained on 3.2TB of high-quality multilingual (mainly Chinese and English) and code data. We have open-sourced the model, training data, evaluation data, and evaluation methods.

Questions about eval_loss.py #50

Closed. chengeharrison closed this issue 9 months ago.

chengeharrison commented 9 months ago

In Line 58, the number of tokens is calculated with attention_mask = attention_mask[:, :-1] followed by torch.sum(attention_mask).item(). But do we really need to shift the attention mask? Maybe torch.sum(attention_mask).item() - batch_size (without shifting) is the correct count?

For example, if the batch size is 2, the input_ids can be [[1, 2, 3], [1, 2, pad]] and the attention mask is [[True, True, True], [True, True, False]]. Using attention_mask = attention_mask[:, :-1] and torch.sum(attention_mask).item() gives 4 as the number of tokens. But the count should actually be 3, because the loss is only computed against the labels [2, 3] and [2, pad] (the first label of each sequence is dropped by label = label[:, 1:]), and pad is not counted as a valid token when calculating the loss. If we set IGNORE_INDEX in the labels according to the attention_mask, we do not need a shifted attention mask when calculating the loss.

A code example:

import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Skywork/Skywork-13B-base", use_fast=False, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token

tokenized_texts = tokenizer(
    ["This is an example text.", "这是一个实例文本,这句话比较长。"],
    add_special_tokens=False,
    padding=True,
    truncation=True,
    max_length=128,
    return_tensors="pt",
)

input_ids = tokenized_texts.input_ids
attention_mask = tokenized_texts.attention_mask

print(f"Input sequence length: {input_ids.size(1)}")
print("Input labels:")
print(input_ids)
print("Input attention mask:")
print(attention_mask)
# Proposed count: all non-pad tokens minus one per sequence (the first label is dropped by the shift).
print(f"num_tokens: {torch.sum(attention_mask).item() - input_ids.size(0)}")

# Count currently used in eval_loss.py: sum of the attention mask with the last position dropped.
shift_attention_mask = attention_mask[:, :-1]
print("Shifted attention mask:")
print(shift_attention_mask)
print(f"num_tokens: {torch.sum(shift_attention_mask).item()}")

The output is:

Input sequence length: 13
Input labels:
tensor([[  910,   338,   385,  1342,  1426, 29889,     2,     2,     2,     2,
             2,     2,     2],
        [29871, 30810, 30392, 41176, 50921, 45522, 30214, 30810, 32760, 31852,
         40579, 31143, 30267]])
Input attention mask:
tensor([[1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])
num_tokens: 17
Shifted attention mask:
tensor([[1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])
num_tokens: 18

But from the "Input labels" output, it is clear that the number of tokens should be 17 (tokens 338 through 29889 in the first sample and 30810 through 30267 in the second), which matches the unshifted count.
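For reference, here is a minimal sketch of the counting I am suggesting (IGNORE_INDEX, count_valid_tokens, and the hard-coded tensors are only for illustration, not code from eval_loss.py): mask padded positions in the labels, drop the first label of each sequence, and count the remaining non-ignored positions.

import torch

IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

def count_valid_tokens(input_ids, attention_mask):
    # Build labels from input_ids and mask out padded positions.
    labels = input_ids.clone()
    labels[attention_mask == 0] = IGNORE_INDEX
    # Drop the first label of every sequence (it has no preceding token to predict it from).
    shift_labels = labels[:, 1:]
    # Only the remaining non-ignored positions contribute to the loss.
    return (shift_labels != IGNORE_INDEX).sum().item()

# The tensors from the example above:
input_ids = torch.tensor([[  910,   338,   385,  1342,  1426, 29889,     2,     2,     2,     2,     2,     2,     2],
                          [29871, 30810, 30392, 41176, 50921, 45522, 30214, 30810, 32760, 31852, 40579, 31143, 30267]])
attention_mask = torch.tensor([[1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
                               [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])
print(count_valid_tokens(input_ids, attention_mask))  # prints 17

This gives 17 for the example above, the same as torch.sum(attention_mask).item() - batch_size.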

chengeharrison commented 9 months ago

And is there a typo in Line 77? Should it be tokenizer.padding_side = "right" instead?
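In other words, something along these lines (just a sketch of the tokenizer setup I have in mind, not the actual code in eval_loss.py):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Skywork/Skywork-13B-base", use_fast=False, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"  # pad at the end so the attention mask and labels line up as in the example above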

zhao1iang commented 9 months ago

Thank you for providing your feedback. We appreciate your comment and acknowledge the validity of your point. We will address the issue mentioned and rectify the typo in the upcoming upload.

zhao1iang commented 9 months ago

The typo and token number calculation issue in the eval_loss.py script have been addressed.