SkyworkAI / Skywork

Skywork series models are pre-trained on 3.2TB of high-quality multilingual (mainly Chinese and English) and code data. We have open-sourced the model, training data, evaluation data, and evaluation methods.

Questions about eval_loss.py #50

Closed. chengeharrison closed this issue 9 months ago.

chengeharrison commented 9 months ago

In Line 58, the number of tokens is calculated with attention_mask = attention_mask[:, :-1] followed by torch.sum(attention_mask).item(). But do we really need to shift the attention mask? Maybe torch.sum(attention_mask).item() - batch_size (without shifting) is the correct count?

For example, if the batch size is 2, the input_ids can be [[1, 2, 3], [1, 2, pad]] and the attention mask is [[True, True, True], [True, True, False]]. Using attention_mask = attention_mask[:, :-1] and torch.sum(attention_mask).item() gives 4 as the number of tokens. But the count should actually be 3, because the loss is only computed against the labels [2, 3] and [2, pad] (the first label of each sequence is dropped by label = label[:, 1:]), and pad is not counted as a valid token when calculating the loss. If we set IGNORE_INDEX in the labels according to the attention_mask, we do not need a shifted attention mask when calculating the loss.

A code example:

import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Skywork/Skywork-13B-base", use_fast=False, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token

tokenized_texts = tokenizer(
    ["This is an example text.", "这是一个实例文本,这句话比较长。"],
    add_special_tokens=False,
    padding=True,
    truncation=True,
    max_length=128,
    return_tensors="pt",
)

input_ids = tokenized_texts.input_ids
attention_mask = tokenized_texts.attention_mask

print(f"Input sequence length: {input_ids.size(1)}")
print("Input labels:")
print(input_ids)
print("Input attention mask:")
print(attention_mask)
# Proposed count: all non-pad tokens minus one per sequence (the first label is dropped by the shift).
print(f"num_tokens: {torch.sum(attention_mask).item() - input_ids.size(0)}")

# Count currently used in eval_loss.py: sum of the attention mask with the last position dropped.
shift_attention_mask = attention_mask[:, :-1]
print("Shifted attention mask:")
print(shift_attention_mask)
print(f"num_tokens: {torch.sum(shift_attention_mask).item()}")

The output is:

Input sequence length: 13
Input labels:
tensor([[  910,   338,   385,  1342,  1426, 29889,     2,     2,     2,     2,
             2,     2,     2],
        [29871, 30810, 30392, 41176, 50921, 45522, 30214, 30810, 32760, 31852,
         40579, 31143, 30267]])
Input attention mask:
tensor([[1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])
num_tokens: 17
Shifted attention mask:
tensor([[1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])
num_tokens: 18

But from the "Input labels" output, it is clear that the number of tokens should be 17 (tokens 338 through 29889 in the first sample and 30810 through 30267 in the second), which matches the unshifted count.
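For reference, here is a minimal sketch of the counting I am suggesting (IGNORE_INDEX, count_valid_tokens, and the hard-coded tensors are only for illustration, not code from eval_loss.py): mask padded positions in the labels, drop the first label of each sequence, and count the remaining non-ignored positions.

import torch

IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

def count_valid_tokens(input_ids, attention_mask):
    # Build labels from input_ids and mask out padded positions.
    labels = input_ids.clone()
    labels[attention_mask == 0] = IGNORE_INDEX
    # Drop the first label of every sequence (it has no preceding token to predict it from).
    shift_labels = labels[:, 1:]
    # Only the remaining non-ignored positions contribute to the loss.
    return (shift_labels != IGNORE_INDEX).sum().item()

# The tensors from the example above:
input_ids = torch.tensor([[  910,   338,   385,  1342,  1426, 29889,     2,     2,     2,     2,     2,     2,     2],
                          [29871, 30810, 30392, 41176, 50921, 45522, 30214, 30810, 32760, 31852, 40579, 31143, 30267]])
attention_mask = torch.tensor([[1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
                               [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])
print(count_valid_tokens(input_ids, attention_mask))  # prints 17

This gives 17 for the example above, the same as torch.sum(attention_mask).item() - batch_size.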

chengeharrison commented 9 months ago

And is there a typo in Line 77? Should it be tokenizer.padding_side = "right" instead?
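In other words, something along these lines (just a sketch of the tokenizer setup I have in mind, not the actual code in eval_loss.py):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Skywork/Skywork-13B-base", use_fast=False, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"  # pad at the end so the attention mask and labels line up as in the example above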

zhao1iang commented 9 months ago

Thank you for providing your feedback. We appreciate your comment and acknowledge the validity of your point. We will address the issue mentioned and rectify the typo in the upcoming upload.

zhao1iang commented 9 months ago

The typo and token number calculation issue in the eval_loss.py script have been addressed.