OpenMOSS / CoLLiE

Collaborative Training of Large Language Models in an Efficient Way
https://openlmlab-collie.readthedocs.io
Apache License 2.0

fix(dataset): concat input&output then tokenize #190

Closed. Anti-Entrophic closed this 2 weeks ago.

Anti-Entrophic commented 2 weeks ago

On some datasets, the input and output strings are concatenated first and then tokenized as a whole, which can produce wrong token ids.

Example, using the internlm2 vocabulary: here the last token of the input gets merged with the first token of the output.

train_dataset = [
    {
        "input": "The sentiment of this comment is: ",
        "output": "negative.",
    }
]

train_dataset_classification = [
    {
        "input": "The sentiment of this comment is: ",
        "output": ["negative.", "positive."],
        "target": 0,
    }
]

train_dataset = CollieDatasetForTraining(train_dataset, tokenizer)
print(train_dataset[0])
# {'input_ids': [1, 918, 26413, 446, 550, 4137, 505, 334, 8357, 281], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'labels': [-100, -100, -100, -100, -100, -100, -100, -100, 8357, 281]}
# After the fix:
# {'input_ids': [1, 918, 26413, 446, 550, 4137, 505, 334, 262, 41889, 281], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'labels': [-100, -100, -100, -100, -100, -100, -100, -100, -100, 41889, 281]}
train_dataset_classification = CollieDatasetForClassification(train_dataset_classification, tokenizer)
print(train_dataset_classification[0])
# {'input_ids': ([1, 918, 26413, 446, 550, 4137, 505, 334, 8357, 281], [1, 918, 26413, 446, 550, 4137, 505, 334, 6936, 281]), 'attention_mask': ([1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]), 'labels': ([-100, -100, -100, -100, -100, -100, -100, -100, 8357, 281], [-100, -100, -100, -100, -100, -100, -100, -100, 6936, 281]), 'target': 0}
# After the fix:
# {'input_ids': ([1, 918, 26413, 446, 550, 4137, 505, 334, 262, 41889, 281], [1, 918, 26413, 446, 550, 4137, 505, 334, 262, 30721, 281]), 'attention_mask': ([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]), 'labels': ([-100, -100, -100, -100, -100, -100, -100, -100, -100, 41889, 281], [-100, -100, -100, -100, -100, -100, -100, -100, -100, 30721, 281]), 'target': 0}

print(tokenizer.encode("The sentiment of this comment is: "))
# [1, 918, 26413, 446, 550, 4137, 505, 334, 262]
print(tokenizer.encode("negative.", add_special_tokens=False))
# [41889, 281]
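
In essence, the fix tokenizes input and output separately and concatenates the resulting ids, so BPE can never merge tokens across the boundary. A minimal sketch of the idea (illustrative code, not the exact change in this PR):

prompt_ids = tokenizer.encode("The sentiment of this comment is: ")    # [1, ..., 334, 262], bos kept
answer_ids = tokenizer.encode("negative.", add_special_tokens=False)   # [41889, 281]
input_ids = prompt_ids + answer_ids                 # no merge across the seam: ... 262, 41889, 281
labels = [-100] * len(prompt_ids) + answer_ids      # mask the prompt, train only on the answer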
KaiLv69 commented 2 weeks ago

This doesn't seem to handle the cases that need an eos token.

Anti-Entrophic commented 2 weeks ago

Updated.

With an eos token:

tokenizer = AutoTokenizer.from_pretrained(model_path, add_eos_token=True, trust_remote_code=True)

train_dataset = CollieDatasetForTraining(train_dataset, tokenizer)
print(train_dataset[0])
# {'input_ids': [1, 918, 26413, 446, 550, 4137, 505, 334, 262, 41889, 281, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'labels': [-100, -100, -100, -100, -100, -100, -100, -100, -100, 41889, 281, 2]}

train_dataset_classification = CollieDatasetForClassification(train_dataset_classification, tokenizer)
print(train_dataset_classification[0])
# {'input_ids': ([1, 918, 26413, 446, 550, 4137, 505, 334, 262, 41889, 281, 2], [1, 918, 26413, 446, 550, 4137, 505, 334, 262, 30721, 281, 2]), 'attention_mask': ([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]), 'labels': ([-100, -100, -100, -100, -100, -100, -100, -100, -100, 41889, 281, 2], [-100, -100, -100, -100, -100, -100, -100, -100, -100, 30721, 281, 2]), 'target': 0}

Without an eos token:

tokenizer = AutoTokenizer.from_pretrained(model_path, add_eos_token=False, trust_remote_code=True)

train_dataset = CollieDatasetForTraining(train_dataset, tokenizer)
print(train_dataset[0])
# {'input_ids': [1, 918, 26413, 446, 550, 4137, 505, 334, 262, 41889, 281], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'labels': [-100, -100, -100, -100, -100, -100, -100, -100, -100, 41889, 281]}

train_dataset_classification = CollieDatasetForClassification(train_dataset_classification, tokenizer)
print(train_dataset_classification[0])
# {'input_ids': ([1, 918, 26413, 446, 550, 4137, 505, 334, 262, 41889, 281], [1, 918, 26413, 446, 550, 4137, 505, 334, 262, 30721, 281]), 'attention_mask': ([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]), 'labels': ([-100, -100, -100, -100, -100, -100, -100, -100, -100, 41889, 281], [-100, -100, -100, -100, -100, -100, -100, -100, -100, 30721, 281]), 'target': 0}
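
For reference, a hedged sketch of how both runs above can be produced (assuming a LLaMA-style tokenizer that exposes an add_eos_token flag; the names are illustrative, not necessarily the merged code). bos and eos are attached manually at the two ends only, never at the input/output seam:

# Tokenize both pieces without any special tokens.
prompt_ids = tokenizer.encode("The sentiment of this comment is: ", add_special_tokens=False)
answer_ids = tokenizer.encode("negative.", add_special_tokens=False)
# Prepend bos to the prompt if the tokenizer defines one.
if tokenizer.bos_token_id is not None:
    prompt_ids = [tokenizer.bos_token_id] + prompt_ids
# Append eos after the answer only when the tokenizer was built with add_eos_token=True.
if getattr(tokenizer, "add_eos_token", False) and tokenizer.eos_token_id is not None:
    answer_ids = answer_ids + [tokenizer.eos_token_id]
input_ids = prompt_ids + answer_ids
labels = [-100] * len(prompt_ids) + answer_ids   # eos, when present, is learned as part of the answer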