Closed Anti-Entrophic closed 2 weeks ago
It seems this can't be solved for cases that require an eos token.
Fixed.
With an eos token:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_path, add_eos_token=True, trust_remote_code=True)
train_dataset = CollieDatasetForTraining(train_dataset, tokenizer)
print(train_dataset[0])
# {'input_ids': [1, 918, 26413, 446, 550, 4137, 505, 334, 262, 41889, 281, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'labels': [-100, -100, -100, -100, -100, -100, -100, -100, -100, 41889, 281, 2]}
train_dataset_classification = CollieDatasetForClassification(train_dataset_classification, tokenizer)
print(train_dataset_classification[0])
# {'input_ids': ([1, 918, 26413, 446, 550, 4137, 505, 334, 262, 41889, 281, 2], [1, 918, 26413, 446, 550, 4137, 505, 334, 262, 30721, 281, 2]), 'attention_mask': ([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]), 'labels': ([-100, -100, -100, -100, -100, -100, -100, -100, -100, 41889, 281, 2], [-100, -100, -100, -100, -100, -100, -100, -100, -100, 30721, 281, 2]), 'target': 0}
Without an eos token:
tokenizer = AutoTokenizer.from_pretrained(model_path, add_eos_token=False, trust_remote_code=True)
train_dataset = CollieDatasetForTraining(train_dataset, tokenizer)
print(train_dataset[0])
# {'input_ids': [1, 918, 26413, 446, 550, 4137, 505, 334, 262, 41889, 281], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'labels': [-100, -100, -100, -100, -100, -100, -100, -100, -100, 41889, 281]}
train_dataset_classification = CollieDatasetForClassification(train_dataset_classification, tokenizer)
print(train_dataset_classification[0])
# {'input_ids': ([1, 918, 26413, 446, 550, 4137, 505, 334, 262, 41889, 281], [1, 918, 26413, 446, 550, 4137, 505, 334, 262, 30721, 281]), 'attention_mask': ([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]), 'labels': ([-100, -100, -100, -100, -100, -100, -100, -100, -100, 41889, 281], [-100, -100, -100, -100, -100, -100, -100, -100, -100, 30721, 281]), 'target': 0}
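The label-masking convention visible in the printed samples above can be sketched as follows (`build_labels` is a made-up helper for illustration, and the ids below are illustrative, not real internlm2 ids): prompt positions are set to -100 so the loss ignores them, and completion positions keep their token ids.

```python
# Sketch of the label masking seen in the samples above: -100 over the prompt,
# real ids over the completion. Helper name and ids are illustrative only.
def build_labels(prompt_ids, completion_ids):
    """Mask prompt positions with -100; keep completion token ids as labels."""
    return [-100] * len(prompt_ids) + list(completion_ids)

prompt_ids = [1, 918, 26413]
completion_ids = [41889, 281, 2]  # ends with eos id 2 when add_eos_token=True
print(build_labels(prompt_ids, completion_ids))
# → [-100, -100, -100, 41889, 281, 2]
```

With `add_eos_token=False` the completion simply has no trailing eos id, which matches the shorter `labels` list in the second sample.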
On some datasets, the tokenizer concatenates the input and output and then tokenizes the whole string at once, which can produce wrong token boundaries.
Example: with the internlm2 vocabulary, the last token of the input and the first token of the output get merged into a single token here.
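A minimal toy demonstration of this boundary-merge problem (the vocabulary and strings here are made up, not the real internlm2 vocab): a greedy longest-match tokenizer can merge the last character of the prompt with the first character of the answer when they are tokenized as one string, so the joint tokenization disagrees with tokenizing the two parts separately, and any prompt-length-based label mask then covers the wrong positions.

```python
# Toy greedy longest-match tokenizer illustrating the boundary-merge problem.
# The vocabulary is illustrative only, not the real internlm2 vocab.
VOCAB = ["ab", "bc", "a", "b", "c"]

def tokenize(text):
    """Greedily match the longest vocabulary piece at each position."""
    tokens, i = [], 0
    while i < len(text):
        for piece in sorted(VOCAB, key=len, reverse=True):
            if text.startswith(piece, i):
                tokens.append(piece)
                i += len(piece)
                break
        else:
            raise ValueError(f"untokenizable text at position {i}")
    return tokens

prompt, answer = "a", "bc"
joint = tokenize(prompt + answer)               # boundary merged: ["ab", "c"]
separate = tokenize(prompt) + tokenize(answer)  # kept apart:      ["a", "bc"]
print(joint, separate)
```

Here the joint tokenization merges the prompt's `"a"` with the answer's `"b"` into `"ab"`, so masking the first `len(tokenize(prompt))` positions would leave part of the prompt unmasked and mask part of the answer. Tokenizing input and output separately and then concatenating the token ids avoids this.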