datawhalechina / self-llm

《开源大模型食用指南》(Open-Source LLM Quick-Start Guide): rapid deployment of open-source large models in a Linux environment, a deployment tutorial tailored for Chinese beginners
Apache License 2.0

LLaMA3_1-8B-Instruct LoRA Fine-tuning: Data Formatting Question #275

Open Evilxya opened 3 weeks ago

Evilxya commented 3 weeks ago

I noticed that `<|eot_id|>` is appended to the response, but `[tokenizer.pad_token_id]` is also appended to `input_ids`. Aren't these two additions duplicates of each other?

```python
def process_func(example):
    MAX_LENGTH = 384  # The Llama tokenizer splits a single Chinese character into multiple tokens, so allow a larger max length to keep samples intact
    input_ids, attention_mask, labels = [], [], []
    instruction = tokenizer(
        f"<|start_header_id|>user<|end_header_id|>\n\n"
        f"{example['instruction'] + example['input']}<|eot_id|>"
        f"<|start_header_id|>assistant<|end_header_id|>\n\n",
        add_special_tokens=False,  # add_special_tokens=False: do not add special tokens at the start
    )
    response = tokenizer(f"{example['output']}<|eot_id|>", add_special_tokens=False)
    input_ids = instruction["input_ids"] + response["input_ids"] + [tokenizer.pad_token_id]
    attention_mask = instruction["attention_mask"] + response["attention_mask"] + [1]  # the eos token should also be attended to, so set its mask to 1
    labels = [-100] * len(instruction["input_ids"]) + response["input_ids"] + [tokenizer.pad_token_id]
    if len(input_ids) > MAX_LENGTH:  # truncate
        input_ids = input_ids[:MAX_LENGTH]
        attention_mask = attention_mask[:MAX_LENGTH]
        labels = labels[:MAX_LENGTH]
    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "labels": labels
    }
```
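A quick way to check whether the two additions overlap is to decode `tokenizer.pad_token_id` directly. The snippet below is a minimal sketch, not part of the tutorial: the model path is a placeholder, and it assumes the common setup where `pad_token` is set to the EOS token. If the printed ids match, the processed sample ends with two identical end-of-turn tokens.

```python
from transformers import AutoTokenizer

# Hypothetical local path; replace with wherever the Meta-Llama-3.1-8B-Instruct weights live.
tokenizer = AutoTokenizer.from_pretrained("./Meta-Llama-3.1-8B-Instruct")
tokenizer.pad_token = tokenizer.eos_token  # assumption: pad_token is mapped to eos_token before training

eot_id = tokenizer.convert_tokens_to_ids("<|eot_id|>")
print("pad_token:", tokenizer.pad_token, "id:", tokenizer.pad_token_id)
print("<|eot_id|> id:", eot_id)

# Reproduce the tail of a processed sample: the response text already ends with <|eot_id|>,
# and process_func then appends pad_token_id on top of that.
tail = tokenizer("我是助手<|eot_id|>", add_special_tokens=False)["input_ids"] + [tokenizer.pad_token_id]
print("last two ids:", tail[-2:], "-> duplicated" if tail[-1] == tail[-2] else "-> distinct")
```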

GithubX-F commented 1 week ago

```python
    # Drop the extra [tokenizer.pad_token_id]: the <|eot_id|> already appended in the
    # response string serves as the end-of-turn token, so nothing else is needed here.
    input_ids = instruction["input_ids"] + response["input_ids"]
    attention_mask = instruction["attention_mask"] + response["attention_mask"]

    labels = [-100] * len(instruction["input_ids"]) + response["input_ids"]
```
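As a small follow-up sanity check (a sketch assuming the corrected `process_func` and a loaded `tokenizer` from above; the example record is made up), you can confirm that the sequence now ends with a single `<|eot_id|>` and that prompt tokens are masked out of the loss:

```python
example = {"instruction": "你是谁?", "input": "", "output": "我是一个AI助手。"}
features = process_func(example)

eot_id = tokenizer.convert_tokens_to_ids("<|eot_id|>")
assert features["input_ids"][-1] == eot_id  # exactly one end-of-turn token at the end
assert len(features["input_ids"]) == len(features["labels"]) == len(features["attention_mask"])
# Prompt positions are -100 so the loss is computed only on the response tokens.
print(features["labels"][:5], features["labels"][-3:])
```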