datawhalechina / self-llm

《开源大模型食用指南》(Open-Source LLM Quick-Start Guide): rapid deployment of open-source large models in a Linux environment, a deployment tutorial tailored for Chinese beginners
Apache License 2.0

LLaMA3_1-8B-Instruct LoRA Fine-tuning: Data Formatting Question #275

Open Evilxya opened 3 weeks ago

Evilxya commented 3 weeks ago

I noticed that `<|eot_id|>` is appended to the response, but `[tokenizer.pad_token_id]` is also appended to `input_ids`. Aren't these two additions duplicates of each other?

```python
def process_func(example):
    MAX_LENGTH = 384  # The Llama tokenizer splits a single Chinese character into multiple tokens, so allow a larger max length to keep samples intact
    input_ids, attention_mask, labels = [], [], []
    instruction = tokenizer(
        f"<|start_header_id|>user<|end_header_id|>\n\n"
        f"{example['instruction'] + example['input']}<|eot_id|>"
        f"<|start_header_id|>assistant<|end_header_id|>\n\n",
        add_special_tokens=False,  # add_special_tokens=False: do not add special tokens at the start
    )
    response = tokenizer(f"{example['output']}<|eot_id|>", add_special_tokens=False)
    input_ids = instruction["input_ids"] + response["input_ids"] + [tokenizer.pad_token_id]
    attention_mask = instruction["attention_mask"] + response["attention_mask"] + [1]  # the eos token should also be attended to, so set its mask to 1
    labels = [-100] * len(instruction["input_ids"]) + response["input_ids"] + [tokenizer.pad_token_id]
    if len(input_ids) > MAX_LENGTH:  # truncate
        input_ids = input_ids[:MAX_LENGTH]
        attention_mask = attention_mask[:MAX_LENGTH]
        labels = labels[:MAX_LENGTH]
    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "labels": labels
    }
```
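A quick way to check whether the two additions overlap is to decode `tokenizer.pad_token_id` directly. The snippet below is a minimal sketch, not part of the tutorial: the model path is a placeholder, and it assumes the common setup where `pad_token` is set to the EOS token. If the printed ids match, the processed sample ends with two identical end-of-turn tokens.

```python
from transformers import AutoTokenizer

# Hypothetical local path; replace with wherever the Meta-Llama-3.1-8B-Instruct weights live.
tokenizer = AutoTokenizer.from_pretrained("./Meta-Llama-3.1-8B-Instruct")
tokenizer.pad_token = tokenizer.eos_token  # assumption: pad_token is mapped to eos_token before training

eot_id = tokenizer.convert_tokens_to_ids("<|eot_id|>")
print("pad_token:", tokenizer.pad_token, "id:", tokenizer.pad_token_id)
print("<|eot_id|> id:", eot_id)

# Reproduce the tail of a processed sample: the response text already ends with <|eot_id|>,
# and process_func then appends pad_token_id on top of that.
tail = tokenizer("我是助手<|eot_id|>", add_special_tokens=False)["input_ids"] + [tokenizer.pad_token_id]
print("last two ids:", tail[-2:], "-> duplicated" if tail[-1] == tail[-2] else "-> distinct")
```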

GithubX-F commented 1 week ago

```python
    # Drop the extra [tokenizer.pad_token_id]: the <|eot_id|> already appended in the
    # response string serves as the end-of-turn token, so nothing else is needed here.
    input_ids = instruction["input_ids"] + response["input_ids"]
    attention_mask = instruction["attention_mask"] + response["attention_mask"]

    labels = [-100] * len(instruction["input_ids"]) + response["input_ids"]
```
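As a small follow-up sanity check (a sketch assuming the corrected `process_func` and a loaded `tokenizer` from above; the example record is made up), you can confirm that the sequence now ends with a single `<|eot_id|>` and that prompt tokens are masked out of the loss:

```python
example = {"instruction": "你是谁?", "input": "", "output": "我是一个AI助手。"}
features = process_func(example)

eot_id = tokenizer.convert_tokens_to_ids("<|eot_id|>")
assert features["input_ids"][-1] == eot_id  # exactly one end-of-turn token at the end
assert len(features["input_ids"]) == len(features["labels"]) == len(features["attention_mask"])
# Prompt positions are -100 so the loss is computed only on the response tokens.
print(features["labels"][:5], features["labels"][-3:])
```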