Why does the data formatting concatenate the input and the output into input_ids?

Hi, I have been reading the fine-tuning READMEs, for example self-llm/LLaMA3/04-LLaMA3-8B-Instruct Lora 微调.md

The data formatting step contains:

    input_ids, attention_mask, labels = [], [], []
    # add_special_tokens=False: do not prepend special tokens at the start
    instruction = tokenizer(f"<|start_header_id|>user<|end_header_id|>\n\n{example['instruction'] + example['input']}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n", add_special_tokens=False)
    response = tokenizer(f"{example['output']}<|eot_id|>", add_special_tokens=False)
    input_ids = instruction["input_ids"] + response["input_ids"] + [tokenizer.pad_token_id]
    # the EOS token should be attended to as well, so append a 1
    attention_mask = instruction["attention_mask"] + response["attention_mask"] + [1]
    labels = [-100] * len(instruction["input_ids"]) + response["input_ids"] + [tokenizer.pad_token_id]
    if len(input_ids) > MAX_LENGTH:  # truncate
        input_ids = input_ids[:MAX_LENGTH]

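Why are the instruction and the response concatenated and used together as input_ids? For a generative model, keeping the input and the output separate seems, in principle, easier to understand. The attention_mask above also does not appear to distinguish the instruction from the response: both parts are just the attention_mask returned by the tokenizer, so it cannot tell them apart. Only in labels is the leading instruction portion masked out (with -100). I find this hard to follow. Why not keep the input and the output separate, for example with input_ids containing only the instruction and labels containing only the response?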
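To make the question concrete, here is a minimal sketch (not taken from the README) of what the formatting above amounts to. It assumes the usual decoder-only setup: the model predicts each token from all tokens before it, labels are compared against the predictions shifted by one position, and label value -100 is skipped by the loss. The token ids and the helper names `build_example` and `supervised_pairs` are hypothetical and chosen only for illustration.

```python
IGNORE_INDEX = -100  # label value that the loss is assumed to skip


def build_example(instruction_ids, response_ids, pad_id):
    """Mirror the concatenation in the snippet above, using plain lists of token ids."""
    input_ids = instruction_ids + response_ids + [pad_id]
    attention_mask = [1] * len(input_ids)  # every real token is attended to
    labels = [IGNORE_INDEX] * len(instruction_ids) + response_ids + [pad_id]
    return input_ids, attention_mask, labels


def supervised_pairs(input_ids, labels):
    """List the (context, target) pairs that would contribute to the loss.

    The token at position t+1 is predicted from tokens 0..t, so the label at
    position t+1 is paired with the prefix ending at t. Targets equal to
    IGNORE_INDEX are skipped.
    """
    pairs = []
    for t in range(len(input_ids) - 1):
        target = labels[t + 1]
        if target != IGNORE_INDEX:
            pairs.append((input_ids[: t + 1], target))
    return pairs


# Hypothetical token ids, for illustration only.
instruction_ids = [11, 12, 13]
response_ids = [21, 22]
PAD_ID = 0

input_ids, attention_mask, labels = build_example(instruction_ids, response_ids, PAD_ID)
pairs = supervised_pairs(input_ids, labels)
for context, target in pairs:
    print(context, "->", target)
```

Under these assumptions, the instruction tokens appear only as context and never as targets, while each response token is predicted from the instruction plus the response tokens before it. The attention_mask marks real tokens versus padding, and the instruction/response split lives entirely in labels.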

datawhalechina / self-llm

Issue #129