AGI-Edgerunners / LLM-Adapters

Code for our EMNLP 2023 Paper: "LLM-Adapters: An Adapter Family for Parameter-Efficient Fine-Tuning of Large Language Models"
https://arxiv.org/abs/2304.01933
Apache License 2.0

Possible Error in generate_and_tokenize_prompt in finetune.py #12

Open · shreygupta2809 opened this issue 1 year ago

shreygupta2809 commented 1 year ago
def generate_and_tokenize_prompt(data_point):
    full_prompt = generate_prompt(data_point)
    tokenized_full_prompt = tokenize(full_prompt)
    if not train_on_inputs:
        # Tokenize the prompt with the "output" field blanked out; its token count
        # tells us how many positions at the start belong to the prompt.
        user_prompt = generate_prompt({**data_point, "output": ""})
        tokenized_user_prompt = tokenize(user_prompt, add_eos_token=False)
        user_prompt_len = len(tokenized_user_prompt["input_ids"])

        # Set the first user_prompt_len label positions to -100 (ignored by the loss)
        # and keep the remaining label tokens.
        tokenized_full_prompt["labels"] = (
            [-100] * user_prompt_len
            + tokenized_full_prompt["labels"][user_prompt_len:]
        )
    return tokenized_full_prompt

As I understand the codebase, train_on_inputs controls whether the input in the data point is masked out of the loss. With masking on, I would expect the labels to look like <Instruction TOKS> <Input MASK> <Output TOKS>. However, tokenized_user_prompt has the form <Instruction TOKS> <Input TOKS> (since "output" is set to the empty string), say of length L. The resulting tokenized_full_prompt["labels"] would then be <-100 * L> <Instruction TOKS> <Input TOKS>, where the first L positions are the instruction and input tokens only. Hence no input masking is actually being done, and moreover the output tokens will also have been removed from the labels used in the loss calculation.
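To make the slicing concrete, here is a minimal, self-contained sketch of the masking step with made-up token IDs (the values and the simplified tokenize behaviour are hypothetical; it assumes tokenize() copies input_ids into labels, as in the alpaca-lora-style setup this repo follows):

# Toy illustration of the masking arithmetic (hypothetical token IDs, not real
# tokenizer output). Assumes tokenize() copies input_ids into labels.
instruction_and_input_ids = [101, 102, 103, 104]  # stands in for <Instruction TOKS> <Input TOKS>
output_ids = [201, 202, 2]                        # stands in for <Output TOKS> plus EOS

tokenized_full_prompt = {
    "input_ids": instruction_and_input_ids + output_ids,
    "labels": instruction_and_input_ids + output_ids,
}
tokenized_user_prompt = {"input_ids": instruction_and_input_ids}  # "output" set to ""
user_prompt_len = len(tokenized_user_prompt["input_ids"])  # L = 4

# The masking step quoted above:
tokenized_full_prompt["labels"] = (
    [-100] * user_prompt_len
    + tokenized_full_prompt["labels"][user_prompt_len:]
)
print(tokenized_full_prompt["labels"])
# prints: [-100, -100, -100, -100, 201, 202, 2]

Running this shows exactly which label positions end up ignored by the loss and which remain as targets.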

I hope I haven't made any errors in understanding. Thanks