AGI-Edgerunners / LLM-Adapters

Code for our EMNLP 2023 Paper: "LLM-Adapters: An Adapter Family for Parameter-Efficient Fine-Tuning of Large Language Models"
https://arxiv.org/abs/2304.01933
Apache License 2.0

Possible Error in generate_and_tokenize_prompt in finetune.py #12

Open · shreygupta2809 opened this issue 1 year ago

shreygupta2809 commented 1 year ago
def generate_and_tokenize_prompt(data_point):
    full_prompt = generate_prompt(data_point)
    tokenized_full_prompt = tokenize(full_prompt)
    if not train_on_inputs:
        # Tokenize the prompt with the "output" field blanked out; its token count
        # tells us how many positions at the start belong to the prompt.
        user_prompt = generate_prompt({**data_point, "output": ""})
        tokenized_user_prompt = tokenize(user_prompt, add_eos_token=False)
        user_prompt_len = len(tokenized_user_prompt["input_ids"])

        # Set the first user_prompt_len label positions to -100 (ignored by the loss)
        # and keep the remaining label tokens.
        tokenized_full_prompt["labels"] = (
            [-100] * user_prompt_len
            + tokenized_full_prompt["labels"][user_prompt_len:]
        )
    return tokenized_full_prompt

As I understand the codebase, train_on_inputs controls whether the input in the data point is masked out of the loss. With masking on, I would expect the labels to look like <Instruction TOKS> <Input MASK> <Output TOKS>. However, tokenized_user_prompt has the form <Instruction TOKS> <Input TOKS> (since "output" is set to the empty string), say of length L. The resulting tokenized_full_prompt["labels"] would then be <-100 * L> <Instruction TOKS> <Input TOKS>, where the first L positions are the instruction and input tokens only. Hence no input masking is actually being done, and moreover the output tokens will also have been removed from the labels used in the loss calculation.
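To make the slicing concrete, here is a minimal, self-contained sketch of the masking step with made-up token IDs (the values and the simplified tokenize behaviour are hypothetical; it assumes tokenize() copies input_ids into labels, as in the alpaca-lora-style setup this repo follows):

# Toy illustration of the masking arithmetic (hypothetical token IDs, not real
# tokenizer output). Assumes tokenize() copies input_ids into labels.
instruction_and_input_ids = [101, 102, 103, 104]  # stands in for <Instruction TOKS> <Input TOKS>
output_ids = [201, 202, 2]                        # stands in for <Output TOKS> plus EOS

tokenized_full_prompt = {
    "input_ids": instruction_and_input_ids + output_ids,
    "labels": instruction_and_input_ids + output_ids,
}
tokenized_user_prompt = {"input_ids": instruction_and_input_ids}  # "output" set to ""
user_prompt_len = len(tokenized_user_prompt["input_ids"])  # L = 4

# The masking step quoted above:
tokenized_full_prompt["labels"] = (
    [-100] * user_prompt_len
    + tokenized_full_prompt["labels"][user_prompt_len:]
)
print(tokenized_full_prompt["labels"])
# prints: [-100, -100, -100, -100, 201, 202, 2]

Running this shows exactly which label positions end up ignored by the loss and which remain as targets.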

I hope I haven't made any errors in understanding. Thanks