As I understand the codebase, `train_on_inputs` controls whether the input portion of a datapoint is masked. So with masking enabled, the labels should look like `<Instruction TOKS> <Input MASK> <Output TOKS>`. However, `tokenized_user_prompt` would have the format `<Instruction TOKS> <Input TOKS>` (since the output has been set to an empty string), say of length L. Then `tokenized_full_prompt["labels"]` would be `<-100 * L> <Instruction TOKS> <Input TOKS>` (the first L tokens being the instruction and input tokens only). Hence no input masking is actually performed, and worse, the output tokens end up removed from the labels during loss calculation.
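For contrast, here is a minimal sketch of what I would expect prompt masking to look like: the first L label positions are replaced in place with `-100` (the index ignored by the cross-entropy loss) rather than prepended, so the output tokens and the overall sequence length are preserved. The helper name `mask_prompt` is hypothetical, not from the repo:

```python
IGNORE_INDEX = -100  # label value ignored by the cross-entropy loss

def mask_prompt(labels, prompt_len):
    """Replace (not prepend) the first prompt_len label positions.

    The output tokens after prompt_len stay in the labels, and the
    total length is unchanged, so labels still align with input_ids.
    """
    return [IGNORE_INDEX] * prompt_len + labels[prompt_len:]

# Toy example: 2 prompt tokens followed by 3 output tokens.
labels = [10, 11, 12, 13, 14]
masked = mask_prompt(labels, 2)
print(masked)  # [-100, -100, 12, 13, 14]
```

With prepending instead of replacing, the result would be `[-100, -100, 10, 11, 12, 13, 14]`, which is longer than `input_ids` and keeps loss on the prompt tokens while (after truncation) dropping the output tokens, which is the issue I am describing above.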
I hope I haven't made any errors in understanding.
Thanks