Since the mask is assumed to be causal, previously we were taking the reward value from the last token in the sequence rather than from the EOS token. This did not affect Llama-3 models much, since the EOS token ends up applied twice in that case (the chat template already ends with an EOS token), but it could be problematic for other models.
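A minimal sketch of the fix described above, assuming a per-token reward head and right-padded batches; the function name, tensor shapes, and fallback behavior are illustrative assumptions, not this repo's actual API:

```python
import torch

def reward_at_eos(rewards, input_ids, attention_mask, eos_token_id):
    """Select the per-sequence reward at the EOS token position.

    The buggy behavior took the reward at the last non-padded token
    (derived from the attention mask); here we locate the EOS token
    explicitly, so the result is correct even for models whose chat
    template does not already end the sequence with EOS.

    rewards:        (batch, seq_len) per-token reward head outputs
    input_ids:      (batch, seq_len)
    attention_mask: (batch, seq_len), 1 for real tokens, 0 for padding
    """
    is_eos = (input_ids == eos_token_id) & attention_mask.bool()
    has_eos = is_eos.any(dim=1)
    # argmax over a 0/1 tensor gives the index of the first EOS token.
    first_eos = is_eos.long().argmax(dim=1)
    # Fallback for sequences with no EOS: last non-padded token.
    last_token = attention_mask.sum(dim=1) - 1
    idx = torch.where(has_eos, first_eos, last_token)
    return rewards.gather(1, idx.unsqueeze(1)).squeeze(1)
```

With right padding, `attention_mask.sum(dim=1) - 1` alone points at whatever token happens to be last, which only coincides with EOS when the template has already appended one; indexing on `input_ids == eos_token_id` removes that dependence on the template.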