"The generative process is the same as in auto-regressive language models: generation begins with an empty string, and at the i-th step a token z_i is sampled"
Since generation proceeds token by token, what does it mean to compute a reward for an incomplete sentence in the learning objective? Thanks in advance for helping me understand this :)
Hi @StarDewXXX, to compute the intermediate reward (i.e. after each token) we append an EOS token to the tokens generated so far and then compute the reward.
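To make that concrete, here is a minimal sketch of the idea, not the repository's actual code. The names `intermediate_rewards`, `toy_reward`, and the `"<eos>"` string are made up for illustration, and the toy reward only stands in for whatever reward model is really used.

```python
from typing import Callable, List

EOS = "<eos>"  # hypothetical end-of-sequence marker, for illustration only


def intermediate_rewards(
    tokens: List[str],
    reward_fn: Callable[[List[str]], float],
    eos: str = EOS,
) -> List[float]:
    """Return one reward per generated token.

    The reward after token i is computed on the first i tokens with EOS
    appended, so the reward function always sees a terminated sequence.
    """
    rewards = []
    for i in range(1, len(tokens) + 1):
        prefix = tokens[:i]                      # tokens generated so far
        rewards.append(reward_fn(prefix + [eos]))  # terminate, then score
    return rewards


def toy_reward(sequence: List[str]) -> float:
    # Toy stand-in for a real reward model: counts the non-EOS tokens.
    assert sequence[-1] == EOS, "reward is only defined on terminated sequences"
    return float(len(sequence) - 1)


print(intermediate_rewards(["the", "cat", "sat"], toy_reward))
# prints [1.0, 2.0, 3.0]
```

So each prefix is scored as if generation had stopped right there, which is what gives the reward of an "incomplete" sentence a well-defined meaning.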