Question about the learning objective mentioned in paper

GFNOrg / gfn-lm-tuning

MIT License


Question about the learning objective mentioned in paper #5

Open StarDewXXX opened 3 months ago

StarDewXXX commented 3 months ago

"The generative process is the same as in auto-regressive language models: generation begins with an empty string, and at the $i$-th step a token $z_i$ is sampled"

Since generation proceeds token by token, what does it mean to compute a reward for an incomplete sentence in the learning objective? I'd appreciate any help understanding this :)

MJ10 commented 3 months ago

Hi @StarDewXXX, to compute the intermediate reward (i.e., the reward after each token), we append an EOS token to the tokens generated so far and then compute the reward on that terminated sequence.
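
For illustration, here is a minimal sketch of that idea. It is not the repository's code: `EOS_ID`, `intermediate_rewards`, `reward_fn`, and `toy_reward` are hypothetical names, and the toy reward is only a stand-in for whatever scores a complete sequence in practice.

```python
from typing import Callable, List, Sequence

EOS_ID = 0  # hypothetical EOS token id, chosen only for this sketch


def intermediate_rewards(
    tokens: Sequence[int],
    reward_fn: Callable[[List[int]], float],
    eos_id: int = EOS_ID,
) -> List[float]:
    """Score every prefix of `tokens` as if generation had stopped there.

    Entry t is reward_fn(tokens[:t] + [eos_id]), so entry 0 scores the
    empty string followed by EOS.
    """
    rewards = []
    for t in range(len(tokens) + 1):
        # Terminate the partial sequence so the reward sees a complete one.
        terminated = list(tokens[:t]) + [eos_id]
        rewards.append(reward_fn(terminated))
    return rewards


def toy_reward(seq: List[int]) -> float:
    # Assumed stand-in reward: penalize length (EOS itself is not counted).
    return float(-(len(seq) - 1))


rewards = intermediate_rewards([5, 7, 9], toy_reward)
```

With three generated tokens this yields four rewards, one for each possible stopping point, including stopping immediately at the empty string.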