Closed · andyclsr closed this 1 day ago
The original reward is indeed sparse (non-zero only at the EOS token), but it can be reparametrized into a dense per-token form using the log-probabilities of the language model. See Section 4 for details.
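In case it helps, here is a minimal sketch of how the two views coexist, following [1] (my notation, assuming the deterministic token-level MDP, KL coefficient β, and V* = 0 at terminal states; not necessarily identical to Eq. 13): the environment reward stays sparse, while the dense per-token log-ratio is a reparametrization whose sum recovers the sparse return up to a prompt-only constant.

```latex
% Sketch following [1]; notation is from that paper, not necessarily Eq. 13 here.
% The environment reward stays sparse:
%   r(s_t, a_t) = 0 unless a_t = EOS, where it equals the sequence reward R(x, y).
\begin{align}
  % The KL-regularized optimal policy gives a dense per-token log-ratio:
  \beta \log \frac{\pi^*(a_t \mid s_t)}{\pi_{\mathrm{ref}}(a_t \mid s_t)}
    &= Q^*(s_t, a_t) - V^*(s_t) \\
  % With deterministic transitions, Q^*(s_t, a_t) = r(s_t, a_t) + V^*(s_{t+1})
  % and V^*(s_T) = 0 at the terminal state, so the sum telescopes:
  \sum_{t=0}^{T-1} \beta \log \frac{\pi^*(a_t \mid s_t)}{\pi_{\mathrm{ref}}(a_t \mid s_t)}
    &= \sum_{t=0}^{T-1} r(s_t, a_t) + V^*(s_T) - V^*(s_0)
     = R(x, y) - V^*(s_0).
\end{align}
```

So the dense per-token terms and the sparse EOS-only reward are consistent: they differ only by V*(s_0), which depends on the prompt alone and cancels when comparing completions of the same prompt.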
For any other paper-related questions (not code), feel free to email me!
Thanks anyway :)
Hi dear authors, thanks for your excellent work and for open-sourcing the code. I have a question about Eq. 13: you claim the reward is sparse and non-zero only when a_t is EOS, yet the discussion takes place in a token-level MDP where a token-level reward is considered. Isn't that contradictory? One could also argue that the reward is r(s,a) = log(...) - log(...), as in [1]. Could you share some insight on this? Anyway, thank you very much!
[1] Rafailov, Rafael, et al. "From $r$ to $Q^*$: Your Language Model Is Secretly a Q-Function."