Closed · andyclsr closed this 1 day ago
The original reward is indeed sparse (non-zero only at the EOS token), but it can be reparametrized into a dense per-token form using the log-probabilities of the language model. See Section 4 for details.
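In case it helps, here is a minimal sketch of how the two views coexist, following [1] (my notation, assuming the deterministic token-level MDP, KL coefficient β, and V* = 0 at terminal states; not necessarily identical to Eq. 13): the environment reward stays sparse, while the dense per-token log-ratio is a reparametrization whose sum recovers the sparse return up to a prompt-only constant.

```latex
% Sketch following [1]; notation is from that paper, not necessarily Eq. 13 here.
% The environment reward stays sparse:
%   r(s_t, a_t) = 0 unless a_t = EOS, where it equals the sequence reward R(x, y).
\begin{align}
  % The KL-regularized optimal policy gives a dense per-token log-ratio:
  \beta \log \frac{\pi^*(a_t \mid s_t)}{\pi_{\mathrm{ref}}(a_t \mid s_t)}
    &= Q^*(s_t, a_t) - V^*(s_t) \\
  % With deterministic transitions, Q^*(s_t, a_t) = r(s_t, a_t) + V^*(s_{t+1})
  % and V^*(s_T) = 0 at the terminal state, so the sum telescopes:
  \sum_{t=0}^{T-1} \beta \log \frac{\pi^*(a_t \mid s_t)}{\pi_{\mathrm{ref}}(a_t \mid s_t)}
    &= \sum_{t=0}^{T-1} r(s_t, a_t) + V^*(s_T) - V^*(s_0)
     = R(x, y) - V^*(s_0).
\end{align}
```

So the dense per-token terms and the sparse EOS-only reward are consistent: they differ only by V*(s_0), which depends on the prompt alone and cancels when comparing completions of the same prompt.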
For any other paper-related questions (not code), feel free to email me!
Thanks anyway :)
Hi dear authors, thanks for your excellent work and for open-sourcing the code. I have a question about Eq. 13: you claim the reward is sparse and non-zero only when a_t is EOS, yet the discussion takes place in a token-level MDP where a token-level reward is considered. Isn't that contradictory? One could also argue that the reward is r(s,a) = log(...) - log(...), as in [1]. Could you share some insight on this? Anyway, thank you very much!
[1] Rafailov, Rafael, et al. "From $r$ to $Q^*$: Your Language Model Is Secretly a Q-Function."