YifeiZhou02 / ArCHer

Research Code for "ArCHer: Training Language Model Agents via Hierarchical Multi-Turn RL"
https://yifeizhou02.github.io/archer.io/

Will there be a useBaseline update? #11

Closed cuts2k closed 1 month ago

cuts2k commented 1 month ago

Hello! First, thank you for sharing this code; it's been a great learning experience for me going through it!

If I understand correctly, the token-level baseline workflow is missing from this implementation. Are there any plans to still add it to this repo, especially since it looked like it makes a large difference in the paper?

Also, two more questions if I may: why didn't you choose PPO for the token-level model? (I might be missing something obvious here, as I'm very new to RL, but to me it looked like it would have been the default choice there.)

Lastly, am I correct that to switch the critic model it should be sufficient to just change the model name in the config, even if moving to a slightly different architecture?

YifeiZhou02 commented 1 month ago

Thanks for your interest in our work.

Unfortunately the token-level baseline is not supported in the repo, as it has been quite a while since I ran that ablation.

For your other questions: indeed, PPO seems to be the default choice these days. However, it can also be very sensitive to hyperparameter choices. Furthermore, some recent works (e.g. https://arxiv.org/abs/2402.14740v1) show that many design choices in PPO may not be necessary, and a REINFORCE-style algorithm can work as well as PPO while being simpler and more stable.
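For intuition, here is a minimal sketch (illustrative only, not code from this repo) of the two token-level losses. The tensor names are placeholders; the REINFORCE variant only needs log-probabilities and returns (optionally minus a baseline), while the PPO clipped objective adds an importance ratio against the old policy and a clipping hyperparameter to tune.

```python
import torch

def reinforce_loss(log_probs, returns, baseline=0.0):
    # REINFORCE-style update: weight each token's log-probability by its return
    # (optionally with a baseline subtracted). No ratio to a frozen policy,
    # no clipping -- fewer moving parts to tune.
    return -(log_probs * (returns - baseline)).mean()

def ppo_clip_loss(log_probs, old_log_probs, advantages, clip_eps=0.2):
    # PPO clipped surrogate: importance ratio against the old policy,
    # clipped so each update stays within a trust region.
    ratio = torch.exp(log_probs - old_log_probs)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    return -torch.min(ratio * advantages, clipped * advantages).mean()
```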

I assume it should be fine as long as the architecture is not too different (this interface is determined by Hugging Face). If you do not run into any runtime errors, it is likely fine.
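As a rough illustration of why a config change is usually enough (a sketch assuming the critic wraps a Hugging Face `AutoModel`; the actual config key and wrapper class in this repo may differ):

```python
from transformers import AutoModel, AutoTokenizer

# Hypothetical config value; the real key name in this repo's configs may differ.
critic_model_name = "roberta-base"

# The backbone is loaded purely by name through the Hugging Face Auto* API,
# so moving to a similar architecture usually only means changing this string.
backbone = AutoModel.from_pretrained(critic_model_name)
tokenizer = AutoTokenizer.from_pretrained(critic_model_name)

# Any value head built on top should read the hidden size from the loaded
# backbone rather than hard-coding it, so it adapts to the new architecture.
hidden_size = backbone.config.hidden_size
```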

cuts2k commented 1 month ago

Thank you very much, I really appreciate the answers.