yananchen1989 · closed 4 months ago
I double-checked the paper https://arxiv.org/pdf/2403.02502 again and have not found much information about how the PPO baseline is trained. Is it fair to say that the potential of PPO in this repo has not been fully explored and investigated?
Thanks for your interest in our paper!
`train_ppo.py` in this repo is an offline implementation of the PPO baseline. For online PPO, since some environments (like WebShop) are not designed for RL training, we were unable to implement the agent-environment interaction in a single .py training script and used a shell script instead. We currently have no plans to open-source the code and datasets.
Actually, we encountered a number of practical difficulties when implementing the PPO baseline `train_ppo.py` based on TRL. Recently, several excellent repos for fine-tuning LLM agents with RL have been released, such as LlamaGym and Lamorel; they may be a better starting point for implementing an online PPO baseline.
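For illustration, here is a minimal sketch of what such an online generate-reward-update loop could look like under the classic TRL `PPOTrainer` API (v0.x) — not the repo's code. The `DummyEnv` class with its `reset`/`step_action` methods is a hypothetical stand-in for a real agent environment, a single-turn episode is assumed for brevity, and the model name is a placeholder:

```python
import torch
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer


class DummyEnv:
    """Hypothetical stand-in for an agent environment such as WebShop."""

    def reset(self) -> str:
        # Return the task instruction / initial observation as the prompt.
        return "Instruction: find a red t-shirt under $20.\nObservation: ..."

    def step_action(self, action: str) -> float:
        # Execute the agent's action and return the episode reward.
        return 0.0


model_name = "meta-llama/Llama-2-7b-hf"  # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLMWithValueHead.from_pretrained(model_name)
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained(model_name)

config = PPOConfig(batch_size=8, mini_batch_size=2)
ppo_trainer = PPOTrainer(config, model, ref_model, tokenizer)
env = DummyEnv()

generation_kwargs = {"max_new_tokens": 128, "do_sample": True,
                     "pad_token_id": tokenizer.eos_token_id}

for _ in range(100):  # number of PPO updates (arbitrary)
    queries, responses, rewards = [], [], []
    for _ in range(config.batch_size):
        obs = env.reset()
        query = tokenizer(obs, return_tensors="pt").input_ids[0]
        full = ppo_trainer.generate(query, **generation_kwargs)[0]
        response = full[query.shape[0]:]  # keep only the generated tokens
        action = tokenizer.decode(response, skip_special_tokens=True)
        reward = env.step_action(action)
        queries.append(query)
        responses.append(response)
        rewards.append(torch.tensor(reward))
    stats = ppo_trainer.step(queries, responses, rewards)  # one PPO update
```

Repos like LlamaGym package a loop of roughly this shape behind an agent abstraction, which is why they may be an easier starting point than wiring the environment interaction by hand.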
Hi there, some quick questions about `train_ppo.py`: I have not found `data_pm/webshop_ppo.json` in the repo yet. Also, the script only calls `step` (https://github.com/Yifan-Song793/ETO/blob/main/fastchat/train/train_ppo.py#L297) without any `generate`, which means the dataset is fully prepared with pre-calculated rewards and there is no interleaving of generation and model updates during PPO. May I ask why you used this setting rather than the generate-reward-update paradigm? Thanks.
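For comparison with the online loop above, the offline pattern described here would look roughly like this under the classic TRL API. This is a sketch, not the repo's actual `train_ppo.py`, and the `prompt`/`response`/`reward` field names are assumptions about the file layout; the point is that trajectories and rewards are loaded from a prepared file and only `step` is ever called:

```python
import json

import torch
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLMWithValueHead.from_pretrained(model_name)
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained(model_name)

config = PPOConfig(batch_size=8, mini_batch_size=2)
ppo_trainer = PPOTrainer(config, model, ref_model, tokenizer)

# Assumed record layout: [{"prompt": ..., "response": ..., "reward": ...}, ...]
with open("data_pm/webshop_ppo.json") as f:
    records = json.load(f)

# Iterate over full batches of pre-collected trajectories; no generate() call,
# so the policy never produces fresh rollouts between updates.
bs = config.batch_size
for i in range(0, len(records) - bs + 1, bs):
    batch = records[i:i + bs]
    queries = [tokenizer(r["prompt"], return_tensors="pt").input_ids[0]
               for r in batch]
    responses = [tokenizer(r["response"], return_tensors="pt").input_ids[0]
                 for r in batch]
    rewards = [torch.tensor(float(r["reward"])) for r in batch]
    ppo_trainer.step(queries, responses, rewards)  # PPO update on fixed data
```

Because the responses come from a fixed behavior policy rather than the current one, this setup is effectively off-policy after the first update, which is one general reason the online generate-reward-update loop is usually preferred when the environment supports it.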