Yifan-Song793 / ETO

Trial and Error: Exploration-Based Trajectory Optimization for LLM Agents (ACL 2024 Main Conference)
https://arxiv.org/abs/2403.02502

questions about PPO #4

Closed: yananchen1989 closed this issue 4 months ago

yananchen1989 commented 4 months ago

Hi there, some quick questions about train_ppo.py:

  1. Is there any plan to release the dataset used to train PPO (https://github.com/Yifan-Song793/ETO/blob/main/fastchat/train/train_ppo.py#L65)? I have not found data_pm/webshop_ppo.json in the repo yet.
  2. Is there a bash script to run fastchat/train/train_ppo.py? I would like to know the parameters passed to the training.
  3. It seems that for the PPO training there is only a `step` call in the script (https://github.com/Yifan-Song793/ETO/blob/main/fastchat/train/train_ppo.py#L297), without any `generate`. This means the dataset is fully prepared with pre-computed rewards, and there are no rollouts with the updated model during PPO. May I ask why you use this setting rather than the generate-reward-update paradigm (a sketch of the step-only pattern I mean follows this list)? Thanks.
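
For reference, the step-only pattern I am referring to looks roughly like the sketch below, using TRL's pre-0.12 `PPOTrainer.step` API. The JSON field names, base model, and batching here are just my guesses, not necessarily what train_ppo.py actually does:

```python
# A minimal sketch (not the repo's actual script) of a step-only offline PPO update,
# using TRL's pre-0.12 PPOTrainer API. The JSON field names ("prompt", "response",
# "reward") and the base model are placeholders.
import json

import torch
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

config = PPOConfig(model_name="meta-llama/Llama-2-7b-hf", batch_size=8, mini_batch_size=2)
model = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name)
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name)
tokenizer = AutoTokenizer.from_pretrained(config.model_name)
tokenizer.pad_token = tokenizer.eos_token
ppo_trainer = PPOTrainer(config, model, ref_model, tokenizer)

# Offline data: trajectories collected beforehand, with rewards already attached.
with open("data_pm/webshop_ppo.json") as f:
    data = json.load(f)

for start in range(0, len(data) - config.batch_size + 1, config.batch_size):
    batch = data[start:start + config.batch_size]
    queries = [tokenizer(x["prompt"], return_tensors="pt").input_ids[0] for x in batch]
    responses = [tokenizer(x["response"], return_tensors="pt").input_ids[0] for x in batch]
    rewards = [torch.tensor(float(x["reward"])) for x in batch]
    # No generate() call: the "actions" are fixed, PPO only re-weights the stored responses.
    stats = ppo_trainer.step(queries, responses, rewards)
```
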
yananchen1989 commented 4 months ago

I double-checked the paper (https://arxiv.org/pdf/2403.02502) and did not find much information about how the PPO baseline is trained. Is it fair to say that the potential of PPO has not been fully explored and investigated in this repo?

Yifan-Song793 commented 4 months ago

Thanks for your interest in our paper!

train_ppo.py in this repo is an offline implementation of the PPO baseline. For online PPO, since some environments (like WebShop) are not designed for RL training, we were unable to implement the agent-environment interaction in a single .py training script and used a shell script instead. We currently have no plans to open-source that code and its datasets.

In fact, we encountered a number of practical difficulties when implementing the PPO baseline:

  1. The environments are not designed for RL training. For example, some environments do not support parallel execution, which makes rollouts very inefficient (a sketch of the required interaction loop follows this list).
  2. Current LLM RL frameworks do not support the multi-turn scenario, so we implemented train_ppo.py ourselves on top of TRL.
  3. PPO is very unstable in our multi-turn scenario. We tested several LLM RL frameworks, including TRL and trlX. Taking ALFWorld as an example, we found that under most hyperparameters the average reward drops below 10%.
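
For concreteness, the kind of online generate-reward-update loop this would require looks roughly like the sketch below. To be clear, this is not our released code: `WebShopEnv` is a hypothetical gym-style wrapper (in our setup the interaction is handled by a shell script, as mentioned above), the base model is a placeholder, it uses TRL's pre-0.12 API, and copying the final task reward to every turn is exactly the crude workaround for point 2:

```python
# Rough sketch of an online generate-reward-update PPO loop for a multi-turn agent.
# `WebShopEnv` is a hypothetical gym-style environment wrapper, not part of this repo.
import torch
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

config = PPOConfig(model_name="meta-llama/Llama-2-7b-hf", batch_size=8, mini_batch_size=2)
model = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name)
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name)
tokenizer = AutoTokenizer.from_pretrained(config.model_name)
tokenizer.pad_token = tokenizer.eos_token
ppo_trainer = PPOTrainer(config, model, ref_model, tokenizer)

env = WebShopEnv()  # hypothetical wrapper around the environment server
gen_kwargs = {"max_new_tokens": 64, "do_sample": True, "pad_token_id": tokenizer.eos_token_id}

buf_q, buf_r, buf_s = [], [], []
for episode in range(1000):
    obs, done, reward = env.reset(), False, 0.0
    turn_q, turn_r = [], []
    while not done:
        query = tokenizer(obs, return_tensors="pt").input_ids[0]
        # Generate one action (one turn) with the current policy.
        response = ppo_trainer.generate([query], return_prompt=False, **gen_kwargs)[0]
        action = tokenizer.decode(response, skip_special_tokens=True)
        obs, reward, done, _ = env.step(action)  # the slow, non-parallel part
        turn_q.append(query)
        turn_r.append(response)
    # Crude credit assignment: copy the final episode reward to every turn.
    buf_q += turn_q
    buf_r += turn_r
    buf_s += [torch.tensor(float(reward))] * len(turn_q)
    # TRL's step() expects exactly config.batch_size samples, so update once the buffer fills.
    while len(buf_q) >= config.batch_size:
        ppo_trainer.step(buf_q[:config.batch_size], buf_r[:config.batch_size], buf_s[:config.batch_size])
        buf_q, buf_r, buf_s = (buf_q[config.batch_size:], buf_r[config.batch_size:],
                               buf_s[config.batch_size:])
```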

Recently, several awesome repos for fine-tuning LLM agents with RL have been released, such as LlamaGym and Lamorel. They may be a better starting point for implementing an online PPO baseline.