CRLqinliang closed this issue 1 year ago
@ClementRomac
Hi,
The algorithm actually updates both the value and the policy. You can find this update in the PPOUpdater by looking at PPO's loss: https://github.com/flowersteam/Grounding_LLMs_with_online_RL/blob/276aab720d0c5e9ba8e39137dfdf9ae4c38b5c98/experiments/train_language_agent.py#LL264C13-L264C107
In our paper and in this repo's code, the LLM isn't frozen, so performing an update to minimize this loss affects the whole model (i.e. the LLM backbone and the value head). So, as you said, updating the LLM (in our case, the whole network) is needed for policy improvement.
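To make the point concrete, here is a minimal, illustrative sketch (plain Python, not the repo's actual code; all names are made up) of a combined PPO objective. Because the clipped policy term and the value term are summed into one loss, a single backward pass sends gradients into every parameter the loss depends on — both the shared LLM backbone and the value head:

```python
import math

def ppo_loss(logp_new, logp_old, advantage, value_pred, value_target,
             clip_eps=0.2, vf_coef=0.5):
    """Combined PPO objective for a single sample (scalar version).

    logp_new and value_pred both come from forward passes through the
    shared LLM backbone, so minimizing this loss updates the whole
    model (backbone + value head), not just the value head.
    """
    # Probability ratio between new and old policies
    ratio = math.exp(logp_new - logp_old)
    # Clipped surrogate objective (negated, since we minimize)
    clipped_ratio = max(min(ratio, 1.0 + clip_eps), 1.0 - clip_eps)
    policy_loss = -min(ratio * advantage, clipped_ratio * advantage)
    # Squared-error value loss, weighted by vf_coef
    value_loss = (value_pred - value_target) ** 2
    return policy_loss + vf_coef * value_loss
```

With identical old and new log-probs the ratio is 1, so the policy term reduces to `-advantage`, and the value term adds `vf_coef * (value_pred - value_target)**2` on top.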
As a piece of additional information, we also ran some very small-scale tests freezing the LLM and training only the language modeling head and the value head, which gave poor results. One can, however, avoid finetuning the whole LLM by using a lightweight method such as LoRA: https://twitter.com/ClementRomac/status/1667120630762962945.
Hope this helps, do not hesitate to ask further questions :)
Okay, I thought the green part of the algorithm is frozen 😅
Recently, I have been working with your code, and I want to ask a question about GFLAN-T5. It seems this algorithm only updates the value head and uses the pre-trained language modeling head. But to my knowledge, updating only the value head would not change the parameters of the pre-trained language modeling head, so how does the policy network improve its ability? (Sorry, it might be a simple question, but I just cannot get it. Looking forward to your reply. Thx