CRLqinliang closed this issue 1 year ago
@ClementRomac
Hi,
The algorithm actually updates both the value and the policy. You can find this update in the PPOUpdater by looking at PPO's loss: https://github.com/flowersteam/Grounding_LLMs_with_online_RL/blob/276aab720d0c5e9ba8e39137dfdf9ae4c38b5c98/experiments/train_language_agent.py#LL264C13-L264C107
In our paper and in this repo's code, the LLM isn't frozen, so performing an update to minimize this loss affects the whole model (i.e. the LLM backbone and the value head). So, as you said, updating the LLM (in our case, the whole network) is needed for policy improvement.
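To make the point concrete, here is a minimal, illustrative sketch (plain Python, not the repo's actual code; all names are made up) of a combined PPO objective. Because the clipped policy term and the value term are summed into one loss, a single backward pass sends gradients into every parameter the loss depends on — both the shared LLM backbone and the value head:

```python
import math

def ppo_loss(logp_new, logp_old, advantage, value_pred, value_target,
             clip_eps=0.2, vf_coef=0.5):
    """Combined PPO objective for a single sample (scalar version).

    logp_new and value_pred both come from forward passes through the
    shared LLM backbone, so minimizing this loss updates the whole
    model (backbone + value head), not just the value head.
    """
    # Probability ratio between new and old policies
    ratio = math.exp(logp_new - logp_old)
    # Clipped surrogate objective (negated, since we minimize)
    clipped_ratio = max(min(ratio, 1.0 + clip_eps), 1.0 - clip_eps)
    policy_loss = -min(ratio * advantage, clipped_ratio * advantage)
    # Squared-error value loss, weighted by vf_coef
    value_loss = (value_pred - value_target) ** 2
    return policy_loss + vf_coef * value_loss
```

With identical old and new log-probs the ratio is 1, so the policy term reduces to `-advantage`, and the value term adds `vf_coef * (value_pred - value_target)**2` on top.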
As a piece of additional information, we also ran some very small-scale tests freezing the LLM and training only the language modeling head and the value head, which gave poor results. One can, however, avoid finetuning the whole LLM by using a lightweight method such as LoRA: https://twitter.com/ClementRomac/status/1667120630762962945.
Hope this helps, do not hesitate to ask further questions :)
Okay, I thought the green part of the algorithm is frozen 😅
Recently, I have been working with your code, and I want to ask a question about GFLAN-T5. It seems this algorithm only updates the value head and uses the pre-trained language modeling head. But to my knowledge, updating only the value head would not change the parameters of the pre-trained language modeling head, so how does the policy network improve its ability? (Sorry, it might be a simple question, but I just cannot get it. Looking forward to your reply. Thx