OpenLLMAI / OpenRLHF

An Easy-to-use, Scalable and High-performance RLHF Framework (70B+ PPO Full Tuning & Iterative DPO & LoRA & Mixtral)
https://openrlhf.readthedocs.io/
Apache License 2.0

Actor-Critic-Model #230

Open mgerstgrasser opened 4 months ago

mgerstgrasser commented 4 months ago

If I understand the current PPO code correctly, it instantiates completely separate actor and critic models, with no layers shared between them. (But correct me if that's wrong.)

Instead of that, would it be possible to just add a critic output head on the actor model, i.e. share all but the last layer (or any number of layers) between actor and critic?
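
For concreteness, here's a rough sketch of the kind of thing I have in mind (purely illustrative; the class and attribute names are mine, not OpenRLHF's actual API):

```python
# Rough sketch of a shared-backbone actor-critic (illustrative names only,
# not OpenRLHF's actual classes): all transformer layers are shared, and a
# scalar value head sits alongside the LM head.
import torch.nn as nn
from transformers import AutoModelForCausalLM


class SharedActorCritic(nn.Module):
    def __init__(self, pretrain: str):
        super().__init__()
        self.model = AutoModelForCausalLM.from_pretrained(pretrain)
        # Critic head on top of the shared trunk; embeddings, transformer
        # layers, and the LM head are all shared with the actor.
        self.value_head = nn.Linear(self.model.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask=None):
        outputs = self.model(
            input_ids,
            attention_mask=attention_mask,
            output_hidden_states=True,
        )
        logits = outputs.logits                       # actor: next-token logits
        hidden = outputs.hidden_states[-1]            # last shared hidden states
        values = self.value_head(hidden).squeeze(-1)  # critic: per-token values
        return logits, values
```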

hijkzzz commented 4 months ago

We do not support this feature currently.

mgerstgrasser commented 4 months ago

Got it, thank you for replying so quickly!

Are there specific reasons against having this as an option? If not, would you potentially be open to a pull request to add it?

hijkzzz commented 4 months ago

We don't have the bandwidth to do that yet.

mickel-liu commented 4 months ago

I was wondering the same thing: why isn't parameter sharing between the policy and value models a thing?

Has anyone done experiments (besides FDU) to confirm why it is necessary to have separate parameters?

It doesn't make sense to me unless there are tangible benefits; otherwise, why not save the memory?

mgerstgrasser commented 4 months ago

> I was wondering the same thing: why isn't parameter sharing between the policy and value models a thing?
>
> Has anyone done experiments (besides FDU) to confirm why it is necessary to have separate parameters?
>
> It doesn't make sense to me unless there are tangible benefits; otherwise, why not save the memory?

There are some known drawbacks to parameter sharing; in particular, it has been observed to negatively impact training early on (see Appendix G.1 of "Learning to summarize from human feedback"). So it's a trade-off between performance and memory footprint, but I agree it would be nice to have as an option. I might look into implementing it.

One other thing is that with separate models you could use a smaller critic network than the actor, e.g. keep the critic at 7B even if your actor is 70B. That would greatly reduce the memory overhead.
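
As a minimal sketch of that setup (the model names are placeholders, and the bare `nn.Linear` value head is just for illustration, not OpenRLHF's actual critic wrapper):

```python
# Illustrative only: pair a large actor with a much smaller, fully
# separate critic. Model names are placeholders.
import torch.nn as nn
from transformers import AutoModel, AutoModelForCausalLM

actor = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-70b-hf")

# The critic is a separate 7B trunk plus a scalar value head, so its
# memory footprint is a small fraction of the actor's.
critic_trunk = AutoModel.from_pretrained("meta-llama/Llama-2-7b-hf")
value_head = nn.Linear(critic_trunk.config.hidden_size, 1)
```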