Closed xesdiny closed 1 year ago
Hey, we went ahead with separate policy and value networks for no particular reason. Of course, there could be memory optimizations with shared policy and value networks. Feel free to adapt the policy implementation for your use case. Also, be reminded that self._ref_model
is kept constant, so attaching _value_head to it does not make sense.
"Of course, there could be memory optimizations with shared policy and value networks." Yeah, I just need to connect a value_head (MLP) to the policy model instead of the ref model.
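For reference, a shared-backbone setup along those lines could look like the minimal sketch below. This is not the repo's actual implementation; the class and attribute names (`PolicyWithValueHead`, `backbone`, `lm_head`, `value_head`) are hypothetical, and a tiny embedding stands in for the real policy transformer. The point is just that one forward pass through the shared trunk produces both policy logits and per-token values, so the value head adds only a small MLP's worth of parameters:

```python
import torch
import torch.nn as nn

class PolicyWithValueHead(nn.Module):
    """Hypothetical sketch: one shared backbone feeding both an LM head
    (policy logits) and a small MLP value head (per-token scalar values)."""

    def __init__(self, hidden_dim=16, vocab_size=32):
        super().__init__()
        # Stand-in backbone; in practice this would be the policy transformer.
        self.backbone = nn.Embedding(vocab_size, hidden_dim)
        self.lm_head = nn.Linear(hidden_dim, vocab_size)
        # Value head: a small MLP producing one scalar per token position.
        self.value_head = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, tokens):
        h = self.backbone(tokens)                # [batch, seq, hidden]
        logits = self.lm_head(h)                 # policy logits
        values = self.value_head(h).squeeze(-1)  # per-token value estimates
        return logits, values

model = PolicyWithValueHead()
tokens = torch.randint(0, 32, (2, 5))
logits, values = model(tokens)
print(logits.shape, values.shape)  # torch.Size([2, 5, 32]) torch.Size([2, 5])
```

Note this only works on a trainable model: gradients from the value loss must flow into the shared trunk (or at least into the head), which is why hanging the head off the frozen reference model doesn't fit.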
Why do you need to define _value_model in the policy? I think you could use _ref_model plus a _value_head to get the value; that would cut at least 1/3 of the parameter and backward-gradient overhead in GPU memory.