facebookresearch / Pearl

A Production-ready Reinforcement Learning AI Agent Library brought by the Applied Reinforcement Learning team at Meta.
MIT License

Phasic Policy Gradient #106

Open cvnad1 opened 4 weeks ago

cvnad1 commented 4 weeks ago

@rodrigodesalvobraz I would like to know whether Phasic Policy Gradient (https://arxiv.org/abs/2009.04416) is implemented. If it is not, I would like to try implementing it and adding it to Pearl.

rodrigodesalvobraz commented 4 weeks ago

Hi, @cvnad1. That is not implemented in Pearl. If you could do that, it would be much appreciated. Thank you.

cvnad1 commented 2 weeks ago

@rodrigodesalvobraz

Hi, I have been going through Pearl's functions and classes for the past few days. I was looking especially at the PPO implementation, since PPG is heavily based on it, apart from the new auxiliary training phase.

I noticed that the PPO implementation uses separate networks for the policy and the value function, whereas both should share a common base with different heads. Am I wrong about this? Did I miss something?

yiwan-rl commented 2 weeks ago

@cvnad1. I think it is fine to keep the policy and value networks separate. Do you have any specific concern about separating them?

cvnad1 commented 2 weeks ago

@yiwan-rl Correct me if I am wrong, but I believe that in the official PPO implementations of most libraries, the network has a common base, a value head, and a policy head, as this gives better results compared to training two separate networks.
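For concreteness, this is roughly the layout I mean (a minimal PyTorch sketch for the discrete-action case; the class name and layer sizes are just illustrative, not Pearl code):

```python
import torch
import torch.nn as nn


class SharedBaseActorCritic(nn.Module):
    """PPO-style network: one shared trunk with a policy head and a value head."""

    def __init__(self, obs_dim: int, action_dim: int, hidden_dim: int = 64):
        super().__init__()
        # Shared base (trunk) used by both heads.
        self.base = nn.Sequential(
            nn.Linear(obs_dim, hidden_dim), nn.Tanh(),
            nn.Linear(hidden_dim, hidden_dim), nn.Tanh(),
        )
        self.policy_head = nn.Linear(hidden_dim, action_dim)  # action logits
        self.value_head = nn.Linear(hidden_dim, 1)            # state value

    def forward(self, obs: torch.Tensor):
        features = self.base(obs)
        return self.policy_head(features), self.value_head(features)
```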

Of course, it is not wrong to train them separately, but it can hurt performance. This is exactly what PPG addresses: the authors observed that fully separating the policy and value networks results in poorer performance (they lose the benefit of shared features), while keeping a common base introduces noise from interference between the two objectives and reduces sample efficiency. To get the best of both worlds, they proposed the PPG algorithm, as detailed in the paper linked above.

In PPG, there are two neural networks in total:

- Policy network -> common base + policy head + auxiliary value head
- Value network -> a plain (vanilla) value network

There are two training phases:

- Policy (training) phase -> update the policy network and the value network separately, as in standard PPO
- Auxiliary phase -> update the auxiliary value head with a joint loss + update the value network again (for extra sample efficiency); see the sketch below
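To make the plan concrete, here is a rough PyTorch sketch of the two networks and one auxiliary-phase update (illustrative only, not Pearl code; the policy phase itself would reuse the existing PPO clipped losses, and names such as `auxiliary_phase_step` and `beta_clone` are placeholders from my reading of the paper):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Categorical, kl_divergence


class PPGPolicyNetwork(nn.Module):
    """Policy network: shared base + policy head + auxiliary value head."""

    def __init__(self, obs_dim: int, action_dim: int, hidden_dim: int = 64):
        super().__init__()
        self.base = nn.Sequential(
            nn.Linear(obs_dim, hidden_dim), nn.Tanh(),
            nn.Linear(hidden_dim, hidden_dim), nn.Tanh(),
        )
        self.policy_head = nn.Linear(hidden_dim, action_dim)
        self.aux_value_head = nn.Linear(hidden_dim, 1)

    def forward(self, obs: torch.Tensor):
        features = self.base(obs)
        return self.policy_head(features), self.aux_value_head(features)


class PPGValueNetwork(nn.Module):
    """Separate vanilla value network, updated in both phases."""

    def __init__(self, obs_dim: int, hidden_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden_dim), nn.Tanh(),
            nn.Linear(hidden_dim, hidden_dim), nn.Tanh(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, obs: torch.Tensor):
        return self.net(obs)


def auxiliary_phase_step(
    policy_net: PPGPolicyNetwork,
    value_net: PPGValueNetwork,
    obs: torch.Tensor,         # states collected during the policy phase
    returns: torch.Tensor,     # value targets computed during the policy phase
    old_logits: torch.Tensor,  # policy logits snapshotted before the auxiliary phase
    beta_clone: float = 1.0,
):
    """Compute the auxiliary-phase losses: a joint loss for the policy network
    and an extra regression loss for the separate value network."""
    logits, aux_values = policy_net(obs)

    # Auxiliary value loss: distill the value targets into the auxiliary head
    # (this is what trains the shared base with value information).
    aux_value_loss = F.mse_loss(aux_values.squeeze(-1), returns)

    # Behavioral-cloning term: keep the policy close to its pre-phase snapshot.
    kl = kl_divergence(
        Categorical(logits=old_logits), Categorical(logits=logits)
    ).mean()
    joint_loss = aux_value_loss + beta_clone * kl

    # The separate value network is trained again on the same targets,
    # which is where the additional sample reuse comes from.
    value_loss = F.mse_loss(value_net(obs).squeeze(-1), returns)

    return joint_loss, value_loss
```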

I just wanted to give you a summary of how I am planning to implement PPG, since it is essentially PPO with some additional losses and training updates.

I would be delighted to hear any thoughts or suggestions you might have.

yiwan-rl commented 2 weeks ago

Thanks for the explanation. To implement this idea, you could write a new history summarization module that implements the shared base network, similar to this LSTM module: https://github.com/facebookresearch/Pearl/blob/main/pearl/history_summarization_modules/lstm_history_summarization_module.py. The history summarization module's goal is to produce a vector, based on the past history, that represents the agent's current state and serves as the input to both the actor and the critic. I think that module is the best place to implement the shared base.
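Roughly, the shared base would just be the network wrapped by such a module, along these lines (an illustrative PyTorch sketch only, not actual Pearl code; the Pearl-specific interface methods are omitted and should be taken from the linked file):

```python
import torch
import torch.nn as nn


class SharedBaseSummarizer(nn.Module):
    """Sketch of the shared base as a state-representation network.

    The idea: this module produces the feature vector that both the actor
    and the critic consume, so the "common base" lives here rather than
    inside either network. To plug it into Pearl it would need to expose
    the same interface as the linked LSTM history summarization module;
    those methods are intentionally omitted here.
    """

    def __init__(self, obs_dim: int, hidden_dim: int = 128):
        super().__init__()
        self.base = nn.Sequential(
            nn.Linear(obs_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
        )

    def forward(self, observation: torch.Tensor) -> torch.Tensor:
        # The output is the shared representation fed to both heads.
        return self.base(observation)
```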

cvnad1 commented 2 weeks ago

@yiwan-rl I will check the module and get back to you if I have any doubts.