Description
This PR implements Asynchronous Proximal Policy Optimization (APPO) for the Petting Zoo Pong environment. The objective is to achieve performance levels comparable to those reported in the literature.
Related Issue
Closes #150
Changes Made
[x] Implement APPO sample producers for self-training, human demonstrations, and human feedback
[x] Add a model registry to `SampleProducerSession` to compute the values and log-likelihoods for a rollout
[x] Add a rollout buffer as well as a replay buffer for APPO
[x] Add a Jupyter notebook, paired with the Python scripts, for running the Cog-Verse framework on SageMaker (i.e., in the cloud)
[x] Refactor `update_parameters` and `compute_gae`
[x] Refactor `HumanReplayBuffer` to handle adding multiple observations at once
[x] Add type hints to make the code easier to navigate and check in VS Code
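For reviewers, the `compute_gae` refactor follows the standard Generalized Advantage Estimation recursion. The sketch below is illustrative only (the function signature and array layout are assumptions, not the exact code in this PR):

```python
import numpy as np

def compute_gae(rewards, values, dones, last_value, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one rollout.

    rewards, values, dones: sequences of length T
    last_value: bootstrap value V(s_T) for the state after the rollout
    """
    T = len(rewards)
    advantages = np.zeros(T, dtype=np.float64)
    gae = 0.0
    # Walk the rollout backwards, accumulating the discounted TD errors.
    for t in reversed(range(T)):
        next_value = last_value if t == T - 1 else values[t + 1]
        next_non_terminal = 1.0 - float(dones[t])
        delta = rewards[t] + gamma * next_value * next_non_terminal - values[t]
        gae = delta + gamma * lam * next_non_terminal * gae
        advantages[t] = gae
    returns = advantages + np.asarray(values, dtype=np.float64)
    return advantages, returns
```

With `gamma = lam = 1` and zero value estimates this reduces to reward-to-go, which makes it easy to sanity-check in a unit test.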
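The `HumanReplayBuffer` change amounts to accepting a batch of observations in one call instead of one at a time. A minimal sketch of that pattern, with hypothetical names (not the actual `HumanReplayBuffer` API):

```python
from collections import deque
from typing import Any, Iterable


class ReplayBufferSketch:
    """Fixed-capacity buffer that accepts single or batched additions.

    Illustrative only: class and method names are assumptions.
    """

    def __init__(self, capacity: int) -> None:
        # deque with maxlen evicts the oldest entries automatically.
        self._storage: deque = deque(maxlen=capacity)

    def add(self, transition: Any) -> None:
        self._storage.append(transition)

    def add_many(self, transitions: Iterable[Any]) -> None:
        # Batch insertion: equivalent to calling add() per item,
        # but done in one pass.
        self._storage.extend(transitions)

    def __len__(self) -> int:
        return len(self._storage)
```

Backing the buffer with a `maxlen`-bounded `deque` keeps eviction implicit, so the batched path needs no extra capacity bookkeeping.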
Additional Notes
APPO's performance on Pong (average reward ~2000) is slightly better than the previous PPO version, but it still falls short of our target (~4000).
I decided to merge this branch into main even though the performance issue is not resolved, because we'll need all of these features ready for AAMAS 2023 on May 29.