Description
This PR implements Asynchronous Proximal Policy Optimization (APPO) for the Petting Zoo Pong environment. The objective is to achieve performance levels comparable to those reported in the literature.
Related Issue
Closes #150
Changes Made
[x] Implement APPO sample producers for self-training, human demonstrations, and human feedback
[x] Add a model registry to `SampleProducerSession` to compute the values and log-likelihoods for a rollout
[x] Add a rollout buffer as well as a replay buffer for APPO
[x] Add a Jupyter notebook, paired with the Python scripts, for running the Cog-Verse framework on SageMaker (i.e., in the cloud)
[x] Refactor `update_parameters` and `compute_gae`
[x] Refactor `HumanReplayBuffer` to handle adding multiple observations at once
[x] Add type hints to make the code easier to navigate and check in VS Code
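For reviewers, the `compute_gae` refactor follows the standard Generalized Advantage Estimation recursion. The sketch below is illustrative only (the function signature and array layout are assumptions, not the exact code in this PR):

```python
import numpy as np

def compute_gae(rewards, values, dones, last_value, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one rollout.

    rewards, values, dones: sequences of length T
    last_value: bootstrap value V(s_T) for the state after the rollout
    """
    T = len(rewards)
    advantages = np.zeros(T, dtype=np.float64)
    gae = 0.0
    # Walk the rollout backwards, accumulating the discounted TD errors.
    for t in reversed(range(T)):
        next_value = last_value if t == T - 1 else values[t + 1]
        next_non_terminal = 1.0 - float(dones[t])
        delta = rewards[t] + gamma * next_value * next_non_terminal - values[t]
        gae = delta + gamma * lam * next_non_terminal * gae
        advantages[t] = gae
    returns = advantages + np.asarray(values, dtype=np.float64)
    return advantages, returns
```

With `gamma = lam = 1` and zero value estimates this reduces to reward-to-go, which makes it easy to sanity-check in a unit test.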
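The `HumanReplayBuffer` change amounts to accepting a batch of observations in one call instead of one at a time. A minimal sketch of that pattern, with hypothetical names (not the actual `HumanReplayBuffer` API):

```python
from collections import deque
from typing import Any, Iterable


class ReplayBufferSketch:
    """Fixed-capacity buffer that accepts single or batched additions.

    Illustrative only: class and method names are assumptions.
    """

    def __init__(self, capacity: int) -> None:
        # deque with maxlen evicts the oldest entries automatically.
        self._storage: deque = deque(maxlen=capacity)

    def add(self, transition: Any) -> None:
        self._storage.append(transition)

    def add_many(self, transitions: Iterable[Any]) -> None:
        # Batch insertion: equivalent to calling add() per item,
        # but done in one pass.
        self._storage.extend(transitions)

    def __len__(self) -> int:
        return len(self._storage)
```

Backing the buffer with a `maxlen`-bounded `deque` keeps eviction implicit, so the batched path needs no extra capacity bookkeeping.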
Additional Notes
APPO's performance on Pong (average reward ~2000) is slightly better than the previous PPO version, but it still falls short of our target (~4000).
I decided to merge this branch into main even though the performance issue is not resolved, because we'll need all of these features ready for AAMAS 2023 on May 29.