Make selfplay pong work

cogment / cogment-verse

Research platform for Human-in-the-loop learning (HILL) & Multi-Agent Reinforcement Learning (MARL)

https://cogment.ai/cogment_verse

Apache License 2.0


Make selfplay pong work #150

Closed cloderic closed 1 year ago

cloderic commented 1 year ago

Goal: Train an AI agent using PPO with a replay buffer to achieve good performance on the Pong game in the PettingZoo environment.

Acceptance Criteria

[ ] Trained AI agent is able to play the Pong game effectively, achieving an average score higher than 3000 steps
[ ] PPO algorithm is successfully integrated into the cogment-verse training pipeline
[ ] Write a detailed README file that includes instructions on how to run the code and how to reproduce the results

lhnguyen102 commented 1 year ago

@cloderic I don't think the following code in the sample producer represents the terminal condition of the environment, i.e. done:

done = sample.trial_state == cogment.TrialState.ENDED

The environment and sample producer can run asynchronously, potentially ending trials before all samples are collected. In an experiment with Pong, I manually set the terminal condition so that the environment ends after 11 steps, but because of that line the sample producer reported that the environment had ended after only 3 steps. This caused incorrect results, because PPO with a replay buffer relies on the environment's terminal condition in its computations. What do you think?
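To illustrate why a premature done flag matters, here is a minimal sketch of advantage estimation with a done mask. It assumes the PPO implementation uses generalized advantage estimation (GAE), which this thread does not state. The function name compute_gae, the reward and value numbers, and the position of the faulty flag are hypothetical and are not taken from cogment-verse.

```python
from typing import List


def compute_gae(
    rewards: List[float],
    values: List[float],
    dones: List[bool],
    last_value: float,
    gamma: float = 0.99,
    lam: float = 0.95,
) -> List[float]:
    """Generalized advantage estimation over one rollout.

    dones[t] is True when the environment terminated after step t, which
    cuts both the value bootstrap and the advantage accumulation there.
    """
    advantages = [0.0] * len(rewards)
    next_value = last_value
    next_advantage = 0.0
    for t in reversed(range(len(rewards))):
        not_done = 0.0 if dones[t] else 1.0
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        next_advantage = delta + gamma * lam * not_done * next_advantage
        advantages[t] = next_advantage
        next_value = values[t]
    return advantages


# An 11-step episode with reward 1 per step and a zero value estimate.
rewards = [1.0] * 11
values = [0.0] * 11
correct_dones = [False] * 10 + [True]
# Hypothetical faulty flags: done is also reported at the third step.
early_dones = [False, False, True] + [False] * 7 + [True]

correct = compute_gae(rewards, values, correct_dones, 0.0, gamma=1.0, lam=1.0)
early = compute_gae(rewards, values, early_dones, 0.0, gamma=1.0, lam=1.0)
print(correct[0], early[0])  # 11.0 3.0
```

With the early flag, the first step is credited with only 3 of the 11 rewards, so every quantity derived from the advantages would be skewed in the same way.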

cloderic commented 1 year ago

This will need investigation

potentially ending trials before all samples are collected.

This shouldn't happen; if it does, it's a bug in the environment, the orchestrator, or the datastore.

Just so that I understand, do you get the expected behavior if the trial ends "on its own" (basically if the environment ends it)?

lhnguyen102 commented 1 year ago

Yes, I do. I manually set the final observation to zeros when the environment ends. However, the sample producer did not receive this final observation when done was true. FYI: I ran only a single trial for this test in order to debug the issue. I will investigate this issue further.
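The expectation described above can be sketched as follows: when the environment terminates, the last transition should exist, carry done set to true, and have an all-zero next observation. This is a standalone illustration only. The helper make_transitions and the Transition tuple layout are hypothetical and do not reflect the cogment-verse sample producer API.

```python
from typing import List, Tuple

Observation = List[float]
# (observation, next_observation, done)
Transition = Tuple[Observation, Observation, bool]


def make_transitions(
    observations: List[Observation], terminated: bool
) -> List[Transition]:
    """Pair each observation with its successor.

    If the environment terminated, the last observation gets a zero-filled
    successor and done=True. Otherwise the last observation is only used as
    the successor of the previous one.
    """
    transitions: List[Transition] = []
    for t in range(len(observations) - 1):
        transitions.append((observations[t], observations[t + 1], False))
    if terminated and observations:
        last = observations[-1]
        transitions.append((last, [0.0] * len(last), True))
    return transitions


steps = make_transitions([[1.0, 2.0], [3.0, 4.0]], terminated=True)
# The terminal transition is present and carries the zero observation.
assert steps[-1] == ([3.0, 4.0], [0.0, 0.0], True)
```

A check like the final assertion, run on the samples of a single trial, would flag the case reported here, where done is true but the zero-filled final observation is missing.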

cloderic commented 1 year ago

The way it should work is described here -> https://cogment.ai/docs/guide/development-guide#trial-end

LailaElMoujtahid commented 1 year ago

@ha --> performance not good yet + coordinate with @wduguay-air