HumanCompatibleAI / overcooked_ai

A benchmark environment for fully cooperative human-AI performance.
https://arxiv.org/abs/1910.05789
MIT License

Understanding Training and Self-play agents #128

Closed nil123532 closed 1 year ago

nil123532 commented 1 year ago

Hello,

Firstly, thank you for providing such a comprehensive GitHub repository on multi-agent RL. I'm new to the field of Reinforcement Learning and had some questions regarding the project:

In the human_aware_rl/ppo directory, it appears that a PPO agent is trained alongside a pre-trained Behavioral Cloning (BC) agent. Could you provide some guidance on how to modify this setup to train two PPO agents together, similar to the approach taken in PantheonRL?

The human_aware_rl/imitation directory suggests that a BC agent is trained using previously collected human data. Could you confirm this?

I'm particularly interested in understanding which of these setups qualifies as self-play. My assumption is that the first case might be considered self-play, but given that one agent is a BC agent, I'm not sure whether this meets the traditional definition of self-play, such as the approach used in PantheonRL, where you can train a PPO ego agent and a PPO alt agent with stable-baselines3.

Thank you for your time; I look forward to your response.

Best regards

micahcarroll commented 1 year ago

Hi! Thanks for reaching out!

This file runs all the experiments in true self-play, and this one with PPO-BC. I believe the Python file automatically figures out which type of training run you're trying to do based on the arguments you're passing in; you can verify this yourself by following the execution starting from this file.

Training with BC is called BC-play, PPO-BC, or human-aware RL (as in our paper). I'd recommend reading it for more intuition about the various setups! You're right that none of these count as self-play.
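
If it helps to make the distinction concrete, here's a rough sketch of a PPO-BC rollout (made-up names, not this repo's actual training code): the BC partner is frozen, and only the PPO agent's experience is used for gradient updates.

```python
# Sketch of collecting one PPO-BC episode. `env`, `ppo_policy`, and `bc_policy`
# are placeholders for illustration, not objects from this repo.

def collect_ppo_bc_episode(env, ppo_policy, bc_policy, horizon=400):
    """PPO agent (player 0) is paired with a frozen BC partner (player 1).
    Only player 0's transitions are kept for the PPO update."""
    obs0, obs1 = env.reset()                  # one observation per player
    ppo_transitions = []
    for _ in range(horizon):
        a0 = ppo_policy.sample_action(obs0)   # the agent being trained
        a1 = bc_policy.sample_action(obs1)    # frozen imitation of human play
        (next_obs0, next_obs1), reward, done, _ = env.step((a0, a1))
        ppo_transitions.append((obs0, a0, reward, done))  # only the PPO seat's data
        obs0, obs1 = next_obs0, next_obs1
        if done:
            break
    return ppo_transitions  # fed into the PPO loss; bc_policy is never updated
```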

Yes, the BC agents in the imitation directory by default use the human gameplay data that we collected. Again, I encourage you to double-check the code to see exactly how things are done.
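
For intuition, the BC training itself is just supervised learning on the recorded human (state, action) pairs. A minimal sketch (the function and variable names below are made up, not the actual human_aware_rl API):

```python
# Rough sketch of behavioral cloning on human gameplay data.
import tensorflow as tf

def train_bc_policy(observations, human_actions, num_actions, epochs=20):
    """Fit a small network to predict the human's action from the featurized state."""
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(num_actions),  # logits over the discrete action set
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    )
    # observations: (N, obs_dim) featurized states from the human dataset
    # human_actions: (N,) integer action indices the humans actually took
    model.fit(observations, human_actions, epochs=epochs, batch_size=64)
    return model
```

The code in the imitation directory handles data loading, featurization, and evaluation on top of that, but this is the core idea.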

Let me know if you have any other questions!

nil123532 commented 1 year ago

Thank you for the quick and detailed reply; it clarified many aspects for me. I've gone through the scripts and observed how the sacred library manages experiment settings based on passed arguments, which is quite impressive.

I have a couple more questions I'd like to explore:

In the context of PPO self-play, are both agents being trained, or just one? In other words, does each agent have its own distinct policy, or is there one unified policy that both agents follow? My understanding is that it's a unified policy, right?

If I've understood correctly, the agents are initialized in the constructor, and the joint_action variable in the step method is used to step through the environment, receiving rewards in return. Could you confirm whether my understanding is accurate?

I'm thoroughly impressed by your work and eager to understand it more deeply.

Thank you again for taking the time to assist me.

micahcarroll commented 1 year ago

For self-play there is only one neural network which is used to parameterize both agents' policies. That is, effectively there is only one policy (both agents behave identically), but the experience of both agents is used to update the common policy.
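
As a rough sketch of what that means in code (made-up names; the repo's actual implementation goes through RLlib):

```python
# Sketch of self-play data collection: one policy object controls both players,
# and transitions from both seats go into the same batch that updates it.
# `env` and `policy` are placeholders for illustration.

def collect_self_play_episode(env, policy, horizon=400):
    obs0, obs1 = env.reset()
    batch = []
    for _ in range(horizon):
        a0 = policy.sample_action(obs0)   # the same network...
        a1 = policy.sample_action(obs1)   # ...picks actions for both players
        (next_obs0, next_obs1), reward, done, _ = env.step((a0, a1))
        batch.append((obs0, a0, reward, done))   # both seats' experience is pooled
        batch.append((obs1, a1, reward, done))
        obs0, obs1 = next_obs0, next_obs1
        if done:
            break
    return batch  # used to update the single shared set of weights
```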

> If I've understood correctly, the agents are initialized in the constructor, and the joint_action variable in the step method is used to step through the environment, receiving rewards in return. Could you confirm whether my understanding is accurate?

Yeah, I believe that's right (it's been a while since I've looked at the code closely).
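
For what it's worth, the interaction loop you're describing is roughly this (a generic sketch; the exact class and method names in overcooked_ai_py differ, so do check the code):

```python
# Generic sketch of stepping a two-player env with a joint action.
# `env`, `agent0`, and `agent1` are placeholders for illustration.

def run_episode(env, agent0, agent1):
    """Each timestep, build the joint action from both agents and step the env."""
    state = env.reset()
    done = False
    total_reward = 0
    while not done:
        a0 = agent0.action(state)   # each agent chooses based on the current state
        a1 = agent1.action(state)
        state, reward, done, info = env.step((a0, a1))  # env returns the shared reward
        total_reward += reward
    return total_reward
```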

nil123532 commented 1 year ago

Ahah! Thank you so much!