FLAIROx / JaxMARL

Multi-Agent Reinforcement Learning with JAX
Apache License 2.0

Baselines for STORM (PPO working with individual rewards instead of team rewards) #74

Closed: Ueshima73 closed this issue 4 months ago

Ueshima73 commented 5 months ago

Hello. Thank you for the incredible work! I am wondering if you have any plans to add baselines for the STORM environment. I am trying to code a simple iterated PD game with IPPO first, but I am having difficulties doing so in the JAX style, partly because STORM differs from the environments that already have baselines: its agents need individual rewards rather than a team reward.

I would greatly appreciate any insights on how to get started. Thank you so much.

alexunderch commented 5 months ago

Do you want a baseline for a STORM game with a PD payoff matrix, trained with IPPO? Did I understand you correctly?

Aidandos commented 5 months ago

Not sure this addresses your issue, but check out Pax: Scalable Opponent Shaping in Jax, from which the STORM environments were adapted. We have the IPD there, on which you can train independent PPO. Just a heads up: it's not nearly as cleanly documented as JaxMARL, so let me know if you run into any problems.

Ueshima73 commented 5 months ago

Hi @alexunderch, yes, that's correct. Though I plan to extend the (hopefully) added baseline to more complicated games, I would really appreciate seeing an example of the simplest one first.

Hi @Aidandos, yes! I actually found your amazing work today! I just thought it would be great to see a JaxMARL version, since I lack the coding skills to navigate a new codebase, but I will certainly check it out further. Thanks also for the context about its development.

alexunderch commented 5 months ago

I can try to help you shortly.

Ueshima73 commented 5 months ago

Hi. I must confess that I am still having difficulties completing this. My main issue turns out to be getting PPO to work with individual rewards instead of the team reward in JaxMARL.

I would greatly appreciate a hint for writing a PPO that works with individual rewards, regardless of the environment; for example, a pseudo-script based on JaxMARL_Walkthrough.ipynb would be ideal.

alexunderch commented 4 months ago

Hey! I don't maintain the library, so I can't always find time for things, sorry!

Let's get started. I prepared a starter Colab with an IPPO agent in which all agents are controlled by one set of parameters, using a CNN to extract features. Here is the link: link. You can use it as your pseudocode.
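
For anyone reading along without the notebook, here is a minimal sketch of that kind of setup, assuming Flax and distrax (the module name, layer sizes, and observation shape are illustrative, not taken from the Colab): one set of parameters shared by all agents, with a small CNN trunk feeding actor and critic heads.

```python
import jax.numpy as jnp
import flax.linen as nn
import distrax


class SharedActorCritic(nn.Module):
    """One set of parameters for every agent: each agent's grid
    observation goes through the same CNN trunk, then actor/critic heads."""
    num_actions: int

    @nn.compact
    def __call__(self, obs):  # obs: (batch, H, W, C)
        x = nn.relu(nn.Conv(features=32, kernel_size=(3, 3))(obs))
        x = nn.relu(nn.Conv(features=32, kernel_size=(3, 3))(x))
        x = x.reshape((x.shape[0], -1))          # flatten spatial dims
        x = nn.relu(nn.Dense(128)(x))
        logits = nn.Dense(self.num_actions)(x)   # actor head
        value = nn.Dense(1)(x)                   # critic head
        return distrax.Categorical(logits=logits), jnp.squeeze(value, -1)
```

Because the parameters are shared, all agents' observations can be stacked along the batch axis and pushed through the network in one call.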

Next, I plan to incorporate the inventory into feature extraction and give each agent its own PPO; see the sketch below for one way to do that. Comment or ask if you run into any difficulties working through it!
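
For the per-agent step, one common JAX pattern, sketched here under the assumption that every agent uses the same architecture as the SharedActorCritic above, is to vmap the parameter initialisation and forward pass over a leading agent axis:

```python
import jax
import jax.numpy as jnp

# Assumes the SharedActorCritic module sketched above; all sizes illustrative.
num_agents, num_actions = 2, 5
network = SharedActorCritic(num_actions=num_actions)
dummy_obs = jnp.zeros((1, 5, 5, 3))  # (batch, H, W, C)

# One parameter pytree per agent, stacked along a leading agent axis.
rngs = jax.random.split(jax.random.PRNGKey(0), num_agents)
per_agent_params = jax.vmap(lambda rng: network.init(rng, dummy_obs))(rngs)

def forward(params, obs):
    pi, value = network.apply(params, obs)
    return pi.logits, value

# Apply each agent's own parameters to its own observation batch.
obs_per_agent = jnp.zeros((num_agents, 1, 5, 5, 3))  # (agents, batch, H, W, C)
logits, values = jax.vmap(forward)(per_agent_params, obs_per_agent)
```

Each agent then holds its own parameter pytree (and, analogously, its own optimiser state), while the computation stays fully vectorised.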

alexunderch commented 4 months ago

I think that all rewards in the environment are individual: each agent is rewarded in pairwise interactions specified by a payoff_matrix.
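
For reference, JaxMARL environments return rewards as a dict keyed by agent name, so individual rewards are available at every step. A simplified version of the batchify-style helper the baselines use to stack such dicts into arrays (the exact signature in the repo differs slightly):

```python
import jax.numpy as jnp

# JaxMARL steps return per-agent dicts, e.g.
# rewards = {"agent_0": r0, "agent_1": r1}.
def batchify(x: dict, agent_list: list) -> jnp.ndarray:
    """Stack per-agent entries into one (num_agents, ...) array."""
    return jnp.stack([x[a] for a in agent_list])

agents = ["agent_0", "agent_1"]
rewards = {"agent_0": jnp.array(1.0), "agent_1": jnp.array(-1.0)}
reward_batch = batchify(rewards, agents)  # shape (2,): one reward per agent
```

Each agent keeps its own row, so the downstream advantages and PPO loss are computed from that agent's own rewards; nothing is summed into a team reward.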

Ueshima73 commented 4 months ago

Hi @alexunderch, thank you for your response again! I really appreciate the introductory Colab notebook you shared. Please let me play around with it starting today. I will close this issue for now and might ask you a question later.

Note: I misunderstood JaxMARL_Walkthrough.ipynb regarding how IPPO works. I somehow thought the network was updated based on team rewards (i.e., the episode return), but this is not the case. I just wanted to clarify that I was confused in the comments above.
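
To make that concrete: IPPO computes each agent's advantage from that agent's own reward stream, not from a summed team return. A minimal per-agent GAE sketch (shapes and hyperparameters illustrative, not code from the walkthrough):

```python
import jax
import jax.numpy as jnp

def gae(rewards, values, dones, last_value, gamma=0.99, lam=0.95):
    """GAE over a single agent's trajectory; rewards/values/dones: (T,)."""
    def step(carry, xs):
        adv, next_value = carry
        reward, value, done = xs
        delta = reward + gamma * next_value * (1.0 - done) - value
        adv = delta + gamma * lam * (1.0 - done) * adv
        return (adv, value), adv

    _, advantages = jax.lax.scan(
        step,
        (jnp.zeros_like(last_value), last_value),  # bootstrap from last value
        (rewards, values, dones),
        reverse=True,  # GAE recursion runs backwards in time
    )
    return advantages

# Individual rewards: vmap over the leading agent axis so every agent's
# advantage comes from its own reward stream.
# rewards, values, dones: (num_agents, T); last_values: (num_agents,)
per_agent_gae = jax.vmap(gae)
```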

Thank you again @alexunderch and the team!