Unity-Technologies / ml-agents

The Unity Machine Learning Agents Toolkit (ML-Agents) is an open-source project that enables games and simulations to serve as environments for training intelligent agents using deep reinforcement learning and imitation learning.
https://unity.com/products/machine-learning-agents

A general question on multi-agent reinforcement learning #155

Closed kwea123 closed 6 years ago

kwea123 commented 6 years ago

Hello,

First of all, thanks for your great work on this project. I would like to ask a question about multi-agent reinforcement learning, since one of your environments, Tennis, actually employs this scenario.

From what I understand, in multi-agent scenarios we are trying to find (or approach) a Nash equilibrium (NE) of the game. When we have only two adversarial agents playing against each other, we can model the scenario as a two-player zero-sum game by setting rewards such that when one agent gets a positive reward, the other gets a negative one with the same absolute value. Such games have many proven properties, for example that a mixed-strategy NE always exists and that all NE have the same payoff value.
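
As a concrete reference point, this kind of zero-sum matrix game can be solved exactly with a small linear program. Here is a quick sanity-check sketch (plain numpy/scipy, nothing specific to ML-Agents), shown for standard matching pennies:

```python
import numpy as np
from scipy.optimize import linprog

def solve_zero_sum(A):
    """Row player's optimal mixed strategy and game value for payoff matrix A
    (entries are the row player's payoffs; the column player receives -A)."""
    m, n = A.shape
    # Variables: x_1..x_m (row strategy) and v (game value). Maximize v.
    c = np.concatenate([np.zeros(m), [-1.0]])            # linprog minimizes, so minimize -v
    # For every column j:  v - sum_i A[i, j] * x_i <= 0
    A_ub = np.hstack([-A.T, np.ones((n, 1))])
    b_ub = np.zeros(n)
    A_eq = np.concatenate([np.ones(m), [0.0]]).reshape(1, -1)   # probabilities sum to 1
    b_eq = np.array([1.0])
    bounds = [(0, None)] * m + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:m], res.x[m]

# Standard matching pennies: the row player wins (+1) when the two faces match.
A = np.array([[1.0, -1.0],
              [-1.0, 1.0]])
strategy, value = solve_zero_sum(A)
print(strategy, value)   # ~[0.5, 0.5], value ~0
```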

To find such an NE, we can use self-play (as in AlphaGo), and in the end we get an (approximate) mixed-strategy NE profile and an (also approximate) payoff matrix.

My question is: is my understanding correct that in multi-agent scenarios (and in self-play) we are trying to find each player's strategy in a mixed-strategy NE (which is what the policy network represents), while the value network corresponds to the payoff matrix?

Also, if so, how do we deal with continuous actions? In a discrete action space we can approximate all sorts of distributions, but in a continuous space we usually model the action with a Gaussian distribution, which is very restrictive here. For example, if the mixed NE strategy requires playing "1" with probability 1/2 and "0" with probability 1/2, then as far as I know no standard unimodal distribution can even come close to this.
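
To make the Gaussian limitation concrete, here is a toy check (plain numpy, assuming a simple maximum-likelihood fit): fitting a single Gaussian to actions drawn half from 0 and half from 1 just puts the mean at 0.5, so the policy mostly plays values near 0.5 instead of the two pure actions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Target mixed strategy: play action 0 or action 1, each with probability 1/2.
target_actions = rng.integers(0, 2, size=10_000).astype(float)

# Maximum-likelihood fit of a single Gaussian policy to those actions.
mu, sigma = target_actions.mean(), target_actions.std()
print(mu, sigma)                               # ~0.5 and ~0.5

# Sample from the fitted Gaussian: most of the mass sits between the two pure
# actions rather than on them, so the bimodal NE strategy is not recoverable.
samples = rng.normal(mu, sigma, size=10_000)
print(np.mean(np.abs(samples - 0.5) < 0.25))   # ~0.38, while the target never plays near 0.5
```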

Any feedback is appreciated. Thank you.

kwea123 commented 6 years ago

I did some experiments on simple games (discrete action spaces) using PPO self-play; the results are as follows:

  1. Matching pennies (https://en.wikipedia.org/wiki/Matching_pennies): the mixed-strategy NE is both players playing each face with probability 0.5. [training plot]

  2. Paper-scissors-stone: the mixed-strategy NE is both players playing each move with probability 1/3 ≈ 0.333. [training plot]

However, if I tweak the rewards of the matching pennies game to be as follows:

|       | 0     | 1     |
|-------|-------|-------|
| **0** | 2, -2 | -1, 1 |
| **1** | -1, 1 | 1, -1 |

(only the upper-left reward is modified), then after calculation the mixed-strategy NE should be playing 0 with probability 2/5 and 1 with probability 3/5 for both players. However, using the exact same training algorithm, the strategy still converges to playing each face with probability 1/2, which contradicts my previous conjecture. I need help here...
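
For reference, the 2/5 comes from the usual indifference condition; here is a quick check using the standard closed form for 2x2 zero-sum games (plain numpy, just arithmetic):

```python
import numpy as np

# Row player's payoffs for the tweaked game (the column player gets the negation).
A = np.array([[2.0, -1.0],
              [-1.0, 1.0]])
(a, b), (c, d) = A

# Closed form for a 2x2 zero-sum game with no saddle point (the case here):
# each player mixes so that the opponent is indifferent between their two actions.
denom = a - b - c + d
p_row0 = (d - c) / denom   # row player's probability of playing 0
q_col0 = (d - b) / denom   # column player's probability of playing 0
value = (a * d - b * c) / denom

print(p_row0, q_col0, value)   # 0.4, 0.4, 0.2 -> play 0 with prob 2/5, 1 with prob 3/5
```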

vincentpierre commented 6 years ago

Can you look at the rewards? Are they as high as you expect? Can you do better than the trained agent if you implement a heuristic brain?
Are you using one or two external brains? If both agents share the same brain, I would first try a symmetric reward table for the matching pennies.
Since this game has a mixed-strategy solution, I think it would not necessarily converge to the optimum, because the agents have multiple attempts. It is not a one-shot game in this setting, but I have not looked into this game in detail. Your stuff looks amazing!

kwea123 commented 6 years ago

Hi, I found that PPO doesn't work at all here; the plots I got above were pure coincidence. I did some research and tried out the simplest solution, which is to directly maximize the expected reward.

I use a single neural network shared by the agents, which takes the state as input and outputs the probabilities of the (discrete) actions. Then I feed in all action combinations and the corresponding rewards (i.e. the payoff matrix). Each agent (for now there are only two) adjusts its strategy according to the other's and updates the neural network, roughly as in the sketch below.
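
This is a simplified version of the idea with plain softmax logits instead of the full network, and the names are made up; it is not the exact notebook code, just the structure: each player takes a small gradient step on its own expected payoff while the other's strategy is held fixed.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Row player's payoffs; the column player receives -A (zero-sum).
A = np.array([[2.0, -1.0],
              [-1.0, 1.0]])

theta1, theta2 = np.zeros(2), np.zeros(2)  # logits of each player's mixed strategy
lr = 0.05

for step in range(2000):
    p, q = softmax(theta1), softmax(theta2)
    u1 = A @ q          # player 1's expected payoff for each pure action
    u2 = -A.T @ p       # player 2's expected payoff for each pure action
    # gradient of the expected payoff p @ A @ q with respect to the logits
    theta1 = theta1 + lr * p * (u1 - p @ u1)
    theta2 = theta2 + lr * q * (u2 - q @ u2)
    if step % 500 == 0:
        print(step, softmax(theta1), softmax(theta2))

# The strategies circle around the mixed NE (about [0.4, 0.6] for this payoff
# table) rather than landing on it; as noted further down, the learning rate
# and the update schedule decide whether the oscillation actually damps out.
```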

I proved mathematically that this converges to the NE strategy in an e^(-x)·cos(x) fashion (a damped oscillation), and after trying it on several different games, including matching pennies, I found it works quite well. The following plot shows training on the "tweaked" matching pennies: [training plot] It converges nicely to the NE, where each player plays face 0 with probability 2/5 = 0.4, as I stated previously.

In games where each player has more than two choices this also works in theory, although so far I've only tried the normal paper-scissors-stone game. The following plots show the training process: [training plots]

This also converges in an e^(-x)·cos(x) way, and the optimal value is 1/3.

However, I noticed that it's somewhat difficult to always get convergence in practice, even though it is mathematically proven; I need to tune the learning rate and the number of updates correctly, otherwise the strategy only oscillates around the optimal value and never converges.

Finally, I think the reason PPO doesn't work here is that the game is too small, so the critic network quickly converges to something that is not optimal; and since we encourage exploration by adding an entropy term, we push the probability distribution towards the uniform one (so 1/2, 1/2 in two-choice games). The quick convergence of the critic makes the advantages small, the entropy-driven exploration keeps the ratio pi/old_pi small (close to 1), and the strategy not moving further accelerates the convergence of the critic net. As a result it can only learn the equal-probability strategy (every choice with the same probability) and believes its reward is already the best. That's why, even after tweaking the rewards, it never moves away from the 1/2, 1/2 distribution.
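
To illustrate the entropy point (this is only the entropy-regularized bandit view of the update, not actual PPO, and the coefficient below is made up): with an entropy bonus beta, the best response to a fixed set of advantages is softmax(advantages / beta), and as the advantages shrink this collapses towards the uniform distribution.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

beta = 0.01  # hypothetical entropy-bonus coefficient, just for illustration

# With an entropy bonus, the maximizer of  p @ adv + beta * H(p)  over a
# 2-action mixed strategy is softmax(adv / beta).
for gap in [1.0, 0.1, 0.01, 0.001]:
    advantages = np.array([gap, -gap]) / 2
    print(gap, softmax(advantages / beta))

# As the critic converges and the advantage gap shrinks towards zero, the
# entropy term dominates and the policy gets pulled back towards [0.5, 0.5],
# regardless of what the actual mixed-strategy NE is.
```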

I wonder what your opinion is, and whether you have any suggestions for papers I could read. In the meantime, I want to try my method on a tweaked paper-scissors-stone game to see if it still converges, and maybe add a third agent, and so on... I will keep you updated if I make any progress.

Here's my ipynb on this little project: https://github.com/kwea123/RL/blob/master/ai/unity_test/matching_pennies/two_player_two_option_games.ipynb

Thanks for reading.

kwea123 commented 6 years ago

Tests on 2-player, 3-choice games seem to work well: https://github.com/kwea123/RL/blob/master/ai/unity_test/paper_scissor_stone/pss_game.ipynb

However, I noticed that this method fails on Shapley's game; mathematically it won't converge there... but it works well on two-player zero-sum games.

I found a list of multi-agent RL papers here: https://github.com/LantaoYu/MARL-Papers

lock[bot] commented 4 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.