Swynfel / ceramic

Environment to play variations of the board game Azul, usable for Reinforcement Learning research in C++ and Python
GNU General Public License v3.0

Clarify how to measure the performance of an agent in a 1 v 1 #1

Open RemiFabre opened 8 months ago

RemiFabre commented 8 months ago

Hi! First of all, thank you for your work; this is great stuff.

I'd like to compare the performance of 2 agents in a series of 1 v 1 games, for example mc vs fl. I'd expect this command would do it:

./ceramic-arena fl mc -g 5

But it generates this:

Mode: All
Played 3/3 (15/15)    
Games per group:  5
Games per player: 30
Total time: 1.7383e+07 µs (real), 1.3906e+08 (times thread count)
Time: 2.581e+06 µs (game), 2.674e+04 µs (step), 1.541e+00 µs (state change)
Average moves per game: 96.5

              player | winrate |  avg  |  std  | move time |moves
---------------------+---------+-------+-------+-----------+-----
         first-legal |   3.33% |  17.2 |  12.1 | 3.803e+00 | 25.7
mc-1000-h(0.2,0.2,0) |  46.67% |  33.6 |  12.9 | 5.710e+04 | 22.6

What are "groups" ? I'd expect only 5 games to be played per player, not 30. And why do the winrates not sum to 100%?

Clearly I did not understand something about the call options; please clarify whether there is a way to perform duels. Best,

Swynfel commented 8 months ago

Hello @RemiFabre! Thank you for showing interest in this project

To start - something I forgot to add to the readme - this was presented at a workshop, and the corresponding paper might offer some extra explanations: http://id.nii.ac.jp/1001/00207567/

To answer your question directly, the "default rules" expect a 4-player game. As only two players were given, the arena has to create groups of 4, and the mode "All" will create all possible groups (well, except the ones where every seat is the same player), namely (fl, mc, mc, mc), (fl, fl, mc, mc) and (fl, fl, fl, mc).

The option -g 5 means all groups will play 5 games (thus the Games per group: 5), and each player will therefore play 1×5 + 2×5 + 3×5 = 30 games (thus the Games per player: 30)
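
If it helps to see where these numbers come from, here is a small plain-Python sketch (it does not use the ceramic API) that reproduces the counting, assuming groups are unordered:

from itertools import combinations_with_replacement

players = ["fl", "mc"]
games_per_group = 5

# all unordered groups of 4 seats drawn from the two players,
# minus the two groups made of a single player
groups = [g for g in combinations_with_replacement(players, 4) if len(set(g)) > 1]
print(len(groups))  # 3 groups

for p in players:
    # seat appearances across groups, times games per group
    print(p, sum(g.count(p) for g in groups) * games_per_group)  # 30 for each player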

Since games are 4-player, the average win-rate should be 25% (looking at the sum in this context shouldn't give 100%)

Lastly, to perform duels the only thing really needed is to set the number of players in the rules to 2, and for that... I don't think I put an option for it in the ceramic-arena script... I will try to add one. It should also be pretty easy to do in Python: you can use the test tests/arena/test_arena.py as inspiration, but I will give you a working example as soon as I get this working again (I hope I didn't forget everything).

Swynfel commented 8 months ago

So, that felt like a weird time jump. I want to do some clean-up now (such as moving to a pyproject.toml module definition) before putting the rest of the project up to date (such as adding tests for the latest versions of Python).

In any case, I added a quick fix to handle duels in the branch arena-improvements (I will merge it into the main branch after doing the clean-up). If you want to run duels / set the number of players to two, you can use -p 2, so for your example you can do:

./ceramic-arena fl mc -g 5 -p 2

Which, on my computer, gave me

Mode: All
Played 1/1 (5/5)    
Games per group:  5
Games per player: 5
Total time: 4.6175e+06 µs (real), 3.6940e+07 (times thread count)
Time: 9.233e+05 µs (game), 1.631e+04 µs (step), 1.795e+00 µs (state change)
Average moves per game: 56.6

              player | winrate |  avg  |  std  | move time |moves
---------------------+---------+-------+-------+-----------+-----
         first-legal |  20.00% |  29.6 |   8.5 | 5.319e+00 | 28.8
mc-1000-h(0.2,0.2,0) |  80.00% |  43.0 |  15.6 | 3.320e+04 | 27.8

I mentioned in the message above that it is also possible to do the same in Python; here is the code (it should work in both branches):

from ceramic.players import MonteCarloPlayer, RandomPlayer
from ceramic.arena import AllArena, PairsArena
from ceramic.rules import Rules

# start from the base rules, but with only two players (duel setup)
rules = Rules.BASE
rules.player_count = 2

# a random player against a Monte-Carlo player with 100 rollouts
players = [RandomPlayer(), MonteCarloPlayer(rollouts=100)]

# PairsArena runs head-to-head match-ups between the given players
arena = PairsArena(rules, players)
arena.count = 10  # number of games
arena.run()
arena.print()

Tell me if this answers your question

RemiFabre commented 8 months ago

Amazing! Thanks a lot for your answer and your work. I'll dig into this more seriously soon. Some questions after a fast read of the paper (apologies if I missed the answers in the paper):

RemiFabre commented 8 months ago

Ok so I read the paper, good work!

In the paper you give some elements of the dimension of the game tree:

1) "The number of legal actions decreases throughout the rounds, but can start at 228"
2) "We can see in Table 6 that, on average, a player will have access to 44.7 legal actions, or 31.6 actions if we ignore the ones that place the tiles directly in the floor when placing them on the pyramid is possible. However, agents may have to choose among larger sets. Empirically, mc*-10 000 has more than 5% chance to encounter more than 100 smart legal actions"

If you have more data on this, I'd be interested in reading it. My intuitions are:

  • That in a duel setup the game tree is much smaller
  • That in a round the game tree decreases in size very rapidly with the number of turns played. => Therefore, if we can find heuristics for the opening moves (which I believe we can), maybe it is possible to reach the end of the tree for a round in a reasonable time? And if not, these heuristics could help prioritize the paths to explore first. Eventually the heuristics could be learned and not fed by an expert. Nothing new in this idea, but I'd be interested in your opinion about its feasibility in this use case.

Also here is a repo I worked on with some statistics about the game using a modest brute force method: https://github.com/RemiFabre/azul_statistics

RemiFabre commented 8 months ago

Printing the number of legal moves in a game between 2 default MC players:

number of legal actions: 108
number of legal actions: 102
number of legal actions: 77
number of legal actions: 47
number of legal actions: 32
number of legal actions: 18
number of legal actions: 15
number of legal actions: 12
number of legal actions: 10
number of legal actions: 7
number of legal actions: 2
Round tree size: 578676084940800
number of legal actions: 71
number of legal actions: 54
number of legal actions: 50
number of legal actions: 31
number of legal actions: 22
number of legal actions: 21
number of legal actions: 14
number of legal actions: 8
number of legal actions: 6
number of legal actions: 5
number of legal actions: 3
number of legal actions: 1
Round tree size: 27674916192000
number of legal actions: 61
number of legal actions: 50
number of legal actions: 35
number of legal actions: 22
number of legal actions: 14
number of legal actions: 9
number of legal actions: 7
number of legal actions: 4
number of legal actions: 3
number of legal actions: 2
Round tree size: 49713048000
number of legal actions: 41
number of legal actions: 35
number of legal actions: 26
number of legal actions: 19
number of legal actions: 10
number of legal actions: 9
number of legal actions: 7
number of legal actions: 5
number of legal actions: 3
number of legal actions: 2
Round tree size: 13398021000
number of legal actions: 39
number of legal actions: 30
number of legal actions: 21
number of legal actions: 19
number of legal actions: 18
number of legal actions: 17
number of legal actions: 12
number of legal actions: 7
number of legal actions: 6
number of legal actions: 4
number of legal actions: 1
Round tree size: 287985559680
number of legal actions: 35
number of legal actions: 32
number of legal actions: 24
number of legal actions: 18
number of legal actions: 10
number of legal actions: 9
number of legal actions: 6
number of legal actions: 4
number of legal actions: 3
number of legal actions: 1
Round tree size: 3135283200

So it's still a tad big :) For now!
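
(The "Round tree size" printed here is just the product of the per-move legal-action counts within the round; a quick check for the first round:)

from math import prod

# legal-action counts observed during the first round above
first_round = [108, 102, 77, 47, 32, 18, 15, 12, 10, 7, 2]
print(prod(first_round))  # 578676084940800, matching the printed round tree size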

Swynfel commented 8 months ago

It seems you have already answered most of the questions yourself ;), but for my input:

Humans still heavily outperform the best agent coded. This is surprising to me. What approach would you recommend to create a super-human agent?

They outperform the best agent in the paper (Monte-Carlo-based) which, while not that bad, is still less advanced than more recent machine-learning-based ones. Its main advantage is being a good baseline without immediate exploits.

Given the size of the decision tree, are we far from solving an entire round? By this I mean being able to do a deterministic search (minimax + alpha-beta pruning, for example) until the end of the tree, at the scope of 1 round. We would need to give a heuristic function to score the state of a round. Since I play the game a lot and I have theories I'd like to check, I would really love to hand-craft several heuristic functions and see what works and what doesn't.

For a single round in a two-player game (especially the last rounds, where there are fewer possibilities), it might be possible. But as your numbers show, the trees are still quite big. It might be possible to reach a very good solution with smart pruning, either with hand-crafted heuristics or a machine-learning-based one. If the code is clear enough, I hope you'll be able to do that with this package ;)

(There still remains the challenge of having a good heuristic for the end of a round)
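
If you do go the minimax route, a minimal game-agnostic sketch of depth-limited alpha-beta with a pluggable round heuristic could look like the following (the state interface used here, legal_actions / apply / is_round_over, is a hypothetical wrapper around the environment, not the ceramic API):

def alphabeta(state, depth, alpha, beta, maximizing, heuristic):
    # Depth-limited minimax with alpha-beta pruning over a single round.
    # `state` is assumed to expose legal_actions(), apply(action) and
    # is_round_over(); these are placeholders, not ceramic functions.
    if depth == 0 or state.is_round_over():
        return heuristic(state)
    if maximizing:
        value = float("-inf")
        for action in state.legal_actions():
            value = max(value, alphabeta(state.apply(action), depth - 1,
                                         alpha, beta, False, heuristic))
            alpha = max(alpha, value)
            if alpha >= beta:  # cut-off: the minimizing player avoids this branch
                break
        return value
    else:
        value = float("inf")
        for action in state.legal_actions():
            value = min(value, alphabeta(state.apply(action), depth - 1,
                                         alpha, beta, True, heuristic))
            beta = min(beta, value)
            if beta <= alpha:  # cut-off: the maximizing player avoids this branch
                break
        return value

Ordering the actions with the same heuristic before recursing usually makes the pruning much more effective.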

If you have more data on this, I'd be interested in reading it. My intuitions are:

  • That in a duel setup the game tree is much smaller
  • That in a round the game tree decreases in size very rapidly with the number of turns played. => Therefore, if we can find heuristics for the opening moves (which I believe we can), maybe it is possible to reach the end of the tree for a round in a reasonable time? And if not, these heuristics could help prioritize the paths to explore first. Eventually the heuristics could be learned and not fed by an expert. Nothing new in this idea, but I'd be interested in your opinion about its feasibility in this use case.

It seems the numbers showed your intuitions were correct! I believe that introducing a lot of hand-crafted heuristics will likely result in a less-than-optimal policy, but it will surely be better than the baselines currently present. I think that yes, it is feasible to get pretty good results this way, although it's hard to tell now how it would compare to humans. If you're still going for minimax with alpha-beta pruning, solving the end of a round according to a heuristic (and solving the last round perfectly) could give it an edge.

RemiFabre commented 8 months ago

They outperform the best agent in the paper (Monte-Carlo-based) which, while not that bad, is still less advanced than more recent machine-learning-based ones. Its main advantage is being a good baseline without immediate exploits.

If I may ask, what would be your approach to try to make the best agent possible using the tools we have today?

Swynfel commented 8 months ago

Since we have access to an environment in which to run simulations I would, in the case of 1 vs 1:

I'm of course not sure how well this would work, but I expect it to be a simple, straightforward way of obtaining a very good agent.

In the case of 3 or 4 players, the path is less clear to me. The above strategy could work too by assuming all opponents behave more or less like the agent (and thus pruning their actions the same way), but if that assumption doesn't hold I would look into Multi-Agent Reinforcement Learning algorithms.

RemiFabre commented 8 months ago

Very interesting! Thanks for the detailed answer.