Open arr28 opened 10 years ago
Design ramblings
Also consider whether the score for a move should be rounded up to the highest score for any move that's within 1 standard deviation (say) of the chosen move. Thus, if the raw scores were 49.9, 49.9, 49.9, 50.1, 50.1, 50.1 (as they might well be for an early move in Hex, Breakthrough, etc.), you wouldn't get a normalised score of 0 just for choosing one of the first 3 options.
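A minimal sketch of that rounding-up rule, assuming the "1 standard deviation" means each move's Monte-Carlo standard error (the function and parameter names here are hypothetical, not anything in the codebase):

```python
def rounded_up(scores, stderrs):
    """For each move, report the best raw score among all moves whose
    score lies within one standard deviation (here: that move's own
    Monte-Carlo standard error) of it.

    Hypothetical sketch; 'stderrs' stands in for whatever noise
    estimate the real node statistics would provide.
    """
    return [max(sj for sj in scores if abs(sj - si) <= ei)
            for si, ei in zip(scores, stderrs)]


# The example from above: three moves at 49.9 and three at 50.1.  With a
# 1-sigma band of 0.5 the six moves are statistically indistinguishable,
# so all of them report 50.1 and none normalises to 0 just for being one
# of the 49.9s.
raw = [49.9, 49.9, 49.9, 50.1, 50.1, 50.1]
banded = rounded_up(raw, [0.5] * 6)
```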
There is a complication here which means that just changing the bias between exploration and exploitation won't be sufficient. Specifically, we also need to change how we handle complete nodes being propagated up the tree in light of opponent predictability.

Canonically, consider 3-player C4 where a line leads to a win for us next move. The node for the winning move itself will be complete (score 100 for us, 50 for the previous player, 0 for the third player). Suppose now that the player before us (the one considering the parent node) has a generally bad position, and indeed that every move EXCEPT letting us win lets the third player win. Because all children of that decision node are complete (for the deciding player, our win has a score of 50 and the others have a score of 0), completeness propagation will mark IT as complete with the best choice (i.e. the win for us). This will then propagate up again and end up looking like a determined win for us higher up the tree.

However, if the player before us is irrational, they may choose one of the other moves (that lead to the third-player win). Consequently, completeness propagation also has (at least ideally) to be rationality-sensitive.
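For what it's worth, here is one hypothetical shape for a rationality-sensitive back-propagation at an opponent decision node whose children are all complete: rather than taking the mover's best child outright, blend it with the remaining children according to an estimated rationality. (Note that the resulting node is no longer strictly "complete", because its value depends on the rationality estimate.)

```python
def expected_complete_score(children, rationality):
    """Score for us at an opponent decision node whose children are ALL
    complete.  Each child is a (score_for_mover, score_for_us) pair.

    'rationality' in [0, 1] is a hypothetical model: the mover picks
    their best child with that probability, and otherwise picks
    uniformly among the remaining children.
    """
    best_i = max(range(len(children)), key=lambda i: children[i][0])
    others = [c for i, c in enumerate(children) if i != best_i]
    if not others:
        return children[best_i][1]
    avg_others = sum(c[1] for c in others) / len(others)
    return rationality * children[best_i][1] + (1 - rationality) * avg_others


# The 3-player C4 case above, seen from the node where the player before
# us moves: letting us win scores 50 for them and 100 for us; the other
# two moves score 0 for them and 0 for us.
children = [(50, 100), (0, 0), (0, 0)]
```

With `rationality = 1.0` this reduces to today's behaviour (a determined 100 for us); with `rationality = 0.5` it reports 50, reflecting the risk that an irrational mover hands the win to the third player.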
Yes, I don't really like the way we do completeness at the moment. I've already got an issue open for it (taking it from a slightly different angle, but I'll fold this in).
Initially, I'm going to ignore the interaction of predictability and completeness. I'm planning to take predictability first. At some later date I'll take completeness. We can then revisit and see if there's still a problem in this area.
For initializing opponent predictability (and learning it across matches), the Tiltyard now provides additional information in the GGP headers:
GGP-Match-Player-Count: 2
GGP-Match-Player-0: GreenShell
GGP-Match-Player-1: LabOne
Some opponents don't do what we expect them to do. This might be because they're much weaker than us (Random / most Coursera students) and are making poor moves, or because they're stronger and have seen a better move.
Either way, we can improve our play against such opponents by increasing the "exploitation" weight when evaluating nodes which are our opponent's choice.
(Currently we use a fixed weight in all opponent nodes, but we can do better than that by modelling opponent predictability.)
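Concretely, the fixed weight could become a per-node function of whose choice the node is and of our predictability estimate for that opponent. This is only a sketch: the linear scaling is an assumption, and `ucb` is just the standard UCB1 formula included for context:

```python
import math


def exploration_weight(base_c, is_opponent_node, predictability):
    """Per-node exploration constant instead of a single fixed one.

    Hypothetical mapping: at nodes where an opponent chooses, scale the
    exploration constant by how predictable we believe them to be, so an
    opponent who doesn't do what we expect (predictability near 0) gets
    a heavier exploitation bias, as suggested above.  The linear scaling
    is an assumption, not a tuned model.
    """
    if not is_opponent_node:
        return base_c
    return base_c * predictability


def ucb(mean_score, child_visits, parent_visits, c):
    """Standard UCB1 selection value with a configurable constant c."""
    return mean_score + c * math.sqrt(math.log(parent_visits) / child_visits)
```

At our own nodes nothing changes; at an opponent's node the selection value `ucb(mean, n, N, exploration_weight(C, True, p))` leans further toward the highest-scoring child as `p` drops.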