Open arr28 opened 10 years ago
Design ramblings
Also consider whether the score for a move should be rounded up to the highest score for any move that's within 1 standard deviation (say) of the chosen move. Thus, if the raw scores were 49.9, 49.9, 49.9, 50.1, 50.1, 50.1 (as they might well be for an early move in Hex, Breakthrough, etc.), you wouldn't get a normalised score of 0 just for choosing one of the first 3 options.
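A minimal sketch of that rounding-up rule, assuming the "1 standard deviation" means each move's Monte-Carlo standard error (the function and parameter names here are hypothetical, not anything in the codebase):

```python
def rounded_up(scores, stderrs):
    """For each move, report the best raw score among all moves whose
    score lies within one standard deviation (here: that move's own
    Monte-Carlo standard error) of it.

    Hypothetical sketch; 'stderrs' stands in for whatever noise
    estimate the real node statistics would provide.
    """
    return [max(sj for sj in scores if abs(sj - si) <= ei)
            for si, ei in zip(scores, stderrs)]


# The example from above: three moves at 49.9 and three at 50.1.  With a
# 1-sigma band of 0.5 the six moves are statistically indistinguishable,
# so all of them report 50.1 and none normalises to 0 just for being one
# of the 49.9s.
raw = [49.9, 49.9, 49.9, 50.1, 50.1, 50.1]
banded = rounded_up(raw, [0.5] * 6)
```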
There is a complication here which means that just changing the bias between exploration and exploitation won't be sufficient. Specifically, we also need to change how we handle complete nodes being propagated up the tree in light of opponent predictability.

Canonically, consider 3-player C4 where a line leads to a win for us next move. The node for the winning move itself will be complete (score 100 for us, 50 for the previous player, 0 for the third player). Suppose now that the player before us (the one considering the parent node) has a generally bad position, and indeed that every move EXCEPT letting us win lets the third player win. Because all children of that decision node are complete (for the deciding player, our win has a score of 50 and the others have a score of 0), completeness propagation will mark IT as complete with the best choice (i.e. the win for us). This will then propagate up again and end up looking like a determined win for us higher up the tree.

However, if the player before us is irrational, they may choose one of the other moves (that lead to the third-player win). Consequently, completeness propagation also has (at least ideally) to be rationality-sensitive.
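For what it's worth, here is one hypothetical shape for a rationality-sensitive back-propagation at an opponent decision node whose children are all complete: rather than taking the mover's best child outright, blend it with the remaining children according to an estimated rationality. (Note that the resulting node is no longer strictly "complete", because its value depends on the rationality estimate.)

```python
def expected_complete_score(children, rationality):
    """Score for us at an opponent decision node whose children are ALL
    complete.  Each child is a (score_for_mover, score_for_us) pair.

    'rationality' in [0, 1] is a hypothetical model: the mover picks
    their best child with that probability, and otherwise picks
    uniformly among the remaining children.
    """
    best_i = max(range(len(children)), key=lambda i: children[i][0])
    others = [c for i, c in enumerate(children) if i != best_i]
    if not others:
        return children[best_i][1]
    avg_others = sum(c[1] for c in others) / len(others)
    return rationality * children[best_i][1] + (1 - rationality) * avg_others


# The 3-player C4 case above, seen from the node where the player before
# us moves: letting us win scores 50 for them and 100 for us; the other
# two moves score 0 for them and 0 for us.
children = [(50, 100), (0, 0), (0, 0)]
```

With `rationality = 1.0` this reduces to today's behaviour (a determined 100 for us); with `rationality = 0.5` it reports 50, reflecting the risk that an irrational mover hands the win to the third player.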
Yes, I don't really like the way we do completeness at the moment. I've already got an issue open for it (taking it from a slightly different angle, but I'll fold this in).
Initially, I'm going to ignore the interaction of predictability and completeness. I'm planning to take predictability first. At some later date I'll take completeness. We can then revisit and see if there's still a problem in this area.
For initializing opponent predictability (and learning it across matches), the Tiltyard now provides additional information in the GGP headers:
GGP-Match-Player-Count: 2
GGP-Match-Player-0: GreenShell
GGP-Match-Player-1: LabOne
Some opponents don't do what we expect them to do. This might be because they're much weaker than us (Random / most Coursera students) and are making poor moves, or because they're stronger and have seen a better move.
Either way, we can improve our play against such opponents by increasing the "exploitation" weight when evaluating nodes which are our opponent's choice.
(Currently we use a fixed weight in all opponent nodes, but we can do better than that by modelling opponent predictability.)
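Concretely, the fixed weight could become a per-node function of whose choice the node is and of our predictability estimate for that opponent. This is only a sketch: the linear scaling is an assumption, and `ucb` is just the standard UCB1 formula included for context:

```python
import math


def exploration_weight(base_c, is_opponent_node, predictability):
    """Per-node exploration constant instead of a single fixed one.

    Hypothetical mapping: at nodes where an opponent chooses, scale the
    exploration constant by how predictable we believe them to be, so an
    opponent who doesn't do what we expect (predictability near 0) gets
    a heavier exploitation bias, as suggested above.  The linear scaling
    is an assumption, not a tuned model.
    """
    if not is_opponent_node:
        return base_c
    return base_c * predictability


def ucb(mean_score, child_visits, parent_visits, c):
    """Standard UCB1 selection value with a configurable constant c."""
    return mean_score + c * math.sqrt(math.log(parent_visits) / child_visits)
```

At our own nodes nothing changes; at an opponent's node the selection value `ucb(mean, n, N, exploration_weight(C, True, p))` leans further toward the highest-scoring child as `p` drops.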