Closed daochenzha closed 3 years ago
@daochenzha is there any literature on efficient best response calculation? I've skimmed [1] and [2] but they are rather difficult to understand. I could port the implementation in OpenSpiel but it would be nice to a have a high level description of what they're doing.
[1] Zinkevich et al. (2007) "Regret Minimization in Games with Incomplete Information." [2] Johanson et al. (2011) "Accelerating Best Response Calculation in Large Extensive Games."
@mjudell The idea of best response is, given a policy, what response to the policy will win the most over this policy. In the OpenSpiel implementation, they traverse the whole game tree and obtain a value on each node recursively. Value means the expected payoff.
There are three kinds of nodes. For the terminal node, we use the payoff for the game as the value. For the node of the best responder, we choose the best action that can obtain the highest value. For the node of the other player, we use the weighted sum of the values of the child nodes. The weights are based on the given policy. Thus, the value of the node essentially means the expected payoff if the opponent behaves with the given policy and I behave optimally.
The best response is a way to calculate exploitability. If the value of the best response (high exploitability), it means the policy has a lot of weaknesses. Hopefully, I explain it well :)
Since the calculation needs to traverse the whole tree, exploitability may be infeasible for very large games. [2] actually proposed several tricks to accelerate the above calculation.
A little while back I reached out to an open_spiel contributor, who shared some super helpful context:
I can try to help to describe the best response algorithm though. It is basically the same as the backward induction best response algorithm from perfect information games if you're familiar with it. If not, it might be best to start with that algorithm. For imperfect information games, we simply extend this to allow chance transitions (by taking the expectation over outcomes) and to allow for the true state of the world to be hidden. To deal with the hidden information, you simply need to keep track of the possibility that the responder reaches each terminal state if they were to take any action, which depends on the strategies of the other players. In fact, once the other strategies are fixed, the game becomes a Markov decision process (MDP), where the state transition probabilities depend on the other player strategies and chance, and hidden information is just resolved through a chance transition at a terminal state. The best response algorithm is then really just dynamic programming on this induced MDP.
There are a few enhancements one can make to this general algorithm depending on the game's structure, but perhaps the most crucial and unfamiliar one is that sometimes you can evaluate the expected value of information sets/states (sets of world states that cannot be distinguished by a particular player) faster than inspecting all information set combinations across players. For example, in a showdown in heads-up poker, you would usually compare each of player 1's potential hidden cards with each of player 2's potential hidden cards. But if you instead sort the possible hands by their strength, you can evaluate the expected value with just another linear pass over the set of hands, reducing the complexity to O(n log n). This is described in more detail in the Efficient Terminal Node Evaluation section.
Literature on extensive-form games can seem unfamiliar and daunting if you haven't seen it before, even though the underlying concepts are often intuitive and straightforward once you get a handle on the notation and core concepts. I would recommend looking at some of the more recent theses published by the U of A poker group. In particular, I would try Neil Burch's and Rich Gibson's. You can also try this tutorial on CFR.
@mjudell Looks cool. Thank you for letting me know!
Implement best response with step and step_back