datamllab / rlcard

Reinforcement Learning / AI Bots in Card (Poker) Games - Blackjack, Leduc, Texas, DouDizhu, Mahjong, UNO.
http://www.rlcard.org
MIT License

nfsp_agent samples best-response instead of average policy #50

Closed (ghost closed this issue 4 years ago)

ghost commented 4 years ago

It looks like nfsp_agent samples the best-response network in evaluation mode. I copied this behavior in the PyTorch implementation. However, Theorem 7 in [1] argues that it is the average strategy profile that converges to a Nash equilibrium. Sampling the best-response network produces a deterministic pure strategy, while the average policy network produces a stochastic behavioural strategy. This is discussed in Section 4.2 of [2]. Also, it looks like DeepMind's implementation [3] samples the average policy network in evaluation mode.

Am I missing something?
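
For concreteness, here is a minimal sketch of the two evaluation behaviours I am comparing; the function and variable names below are illustrative only, not RLCard's actual API:

```python
import numpy as np

# Illustrative only: `q_values` stands in for the output of the
# best-response (RL) network and `avg_policy_probs` for the output of the
# average-policy (SL) network on the current observation.

def eval_step_best_response(q_values):
    """Deterministic pure strategy: greedy action over Q-values."""
    return int(np.argmax(q_values))

def eval_step_average_policy(avg_policy_probs, rng=np.random.default_rng()):
    """Stochastic behavioural strategy: sample an action from the
    average-policy network's output distribution."""
    return int(rng.choice(len(avg_policy_probs), p=avg_policy_probs))
```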

References:
[1] Heinrich et al. (2015), "Fictitious Self-Play in Extensive-Form Games"
[2] Heinrich and Silver (2016), "Deep Reinforcement Learning from Self-Play in Imperfect Information Games"
[3] Lanctot et al. (2019), "OpenSpiel: A Framework for Reinforcement Learning in Games"

daochenzha commented 4 years ago

@mjudell Thank you for pointing this out, and thank you for your efforts on the PyTorch implementation. We are currently reviewing it.

Yes, it is the average policy rather than the best response that converges to a Nash equilibrium.

The current code uses the best response because we found that it performs equally well (or even seems better) in our preliminary experiments, so we left it there. This may be due to suboptimal hyperparameters, the network structure, or the evaluation metrics, since we currently evaluate with tournaments and human play through the interface, both of which may be biased. We will soon implement an exploitability metric, which should be a better measure.
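
For intuition, here is a toy sketch of what an exploitability metric measures, on a small two-player zero-sum matrix game (rock-paper-scissors). This only illustrates the concept; it is not the evaluation code we plan to add:

```python
import numpy as np

# Row player's payoff matrix for rock-paper-scissors; the column player's
# payoff is the negative of this.
PAYOFF = np.array([[ 0, -1,  1],
                   [ 1,  0, -1],
                   [-1,  1,  0]], dtype=float)

def exploitability(row_strategy, col_strategy):
    """Average payoff a best responder gains against each player.
    It is zero exactly when (row_strategy, col_strategy) is a Nash equilibrium."""
    br_value_vs_col = np.max(PAYOFF @ col_strategy)       # best response to the column player
    br_value_vs_row = np.max(-(PAYOFF.T @ row_strategy))  # best response to the row player
    return (br_value_vs_col + br_value_vs_row) / 2

uniform = np.ones(3) / 3
pure_rock = np.array([1.0, 0.0, 0.0])
print(exploitability(uniform, uniform))    # 0.0: the uniform profile is a Nash equilibrium
print(exploitability(pure_rock, uniform))  # 0.5: a deterministic pure strategy is exploitable
```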

From a scientific point of view, we should definitely use the average policy as in the original paper. We will correct the code and run more experiments with the average policy and the exploitability metric. Thanks :)

daochenzha commented 4 years ago

@mjudell Section 4.2 in [2] shows that a deterministic, greedy strategy learned through self-play (such as DQN) leads to a highly exploitable policy. This is also confirmed in our experiments: although DQN performs well against random agents, it performs worse than NFSP when they play against each other.

It seems that the best-response policy in NFSP also delivers reasonably good performance (better than DQN). This makes sense to me, since during training the best-response policy is mixed with the slowly changing average policy.
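
Roughly, the mixing I am referring to looks like the sketch below, where eta is the anticipatory parameter from the NFSP paper; the names are illustrative rather than RLCard's exact API:

```python
import numpy as np

def sample_episode_mode(eta=0.1, rng=np.random.default_rng()):
    """NFSP picks a mode (typically once per episode): play the epsilon-greedy
    best response with probability eta, otherwise play the average policy."""
    return 'best_response' if rng.random() < eta else 'average_policy'

def training_action(mode, q_values, avg_policy_probs, rng=np.random.default_rng()):
    """Illustrative action selection for the sampled mode (epsilon-greedy
    exploration on the best response is omitted for brevity)."""
    if mode == 'best_response':
        return int(np.argmax(q_values))
    return int(rng.choice(len(avg_policy_probs), p=avg_policy_probs))
```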

Correct me if my understanding is wrong :)

ghost commented 4 years ago

Sounds right to me, and it looks like Figure 2(a) in [2] also finds that the best response outperforms the average policy against SmooCT.

daochenzha commented 4 years ago

@mjudell Thanks for letting me know; I hadn't noticed this figure. It seems better to add an argument that decides whether the average policy or the best response is used for evaluation, since both seem to work.
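
Something along these lines, as a sketch only (the argument name and class structure here are illustrative; see the linked commit for the actual change):

```python
import numpy as np

class NFSPAgentSketch:
    """Sketch of an NFSP agent with a switch for the evaluation policy.
    `evaluate_with` and the network attributes are illustrative names only."""

    def __init__(self, q_net, avg_policy_net, evaluate_with='average_policy'):
        self.q_net = q_net                    # best-response (RL) network
        self.avg_policy_net = avg_policy_net  # average-policy (SL) network
        self.evaluate_with = evaluate_with    # 'average_policy' or 'best_response'

    def eval_step(self, obs):
        if self.evaluate_with == 'best_response':
            return int(np.argmax(self.q_net(obs)))
        probs = self.avg_policy_net(obs)  # assumed to return a probability vector
        return int(np.random.choice(len(probs), p=probs))
```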

Addressed in https://github.com/datamllab/rlcard/commit/7df28e587a17ae8fa6dc26c13cf2b098ca147d62