datamllab / rlcard

Reinforcement Learning / AI Bots in Card (Poker) Games - Blackjack, Leduc, Texas, DouDizhu, Mahjong, UNO.
http://www.rlcard.org
MIT License

Getting started with rlcard #94

Closed brupelo closed 3 years ago

brupelo commented 4 years ago

Hi, first of all, I'm asking this as a total amateur in the ML field... other than a small university course on neural networks years ago, my knowledge of modern ML is pretty limited (not my area of expertise).

Thing is, I was researching whether incomplete-information games such as Texas Hold'em have been solved nowadays, and I've seen attempts such as Libratus and Pluribus, which made me curious... It seems reinforcement learning is the trend for solving incomplete-information games.

Ok, let's say I wanted to create a strong Texas Hold'em bot using the rlcard framework; here are some questions:

Really interesting project... I'll read the docs from start to finish, so please assume when you answer that I've already done so :)

Off-topic question: I see the docs have been generated using Sphinx's https://github.com/rtfd/sphinx_rtd_theme, but I don't see any script available to build the online docs locally... am I missing something? Is there another repo that contains the scripts/templates?

daochenzha commented 4 years ago

Hi @brupelo, thanks for your interest! I would suggest starting from Leduc Hold'em (which is a simplified version of Texas Hold'em).

First of all, reinforcement learning is currently still not as strong as Libratus or Pluribus in Texas Hold'em. However, reinforcement learning is more efficient, since it is a sampling-based method, and it is more general. I believe that with more research effort, the gap can be narrowed.

  1. There are two categories of algorithms. The first is CFR-based. We only implement basic CFR in this repo. Basic CFR is not efficient enough for Texas Hold'em (but it handles Leduc Hold'em well); there are lots of variants, which you can find by simply googling CFR. The second category is based on reinforcement learning. We have implemented NFSP from this category. It usually does not perform as well as CFR in small games, but it is more efficient in large games. I believe that through careful tuning (hyperparameters, state representation, action design, and reward design), NFSP could perform well in Texas Hold'em. We did not fine-tune these things in this repo, since our focus is to provide an easy-to-use environment.

  2. More players make the game much more difficult. I would suggest starting by experimenting with more players in Leduc Hold'em.

  3. We have some examples in /examples on how to start training and save models (see the sketch after this list). Training usually achieves good performance on Leduc Hold'em; for the other games it shows some improvement but needs further tuning. Yes, we support multi-processing, but due to the recent update of the interfaces it has some issues (we are working on them). We do not support a database.

  4. There are several ways to know whether the agent is improving. The easiest is to launch tournaments against a random agent, which is the default setting in the examples. This gives a sense of whether the agent is improving, but it cannot serve as a formal evaluation metric. A commonly adopted metric in publications is exploitability (we are implementing this; some issues remain). This metric measures the weakness of the agent. It is accurate, but hard to use in large games, since computing it there is too expensive. We also support a third way: we have implemented several rule-based agents in /models, and we can gauge performance by launching tournaments against these rule-based bots. We can also use the human interface to analyze the behaviour.

  5. Rule agents may mimic human behaviour, since they are designed based on human knowledge.
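
To make points 1, 3, and 4 concrete, here is a rough sketch adapted from the scripts in /examples: train a CFR agent on Leduc Hold'em, checkpoint it, and periodically run a tournament against a random agent. The exact names (CFRAgent, RandomAgent, tournament, action_num, the allow_step_back config key) follow the current examples and may differ slightly between rlcard versions.

```python
import rlcard
from rlcard.agents import CFRAgent, RandomAgent
from rlcard.utils import tournament

# CFR traverses the game tree, so the training env must allow stepping back.
env = rlcard.make('leduc-holdem', config={'allow_step_back': True})
eval_env = rlcard.make('leduc-holdem')

agent = CFRAgent(env, model_path='./cfr_model')
eval_env.set_agents([agent, RandomAgent(action_num=eval_env.action_num)])

for episode in range(1000):
    agent.train()                          # one CFR iteration over the game tree
    if episode % 100 == 0:
        agent.save()                       # checkpoint the policy to ./cfr_model
        payoffs = tournament(eval_env, 1000)
        print('episode', episode, 'average payoff vs. random:', payoffs[0])
```

As noted in point 4, the average payoff against the random agent is only a sanity check that the agent is improving, not a formal evaluation metric.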

Yes, the templates are kept in another place to keep rlcard lightweight. They are here: https://github.com/rlcard/rlcard.github.io

For the documentation, I would recommend the files in /doc in this repo. We are updating the documentation; the version on the website is a little bit out of date.

Hopefully, I addressed all of your questions.

brupelo commented 4 years ago

@daochenzha Thank you very much for all the insightful information, that really helps.

  1. Interesting, so the whole project revolves around the idea from this paper: https://arxiv.org/pdf/1811.00164.pdf ... It seems quite up to date with respect to the provided references (i.e. https://arxiv.org/pdf/1811.00164.pdf); would it be a good starting point to read in order to understand the basics of rlcard? I'm curious though... you say there are two algorithms, CFR & NFSP, and I wonder... do they use an existing architecture from https://www.asimovinstitute.org/neural-network-zoo/ or is it a new one?
  2. I've played quite a lot of NLTH in the past but I wasn't aware of Leduc; it seems its level of complexity is lower than NLTH's.
  3. Ok... but let's say I've been training my bot for a few hours and I stop the training; I guess I could resume the training on another computer, right? Why do I ask about scaling up training with multiple servers? I have no idea what the algorithm's maths are, but for similar algorithms, such as unbiased Monte Carlo estimator sampling, you can theoretically do parallel processing over a network easily (assuming the workers use different initial seeds) and merge the results.
  4. Really interesting... would it be beneficial to upload the bots created by rlcard's users to some repo? And maybe to create tournaments with such bots? Would that help further the training, or would it be nonsense... i.e. you don't want to add any noise to the solution, right?
  5. Kind of related to 4... theoretically speaking, let's say you've trained your bot using proper algorithms (let's call it the good bot) and I give you a bot that mimics human behaviour (let's call it the bad bot) whose play style is poor; would training the good bot against the bad bot affect the good bot negatively? Why am I asking this? I assume you want to expose the bot to as much heterogeneous data as possible (i.e. real-world cases).

Anyway, I'll definitely give the project a shot... it has really caught my attention; really cool stuff, looks fun... I've wanted to create a Go bot (something like AlphaGo) for ages but never found the time to learn about it. It seems trying to solve these card games is a more "feasible" task :)

daochenzha commented 4 years ago

Thanks for the interest. CFR & NFSP are two basic algorithms that represent two different ideas for solving card games. CFR does not use neural networks and only deals with the tabular case; Deep CFR is a combination of CFR and neural networks. NFSP uses neural networks, and the network architecture could be any of the forms you mentioned in the link. To fully understand the idea of CFR, it is better to refer to the original paper (with lots of maths): https://poker.cs.ualberta.ca/publications/NIPS07-cfr.pdf
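
To make the tabular idea concrete, the rule at the heart of CFR is regret matching: at each information set the agent keeps cumulative regrets for every action and derives its strategy from the positive regrets. A minimal, self-contained illustration (not rlcard code) could look like this:

```python
import numpy as np

def regret_matching(cumulative_regrets):
    """Turn cumulative regrets at one information set into a strategy.

    Actions with positive regret are played in proportion to that regret;
    if no action has positive regret, fall back to a uniform strategy.
    """
    positive = np.maximum(cumulative_regrets, 0.0)
    total = positive.sum()
    if total > 0:
        return positive / total
    return np.full(len(positive), 1.0 / len(positive))

# Example: three actions with cumulative regrets [2, 0, 1] -> play them 2/3, 0, 1/3.
print(regret_matching(np.array([2.0, 0.0, 1.0])))
```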

To get a sense of reinforcement learning in card games, the NFSP paper is a good starting point: https://arxiv.org/abs/1603.01121

For neural networks, yes, we can perfectly well continue training. However, for off-policy reinforcement learning algorithms there is an issue: in addition to the weights of the neural networks, algorithms like deep Q-learning need to maintain a replay buffer that stores lots of historical data. To fully continue training, we also need to save the data in the buffer, and this data could be very large. Our idea of parallelization will mainly focus on data generation (since simulation in the game engine often takes most of the time). Reinforcement learning algorithms can also run in parallel; there are many papers studying parallel reinforcement learning that could be helpful.
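
As a hypothetical illustration of why resuming off-policy training needs more than the network weights, a checkpoint has to carry the replay buffer as well. The objects below are made-up stand-ins, not rlcard's internals:

```python
import pickle
from collections import deque

import numpy as np

# Made-up stand-ins for what a DQN-style agent carries around: the network
# weights (small) and a replay buffer of transitions (potentially very large).
weights = {'w1': np.random.randn(64, 4), 'b1': np.zeros(4)}
replay_buffer = deque(maxlen=100000)
replay_buffer.append({'state': np.zeros(64), 'action': 1, 'reward': 0.5, 'done': False})

# Checkpoint both; restoring only the weights would throw away the agent's experience.
with open('dqn_checkpoint.pkl', 'wb') as f:
    pickle.dump({'weights': weights, 'buffer': list(replay_buffer)}, f)

# On another machine, load the checkpoint and continue training where it stopped.
with open('dqn_checkpoint.pkl', 'rb') as f:
    checkpoint = pickle.load(f)
weights = checkpoint['weights']
replay_buffer = deque(checkpoint['buffer'], maxlen=100000)
```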

Uploading more models would definitely be helpful. We have some pre-trained models and rule-based models in /models. These models can be directly imported and compared, and users could also upload their models there as baselines for comparison in the future. For your point 5, I think your concern totally makes sense: by exposing the bot to more heterogeneous data, it could generalize better. NFSP uses the idea of fictitious self-play, that is, the agent is trained to play against its own average behaviour. See https://arxiv.org/abs/1603.01121
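
As a sketch of how the packaged models can be used as baselines, something like the following should work; the model id 'leduc-holdem-rule-v1' and the models.load interface are taken from the /models package and may differ between versions:

```python
import rlcard
from rlcard import models
from rlcard.agents import RandomAgent
from rlcard.utils import tournament

env = rlcard.make('leduc-holdem')

# Load a packaged rule-based Leduc agent and pit it against a random agent.
rule_model = models.load('leduc-holdem-rule-v1')
env.set_agents([rule_model.agents[0], RandomAgent(action_num=env.action_num)])

payoffs = tournament(env, 10000)
print('average payoff of the rule agent vs. random:', payoffs[0])
```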

Thanks again for your interest. Card games are very challenging. I believe there are a lot of things to be explored in card games.