b-inary / desktop-postflop

[Development suspended] Advanced open-source Texas Hold'em GTO solver with optimized performance
GNU Affero General Public License v3.0

GPU Support #7

Closed tnfru closed 8 months ago

tnfru commented 1 year ago

In my Ryzen 7 3700X environment, the following evaluate method consumes about 70% of the execution time.

https://github.com/b-inary/postflop-solver/blob/8eefab68e5e9b839504ceac9aa7f2d136f656889/src/game.rs#L177-L183

To my knowledge, we cannot expect GPUs to improve performance, because this method itself is suited to neither parallelization nor GPU processing (for example, the showdown evaluation is inherently sequential). Instead, the current approach is to perform independent evaluations in parallel. So, at present, the answer is no.

Edit: I recall having read a paper that claims to have accelerated the evaluation with GPUs. However, I have no idea how to do it at present. https://arxiv.org/abs/2007.10442

Originally posted by @b-inary in https://github.com/b-inary/desktop-postflop/issues/1#issuecomment-1369692222
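For context, the "independent evaluations in parallel" mentioned in the quote amounts to coarse-grained CPU parallelism: independent subproblems are evaluated on separate threads, while each evaluation stays sequential internally. A minimal sketch of the idea, with hypothetical names and assuming the rayon crate (not the solver's actual code):

```rust
use rayon::prelude::*;

// Hypothetical stand-in for one self-contained, expensive evaluation
// (e.g. one chance branch or one independent subtree).
fn evaluate_subtree(subtree_id: usize) -> f64 {
    (subtree_id as f64 + 1.0).ln()
}

// Independent evaluations map cleanly onto CPU threads; the sequential
// showdown loop stays inside each call.
fn evaluate_all(num_subtrees: usize) -> Vec<f64> {
    (0..num_subtrees)
        .into_par_iter()
        .map(evaluate_subtree)
        .collect()
}
```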

I'm not familiar with Rust itself, but I work in ML and skimmed through the paper. It seems to be more about running a specific kind of neural network as an agent than about parallelizing the game tree. I'm thinking about the possibility of GPU support like this: after the tree is built, there are N bottom leaves that need to be computed before the earlier moves can be computed. These are the last moves made before the pot ends in any way (showdown or fold). How expensive is this computation? We could assign GPU cores to compute the value of each leaf and then recursively use those values to build up the earlier nodes. This largely depends on the operation performed at the leaf. Can you elaborate on that? I checked the code, but I think I'd need to take a deep dive to get that information.

b-inary commented 1 year ago

I am not an expert in either machine learning or GPU computation, so any advice would be appreciated.

Yes, this paper is mainly about developing a strong agent, with only a brief mention of GPU computation. However, I could find little decent literature on GPU computation of the CFR algorithm.

The rough outline of each iteration of the CFR algorithm is as follows (sorry if you already know the CFR algorithm; a minimal code sketch follows the list):

  1. Both players have a current strategy. In each iteration, the players alternate in the role of the updating player, i.e., the one whose strategy is updated.
  2. (top-down) Compute the counterfactual reach probabilities of the leaf nodes, i.e., the opponent's contribution to the reach probabilities. These values are calculated from the current strategies.
  3. (leaf evaluation) Compute the counterfactual value of each of the updating player's hands. The counterfactual value is the sum, over the opponent's hands, of the product of the counterfactual reach probability and the updating player's payoff.
  4. (bottom-up) Update the strategy by using the calculated counterfactual values.
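In code, one such traversal might look roughly like the following. This is a minimal sketch for illustration only: the node layout and field names are hypothetical, the values are scalars for a single hand, and chance nodes as well as the all-hands-at-once vectorization described below are omitted.

```rust
// Hypothetical node layout for illustration only; the real solver stores
// per-hand vectors and a flattened tree, not a boxed recursive struct.
enum Player {
    Updater,
    Opponent,
}

struct Node {
    player: Player,
    children: Vec<Node>,    // empty => terminal node
    terminal_value: f64,    // payoff for the updating player at a leaf
    regrets: Vec<f64>,      // cumulative regrets (used at Updater nodes)
    opp_strategy: Vec<f64>, // opponent's current strategy (Opponent nodes)
}

// Regret matching: play positive regrets proportionally, else uniformly.
fn regret_matching(regrets: &[f64]) -> Vec<f64> {
    let pos: Vec<f64> = regrets.iter().map(|r| r.max(0.0)).collect();
    let sum: f64 = pos.iter().sum();
    if sum > 0.0 {
        pos.iter().map(|p| p / sum).collect()
    } else {
        vec![1.0 / regrets.len() as f64; regrets.len()]
    }
}

// One traversal for the updating player. `cf_reach` is the opponent's
// contribution to the reach probability (step 2); the return value is the
// node's counterfactual value for the updating player.
fn cfr(node: &mut Node, cf_reach: f64) -> f64 {
    if node.children.is_empty() {
        // Step 3: leaf evaluation, weighted by the opponent's reach.
        return cf_reach * node.terminal_value;
    }
    match node.player {
        Player::Updater => {
            let sigma = regret_matching(&node.regrets);
            let values: Vec<f64> =
                node.children.iter_mut().map(|c| cfr(c, cf_reach)).collect();
            let node_value: f64 = sigma.iter().zip(&values).map(|(p, v)| p * v).sum();
            // Step 4: bottom-up regret update.
            for (r, v) in node.regrets.iter_mut().zip(&values) {
                *r += v - node_value;
            }
            node_value
        }
        Player::Opponent => {
            // The opponent's strategy scales the counterfactual reach.
            let sigma = node.opp_strategy.clone();
            node.children
                .iter_mut()
                .zip(&sigma)
                .map(|(c, &p)| cfr(c, cf_reach * p))
                .sum()
        }
    }
}
```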

Note that the event of dealing the private cards is not included in the game tree. All possible hands are processed together at the leaf nodes. This allows us to reduce the computational complexity: the current implementation takes O(m + n + c) time for each public action line, whereas a naive implementation requires O(mn) time (where m and n are the numbers of possible hands of both players, and c = 52 is the number of cards in a deck).
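Concretely, the showdown part can be evaluated within this bound by sorting both players' hands by hand strength on the current board and doing a single sweep that maintains a running total of the opponent's reach probabilities plus 52 per-card totals for card removal. A minimal sketch of the win side (hypothetical data layout, not the solver's actual code; losses are the symmetric sweep from the strong end, and ties need separate handling):

```rust
// Hypothetical data layout for illustration; the real solver uses packed
// arrays and precomputed strength tables rather than these structs.
struct OppHand {
    cards: (usize, usize), // card indices in 0..52
    strength: u32,         // hand strength on the current board
    prob: f64,             // opponent's counterfactual reach probability
}

struct HeroHand {
    cards: (usize, usize),
    strength: u32,
}

// For every hero hand, accumulate the opponent's probability mass over
// strictly weaker combos, excluding card-conflicting combos.
// Both slices are assumed sorted by ascending strength. Multiplying by the
// win payoff (and running the symmetric sweep from the strong end for
// losses) yields the counterfactual values in O(m + n + 52).
fn weaker_opponent_mass(hero: &[HeroHand], opp: &[OppHand]) -> Vec<f64> {
    let mut total = 0.0; // running sum over strictly weaker opponent combos
    let mut per_card = [0.0f64; 52]; // the same sums restricted to each card
    let mut out = vec![0.0; hero.len()];
    let mut j = 0;
    for (i, h) in hero.iter().enumerate() {
        // Advance over all opponent combos strictly weaker than hero hand i.
        while j < opp.len() && opp[j].strength < h.strength {
            total += opp[j].prob;
            per_card[opp[j].cards.0] += opp[j].prob;
            per_card[opp[j].cards.1] += opp[j].prob;
            j += 1;
        }
        // Subtract combos sharing a card with the hero hand. The only combo
        // sharing both cards is the identical hand, which ties rather than
        // being strictly weaker, so no inclusion-exclusion add-back is needed.
        out[i] = total - per_card[h.cards.0] - per_card[h.cards.1];
    }
    out
}
```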

This is an overview of the process, but please ask me if you want more details.

tnfru commented 1 year ago

Thanks for the in-depth explanation. I'm looking into a few papers I found and will evaluate them. Will update afterwards!

Thomas-MMJ commented 1 year ago

Are these of any use?

https://core.ac.uk/download/pdf/143401218.pdf

https://github.com/janrvdolf/gpucfr

tnfru commented 1 year ago

Both look pretty good. There would also be the option of training an AI to imitate the solver's results; after enough training it would be blazing fast, but that's probably out of the scope of this project.

Are you interested in implementing the paper / transforming the repo to rust?

Thomas-MMJ commented 1 year ago

I'm not a good candidate for implementing these; I lack both Rust and GPU-specific knowledge (my GPU usage is generally via the PyTorch API...).

b-inary commented 1 year ago

I don't know if I would implement it, but if I do, I don't think I would use Rust GPU or Rust CUDA. Instead, I would create a library in CUDA C++ and link it with cc-rs. Although the documentation states that only the GNU/Clang toolchain is supported, I have confirmed that I can run a sample program with the Windows MSVC toolchain.
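To make that concrete, the cc-rs side could look roughly like the build script below. It assumes `cc = "1"` as a build dependency and an installed nvcc toolchain; the file and library names are placeholders and do not exist in this repository.

```rust
// build.rs: a minimal sketch of compiling a hypothetical CUDA C++ file with
// the cc crate and linking it into the Rust binary.
fn main() {
    cc::Build::new()
        .cuda(true)                // compile with nvcc instead of the host C++ compiler
        .file("cuda/evaluate.cu")  // hypothetical kernel plus an extern "C" wrapper
        .compile("evaluate_gpu");

    // Depending on the platform, the CUDA runtime may also need to be linked
    // explicitly.
    println!("cargo:rustc-link-lib=cudart");
}
```

The Rust side would then declare the wrapper function in an `extern "C"` block and call it through FFI.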

tnfru commented 1 year ago

> I don't know if I would implement it, but if I do, I don't think I would use Rust GPU or Rust CUDA. Instead, I would create a library in CUDA C++ and link it with cc-rs. Although the documentation states that only the GNU/Clang toolchain is supported, I have confirmed that I can run a sample program with the Windows MSVC toolchain.

That sounds very interesting; sadly, I won't be much help with C++.

dpmaloney commented 1 year ago

Unfortunately, using GPUs to speed up CFR is tricky even when you are using ML methods that already employ them, because you need to load all the data onto the GPU and then unload it all once it's done. That is a massive bottleneck in an application like this, which already doesn't have good ways to manage the cache. (The gpucfr implementation uses a tiny game tree, and if my memory serves, it keeps the entire tree in GPU memory, so this bottleneck doesn't exist.) I have considered trying to use the GPU for terminal-node evaluation, since it might be able to leverage some GPU tricks, but it would probably require a huge amount of work to get any real performance benefit.