facebookresearch / rebel

An algorithm that generalizes the paradigm of self-play reinforcement learning and search to imperfect-information games.
Apache License 2.0
639 stars 108 forks source link

How does chance node work in subgame solving? #20

Open lilith829 opened 3 years ago

lilith829 commented 3 years ago

Hi Anton,

I am not clear about how chance node works in subgame solving at the end of a betting round. Any clarifications would be very helpful. Thank you.

  1. The paper shows that ReBeL only solves until the end of a betting round. In that case, the leaf nodes are either terminal nodes or chance nodes. To get the value of a chance node, a value network is used. But what should be the active player of a chance node? The paper uses binary flag to encode active player which doesn't seem to include the chance player.

  2. The paper shows that CFR only solves to the end of a betting round, but subgame seems to extend to the start of the next betting round. In that case, there is a gap between the end of a betting round and the start of the next round when the board cards are dealt. How does ReBeL model it? Does it model the chance node and its children to be a separate tree itself. The value of the chance node is the average value of the values of all its children whose values are queried from the value network. Is it true? (referred to: the paragraph in the paper: "Our agent always solves to the end of the current betting round regardless...")

akhti commented 3 years ago

Hi there!

  1. For poker it doesn't really matter as we only query at the end of a betting round, and so we know who will make the next move from the number of the cards on the board.
  2. One out of every 3 datapoints at the end of a betting round we would query the neural net for the values after all ~50 board cards are dealt, average those (not a simple average, have to account for blockers) and then use the average as a training example for the value before the board card is dealt
lilith829 commented 3 years ago

Thank you very much for the explanation. Would you please clarify a few points? Thank you.

  1. Do you mean that to query the value network for the chance nodes (at the end of a betting round), the active player would be the starting player of the next betting round, correct?

  2. For example, on the flop, there are total 22,100 possible board cards (Take 3 cards out of 52 cards). Does the algorithm query the NN for all 22,100 combinations? What do you mean by 1 out of 3 data points and ~50 board cards are dealt? As for the blockers, do you mean that the value of a hand needs to be zeroed out if the hand is overlapped with the board cards?

Attaching the figure from DeepStack paper to illustrate the chance nodes (green nodes) that I mention. image

brewer-b commented 3 years ago

For solving multi-street games, I've seen the average calculated by summing the counterfactual value over each chance card (counterfactual value being 0 when there is collision with the board), divided by 52(deck) - 3(board) - 2(player1 hand) - 2(player2 hand) = 45 when dealing flop->turn, or divided by 44 when dealing turn->river (52 - 4 - 2 - 2). So on the turn to river, you're summing 46 nonzero values, but dividing by 44. This is not really intuitive to me. I'm not certain that it's correct, but I read that was how it was done in Cepheus (I haven't checked myself). Is this the same approach when using neural nets? Or is the divide by 45 or 44 thing wrong altogether?

lilith829 commented 3 years ago

My understanding is that

When taking the average, zero out the values of the hands that are impossible given the newly dealt board cards. According to the paper, the above process is not used in the searching algorithm due to the large number of leaf nodes needs to be queried, but it happens at the end of a betting round. I am not sure if this is the process that actually happens. Feel free to correct me. And I am not sure about the 2nd point that Anton has. Maybe he could help clarify how ReBeL works in this part. Thanks.

noambrown commented 3 years ago

I can help answer some of these questions:

  1. Yes, the active player would be the starting player of the next betting round.

  2. Yes, we periodically query all 22,100 flops and average over all of them to get a training example for a preflop leaf PBS. This isn't very expensive because we don't do it every CFR iteration, or even every play-through. We only do it for 1/3rd of the play-throughs. Similarly, to train the value network at the end of the flop, we average over the value network output for all possible turn cards and take the (reach-weighted) average, and use that average as the training target. This is different from most other bots that use deep counterfactual value networks, because those other bots query the value network at the start of the turn when they are solving a flop. That makes the solving more expensive though. We wanted to do something faster (and simpler).

Brewer, the 46/44 thing is important but a bit of a separate discussion from this. It applies more to CFR when it's being run on multiple streets. In ReBeL we always run CFR only on the current street.

lilith829 commented 3 years ago

Thank you Noam, Anton for the detailed explanation. I understand better now. I'll close this ticket.

yffbit commented 2 years ago

@noambrown @akhti

We only do it for 1/3rd of the play-throughs.

Do you mean that, for every pre-flop leaf PBS, query all 22,100 flops and average over all of them, with probability 1/3, alternatively, just query itself (no need to query all 22,100 flops) with probability 2/3

noambrown commented 2 years ago

I'm not sure what you mean by "every pre-flop leaf PBS". I'm saying for every preflop leaf PBS (before the flop) that we would add to the buffer, we with 1/3rd probability query all 22,100 flops and average over all of them to get a training example.