Entropy-guided exploration for better tactical capabilities

kk2796 commented 5 years ago

I originally posted this idea in Google Forum and was encouraged to echo it here. That thread lays out an intuitive case that using entropy to guide exploration can yield highly efficient search strategies that cannot be realized with "success likelihood" and "visit counts" alone. For the skeptics, please feel free to peruse the original thread (or to see something a bit more concrete, skim down in this thread to COOLEST POINT OF ENTIRE POST IMO )

I'm not going to spend any more time arguing for this change; I think the strongest case for it arises from exploring how it could/should work. On that note - I've taken a bit of time to explore the math of this idea, and I'd like to lay out a plausible implementation. I do not claim this is an optimal implementation.

First, it's necessary to define entropy in the context of PUCT search. This is the context where the search is in state "s" and must choose an action "a" using probabilities that align with the various U(s,a) values of all actions. The action chosen will accord to a playout to the node associated with that action. There is already a good amount of information that's calculated and pulled together to compute U(s,a) for action a:

Q(s,a) - the average of all value-head evaluations from prior visits to the node at a and its descendants
cPuct - multiplier to increase the impact of exploration
P(s,a) - the a-priori policy-head probability of action a
N(s,b) - the number of times this state has been reached.
N(s,a) - the number of times action a has already been chosen (always <= N(s,b)).

To this mix, I introduce the term E(s,a). E(s,a) is the entropy associated with the subtree rooted at the node of action a. The entropy of a subtree is the entropy of the distribution of probabilities of a playout from the root of that subtree reaching any particular leaf node (either an unvisited node or terminal "mate" node).
An algorithm for computing E(s,a) for subtree T: 1) Enumerate every possible leaf node that can be reached with a playout from the root node of T. 2) For each playout, computing the step-wise probability of each discrete action (that is, comparing magnitude of U(s',a') of each step to the sum of U(s',*) of all that step's sibling actions).
3) compute the product of all the playout's step-wise probabilities (from root to leaf) to attain the final probability of the playout 4) repeat 2-3 for all possible/legal playouts enumerated in (1) 5) compute entropy from this final set of probabilities, using information-theory formula for entropy of N probabilities: Entropy({p1, p2, p3, ... pN}) = (-1)⋅ sum[n in 1..N]{ pn ⋅ log( pn , baseN) } 6) if there is only one possible playout, e.g. the first visit to a node must visit that node, or there is only one legal response to a move, that equates to a set of a single probability {1}. Entropy is defined to be 0 in this case (in fact, {1} is the only probability set where entropy=0).

For those not familiar with the math here - this entropy calculation will always return a value between 0 and 1. The more uniform the probabilities are, the closer the entropy will be to 1; the more diverse the probabilities, the closer it will be to 0. To illustrate, here are a few sets of 4 probabilities ("log4" indicates log base 4)

low entropy {0.01, 0.01, 0.01, 0.97} (-1)⋅(0.01⋅log4(0.01) + 0.01⋅log4(0.01) + 0.01⋅log4(0.01) + 0.97⋅log4(0.97)) = 0.12
moderate entropy {0.2, 0.3, 0.4, 0.1} = (-1)⋅(0.2⋅log4(0.2) + 0.3⋅log4(0.3) + 0.4⋅log4(0.4) + 0.1⋅log4(0.1)) = 0.92
high entropy {0.25, 0.25, 0.25, 0.25} = (-1)⋅(0.25⋅log4(0.25) + 0.25⋅log4(0.25) + 0.25⋅log4(0.25) + 0.25⋅log4(0.25)) = 1.00
Note - the final example is actually perfectly uniform and thus achieves the maximum possible entropy of exactly 1.

Before I move onto the next topic (what to do with E(s,a)) I should mention that I'm not entirely clear on a low-cost method to track entropy according to this definition. In fact, this may be the largest barrier to implementation. However, there are two reasons to think it may not be a problem: 1) as subtrees grow, the newfound bias toward exploring areas of low entropy should push the entropy values to climb faster than they would otherwise climb by chance 2) the manner in which I plan to incorporate E(s,a) into the overall U(s,a) computation will actually cause a subtree's entropy to become less influential as the number of visits to that subtree go up.

So - what do I propose we do with E(s,a)? No reason to draw it out: Old formula without E(s,a) U(s,a) = Q(s,a) + cPuct ⋅ P(s,a) ⋅ N(s,b)^0.5 ⋅ 1/(1+N(s,a)) New formula: U(s,a) = Q(s,a) + cPuct ⋅ P(s,a) ⋅ N(s,b)^0.5 ⋅ 1/(1+N(s,a)⋅E(s,a))

Pausing for a quick FAQ: Q: So, "voila" you've come up with a better search formula than Deepmind? A: As has been the case in many other regards, I suspect Deepmind team's sole concern was to find a simple solution that was "good enough" to carry their thesis. I also suspect the benefits of entropy-bias are far higher depending on sharpness of game... so while this might've made a huge difference if A0 has looked at entropy for chess, it might have been negligible against Go and Shogi. Q: Did... did you just shove the new E(s,a) factor onto the end and hope it would work? A: Maybe. But hear me out: this winds up being pretty slick! Q: Are you seriously wasting github developer's time with cheesy fake dialog? Shouldn't you get on with this? A: Ah - yeah, good point.

So, as I alluded to in the FAQ - while this may seem arbitrary at first blush, it truly is cool how well it works:

1) It's 100% consistent with current behavior for unexplored nodes: both the number of visits N(s,a) and entropy E(s,a) have value of zero, thus N(s,a) = 0 = N(s,a)⋅E(s,a) = 0⋅0

2) It's almost identical to current behavior if distributions of probabilities are nearly uniform. In that case, E(s,a) will be very close to 1, and E(s,a)⋅N(s,a) =~= N(s,a). In other words, the math will only allow the behavior to deviate from typical PUCT search if there are areas of high entropy.

3)

COOLEST POINT OF ENTIRE POST IMO (if you read nothing else, read this) In one circumstance where the E(s,a) significantly impacts U(s,a) and deviates from A0/Leela results, not only does the E(s,a)-based result make more sense than the current formula, but it's so obviously more-correct that it seems the entire case for E(s,a) could stand on this example. The circumstance I'm alluding to is a chess move that corresponds to action a, for which there is just one legal response, a'.

Under the current/Leela/A0 approach to exploration, when action a is unvisited, N(s,a) = 0 and the exploration term for that action has small denominator of (1+0), giving overall exploration factor of N(s,b)^0.5 ⋅ (1/(1+0)) = N(s,b)^0.5 ⋅ 1 = N(s,b)^0.5. In other words, for as long as action a remains unvisited, as N(s,b) grows, the likelihood of this unexplored node being played will climb with sqrt of N(s,b). After the action a is visited, N(s,a') will be 1, and the exploration term will be abruptly cut in half, from factor N(s,b)^0.5 to factor of (1+N(s,b))^0.5 ⋅ (1/(1+1)) =~= (N(s,b)^0.5) / 2

Now consider what happens when E(s,a) is added to the equation. The U(s,a) value for unvisited node is the same as without entropy: N(s,b)^0.5 ⋅ (1/(1+0⋅0)) = N(s,b)^0.5 ⋅ 1 = N(s,b)^0.5. However , after action a, if we compute the entropy of the subtree obtained, we would find that the playout to root node of a' was forced to follow a 100% probability path to a single node. In other words, the subtree with a single forced move has the same E(s,a') = 0 as E(s,a) did for the unvisited node.

The effect is clear: if there is an unvisited node that leads to a forced response, visiting that node once has no diminishing effect on how good a candidate it is for a follow up visit. And IMO, nothing could be closer to what our human intuition tells us: it would be absurd to evaluate a position, see that it leaves just one legal response, and not be bothered to consider what that response would be.

4)

This is more a continuation of point #3. Consider: what if instead of there being just one response that was legal, instead there was only one move that doesn't lose instantly. For example, perhaps the initial action corresponds to a pawn attacking a queen, with only one escape square open to the queen. In this case, the response "move queen to safety" won't have a "pure" 100% policy. But it certainly may have a policy that looks like {0.92, 0.01, 0.02, 0.01, 0.03, 0.01}. For once-visited nodes, these policies become the playout probabilities for the next visit; and in this example, the entropy of those probabilities comes out to 0.22.

For those who just read through #3, it should already be evident that this is going to yield a weaker version of what we saw in #3. Instead of the exploration term doubling, ie denominator increasing from (1+0) = 1 to (1+1) = 2, this low entropy will subdue the rise of denominator; it'll increase from 1.0 to 1.22 and a follow-up visit to this queen-attack node will only be slightly less likely than the original visit.

There's actually quite a bit more I'd like to add... For example, I have dreamed up a much simpler metric than entropy that would capture a lot of the same perks - but I'll stop here for a breath and leave those for a follow-up.

Thanks to those who read this all the way through!

oscardssmith commented 5 years ago

That's really cool!! My only question is if variance could achieve the same result as it would be a lot easier to back up

ddobbelaere commented 5 years ago

Very interesting idea.

@oscardssmith As we don't really want to increase the node structure even more, I would propose to first try it out with on-the-fly entropy calculations during each playout and see how that goes.

@kk2796 Do you want to try out to code it yourself, in which case myself or someone else can certainly give you pointers where to add/modify the code, or do you want someone else to do it?

oscardssmith commented 5 years ago

@ddobbelaere variance backup only requires 4 bytes (1 float) per node. On the fly entropy (or variance) will probably be pretty slow.

kk2796 commented 5 years ago

I think it's almost self-evident that a naive "on-demand" approach (computing entropy dynamically each time it is needed) would entail an untenable performance overhead. If some form of local caching can't come to the rescue, it may be necessary to work with less expensive statistical approximations such as variance (which is more conducive to incremental aggregation vs wholesale refresh). That said, though I don't have the mathematical fortitude to prove this, I do feel the "root problem" is one deeply grounded in information theory and that entropy is the right/best metric to use in solving it. Other ideas may work, but my intuition is that their effectiveness would be limited by how close they came to approximating entropy.

@ddobbelaere I'd definitely prefer to leave the coding to others. I'm 99% data scientist and 1% developer. Downloading Leela's source (along with proper compilers and appropriate IDEs) and then personally trying to put this idea into code sounds like utter drudgery. I suspect I would need weeks, perhaps months, to produce something that was even functional. Even then, my resulting code would be deplorable, and, frankly, I would hate every minute of it.

Backing up a step though - I think there's some meaningful work that can be done before any coding changes (much less discussions of tuning/optimizations) are even on the map. Is there a way to run lc0.exe so that it performs a small fixed-node search (say, 100 nodes), and spools out some representation of the search tree after 100 playouts? That is, something which would allow me to see the tree structure (nodes and edges) as well as the value-head and policy-head outputs for each node? I know there are log options which will display statistics for a position, but as far as I know, one only sees the summary stats for moves of the root node (not the entire tree).

If I can start with that, I'd be able to run some simulations showing what the effect of entropy-bias would be on various positions. In theory, I can accomplish 3 things:

Verify that for wide selection of randomly positions, the E(s,a) factor has a negligible effect on U(s,a) most of the time.
Verify that the positions that are affected, and do experience a significant shift in U(s,a) values, are all positions where one would anticipate entropy to play a role. Furthermore, verify that the role entropy plays is to shift U(s,a) in the expected direction.
Run the simulation for a set of hand-selected positions where Leela is prone to blunder... these would preferably be positions that traditional engines spot the blunder with ease, but leela requires great effort to see them.

I think the first two of these would be most meaningful... however, the third would probably generate the most hype for coding up a solution that can be tested "for real", e.g. assessing ELO impact of the entropy-enabled code vs unmodified code in fixed-node competition.

I should also note that during this initial assessment phase, I could play around with other variants of incorporating E(s,a) into U(s,a) formula and/or add constants that allow the effect to be magnified or dampened.

TLDR: If there's a way to run Leela in a debug or verbose mode to print the entire search tree, including policy/value results of each node, please let me know!

AlexGreason commented 5 years ago

adding 4 bytes to each node (assuming it is only needed in nodes, not edges) is about a 2% increase in total RAM usage by my estimate. Not a problem.

ddobbelaere commented 5 years ago

I want to help you with the coding part, but after thinking for a while I found there is still an issue with the proposed formula.

For moves that have only one legal follow-up move, it holds that E(s,a) = 0. The issue is that those moves will completely usurp all the visits at some point as U(s,a) grows without bound.

This problem is so severe it needs to be sorted out first. My first naive thought would be to take E'(s,a) = E(s,a) + E0, with a fixed and yet to be determined constant E0 > 0, but I am curious if there are other, maybe more elegant solutions.

oscardssmith commented 5 years ago

Any chance of doing it with stdev? I think it will likely yield better results, but is also much easier to calculate and backup.

amjshl commented 5 years ago

Sounds like a very interesting concept, though the mathematics of it is not obvious/intuitive to me at all. Should'nt the entropy term be more coupled with Q(s,a)? Moves which are good with low entropy we would want to explore more, while moves which are bad and low entropy should be explored less and less. One way to look at it is if it is a force sequence of winning moves we want to play it while if it is a forced sequence of losing moves we would want to avoid it at all cost.

ddobbelaere commented 5 years ago

@oscardssmith I think that in the case of only one legal follow-up move the standard deviation is also zero (at least for every possible definition I can think of).

@kk2796 Please allow me to straighten the notation in line with the conventions in literature:

During each playout, we want to maximize the upper confidence bound Q(s,a) + U(s,a), with

Q(s,a) the mean action value of the move (expected outcome averaged over values of visited nodes). U(s,a) = cPuct ⋅ P(s,a) ⋅ N(s,b)^0.5 ⋅ 1/(1+N(s,a)) a measure of uncertainty on Q that you propose to modify to cPuct ⋅ P(s,a) ⋅ N(s,b)^0.5 ⋅ 1/(1+N(s,a)⋅E(s,a)).

As an alternative to the original intention of the opening post, I propose to calculate the entropy (or standard deviation if you like that more) over the current visit distribution of the follow-up moves (and not over the upper confidence bound Q(s,a) + U(s,a) distribution). This is much easier to calculate on-the-fly and may be feasible (moreover, it doesn't contain recursion that is present in the original proposition, i.e., that E(s,a) depends on the entropy of the follow-up moves after that, etc.)

kk2796 commented 5 years ago

@ddobbelaere

For moves that have only one legal follow-up move, it holds that E(s,a) = 0. The issue is that those moves will completely usurp all the visits at some point as U(s,a) grows without bound.

Yes, the most extreme effect of E(s,a) is when E(s,a) = 0. Two points I want to make:

The lower the value of E(s,a), the more extreme its effect will be. However, in the most extreme of all cases, E(s,a)=0, the maximum impact would be mathematically identical to considering the node as unvisited. So, while it is true that as long as a node has entropy E(s,a)=0, U(s,a) will "grow without bound", that "boundless growth" is no different from the value of U(s,a) that already occurs for as long as nodes remain unvisited, N(s,a)=0. One might object that the difference is that N(s,a) is bounded by the fact that N(s,a) is guaranteed to climb once U(s,a) grows enough to visit that node, while E(s,a) could remain at 0 even after the node is visited. That takes me to my second point...
A key aspect I think you've overlooked is that E(s,a) represents the entropy of the entire subtree, not just the immediate follow-up (I may start using the term "canopy entropy" to clarify this point). In the already-rare case where white makes a move and black only has one legal follow-up, white will almost certainly have multiple ways to respond to black's follow up, immediately increasing E(s,a) for the subtree to something > 0. There probably have been games that have reached positions where both a response and counter-response are limited to one move, but I suspect one would need to carefully craft any position where E(s,a) =0 after more than two visits.

Put together, 1 and 2 amount to "The most extreme effect of E(s,a)=0 is only as drastic as the effect of N(s,a)=0; and while N(s,a)=0 is guaranteed to grow to N(s,a)>0 with the first visit; E(s,a)=0 is also guaranteed to grow to E(s,a)>0, it may just take 2 or even 3 visits to do so".

I've got spotty availability this morning, but do intend to follow up on other points raised.

kk2796 commented 5 years ago

@ddobbelaere A quick one: I used several sources to better understand the PUCT algorithm (and, I'd hoped, to not sound like an idiot trying to express my idea). Unfortunately, I did not notice that one of those sources, from stanford (really, standford??), misrepresented the upper confidence bound formula. They also did so using a visually pleasing representation; thus their representation is what stuck in my head. However, as lovely a job as they did, I 100% concur that they got it wrong and what you wrote above is actual in line with the AlphaZero paper:

... at = argmaxa(Q(st,a) + U(st,a)) using a variant of the PUCT algorithm (47), U(s,a) = C(s)P(s,a)(N(s)/(1 + N(s,a)) ...

Thanks for pointing out the discrepancy; I'll aim to use the correct nomenclature going forward.

oscardssmith commented 5 years ago

One confusing thing is that the A0 puct formula is different than every other one, so they really shouldn't have called it that.

ddobbelaere commented 5 years ago

@oscardssmith ~~OK, didn't know that A0 convention is actually non-standard :). As the lc0 codebase follows the A0 convention, it's probably still better to use it though.~~ You're probably referring to the term "PUCT" and not to the fact that Q is not included in U.

@kk2796 Thanks for the clarification, it makes more sense now. Now I also understand why nobody was eager to calculate the entropy on the fly :).

I am still annoyed by one thing: Q(s,a) + U(s,a) compared to the Q(s,a') + U(s,a') of all siblings doesn't really represent the probability that a move will be played IMO (in that sense, MCTS is actually very deterministic, it will always pick the move that maximizes Q+U).

Another thing to note (but this is maybe merely a technicality arising from our lc0 convention that -1 <= Q <= 1) is that (Q(s,a) + U(s,a)) / sum(Q(s,a') + U(s,a')) is not what you want it to be (as some terms can be negative), but this can maybe be solved by considering Q+U+1 in the equations instead of Q+U I guess (although I am even less sure about the meaning in terms of probability in this case, but at least it's between 0 and 1 then).

I am still wondering if the entropy over the visit distribution doesn't also make sense. IMO it more corresponds to the probability that a move will be played. Moreover, briefly looking at the backup update equations lets me conclude they are doable (if we use visit distribution) and can be incremental (calculations at each step do not depend on the branching factor)! If we ever do this, we have to check numerical stability of course. I wonder what you or other people think about these considerations.

ddobbelaere commented 5 years ago

After thinking about the issue for quite some time, here are my new ideas:

I still think the issue E(s,a) = 0 with only one string of "one legal move only" moves has not been settled. Imagine the not-so-hypothetical case where we are winning, but there is a move where we stalemate the opponent, then E(s,a) = 0 of this move and it will be chosen at sufficiently high node counts. Also the case of a selfmate in one (where we force the opponent to mate us and he has only one move to do that), albeit less common, is an example.
Note that Q(s,a) of unvisited (in the sense of unexpanded) nodes is not available to us (it involves a NN query). So the original idea should read "Calculate the canopy entropy E(s,a) over all Q+Us of all visited nodes in the tree".
Combining ideas from @oscardssmith and @amjshl (that the entropy is more coupled with Q than with U), I propose yet another adaptation of the uncertainty term. Let S(s,a) denote the standard deviation of all the raw net eval Qs of all visited nodes in the subtree (so not only leaf nodes but also in-between nodes). Note that we can either count multiple terminal visits only once (only calculating the variance over unique nodes) or multiple time. Then I propose to include this standard deviation, which is another source of uncertainty on Q (the other one that is currently present is uncertainty because of low visit count), in the following way:

U(s,a) = cPuct ⋅ P(s,a) ⋅ N(s,b)^0.5 ⋅ 1/(1+N(s,a)) + c' ⋅ S(s,a),

with c' a scaling constant (c' = 0 reduces to the old formula). The variance S^2(s,a) can be efficiently calculated during the backup step (if we want to store both the variance and the mean in each node, for example both as float) with the numerically stable algorithm https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Welford's_online_algorithm, that was pointed to me by @oscardssmith in Discord chat.

Note the (imo) importance of calculating the variance of all the raw net eval Qs. On top of the more easier calculation during the backup step (compared to changing Q's of the in-between non-leaf nodes), it gives us a hint that the raw NN evals fluctuate wildly in the subtree (indicating possible tactical combinations that need to be explored more agressively).

kk2796 commented 5 years ago

@ddobbelaere

So, the more I considered your annoyance at treating the relative values of Q+U as "probabilities of visits", the more annoyed I became with it as well. If nothing in the A0 PUCT algorithm treats Q+U as a probability, that really undermines the justification for insisting upon using "entropy of Q+U probabilities". Or, at least, undermines my pretense that the approach represents some shining beacon of "the exact right metric to use."

On a positive note, this is leaving me 100% on board with going back to the drawing board and looking at this issue and proposed ideas with fresh eyes. In fact - part of the reason I've taken so long to respond is I keep getting halfway through typing a response and finding my ideas evolving to the point that I completely disagree with what I'd just typed. :-)

While I wrestle to find some stable ground in my thinking, I didn't want to draw out the silence any longer. So for now, a couple points I am firm on:

The original idea, and several variants, entail a risk of scenarios where low entropy/stddev/variance/whatever don't only drive Leela to explore that node, but would actually drive her to visit the node so much that it becomes the preferred move. Your stalemate example is a perfect illustration: stalemate has entropy=0, Q=0. Other sibling nodes may have Q=0.1 (but entropy close to 1.0). In this situation, with an ever-growing U, the stalemate would eventually become the move that maximizes Q+U, and eventually take the "visit lead" over its siblings (which are objectively better). This suggests that either the entropy definition (or variance or std dev or whatever) needs special handling for terminal nodes; or that perhaps that we need an alternate way to incorporate E(s,a) into the equation for U(s,a) (such that it can't perpetually negate N(s,a) ). For that matter, just adding as a separate factor on the end, as you propose with c*S(s,a), is probably the cleanest/safest approach of all.
I had alluded earlier to variance/stddev as being "cheaper" metrics to track than entropy, and algorithms like the one you cited were exactly why. I don't know of any way that entropy could be tracked incrementally (each node having a single aggregate value representing "n" data points that can directly incorporate an "(n+1)th" measure). With my original rationale for using entropy now having much less going for it, I think it's probably wise to consider stddev or variance as the default approaches to be tested.

oscardssmith commented 5 years ago

One thing to note is that there's no inherent reason that we have to pick the move with most nodes. If that's the only problem with this, we could switch to Q-U or something as a node selection criteria (which would fix this problem)

ddobbelaere commented 5 years ago

@kk2796 (In response to 2.) I thought I had incremental update formulas for the entropy formulation, but after a second look they are flawed.

I propose to start off with integrating Q-variance tracking in the codebase, such that we have S(s,a) available. In the same "research/test" PR, we could investigate several formulations to include this variance term S(s,a) in the uncertainty term U(s,a). I'll see if I find some time to do this.

Of course we will need lots of testing once the code is ready. We could for example test against a suite of tactical problems at fixed nodes (Maybe there are good tactic suites available in EPD format? Probably @jhortos knows?). There will probably be a lot of tools at our disposal to help us automate (e.g. @killerducky's lc0_analyzer) and we can always ask for help/advice in Discord chat.

Hopefully, it'll improve tactical capabilities and if not, it was a fun ride :).

dEgolf commented 5 years ago

This is a really interesting idea and discussion!

I have an idea for modifying the entropy-based approach above, although I keep finding problems with it. Basically the idea is to replace the 1/(1+N(s,a)) term with 1/(1+N(s,a)(1-R(s,a))), where R is the "remaining relevant exploration" in the subtree associated with a, which trends to 0 as more and more of the (estimated to be) probable parts of the subtree are expanded.

Will hopefully get a chance to write this up properly tonight - just wanted to mention the idea before this discussion moved too far beyond the realm of equation modification.

dEgolf commented 5 years ago

OK, here's the idea in some detail.

My aim was to capture the intuition (from the entropy approach above) that it's more efficient in some sense to calculate a position that is a forced followup to a position than to calculate some other unforced position. Basically, the idea is that these 'follow-up positions' will tend to have higher probability of occurring in the game all else being equal. For example, a position in a forced line stemming from action a_1 will have a higher probability of occurring in the game than a position in one of three equally plausible forced lines stemming from action a_2, assuming that action a_1 and action a_2 are equally probable.

I assume we can estimate the probability that a node will be arrived at in the game (from some base node) by using a product of fractional U+Q values (as described in steps 2 and 3 in the original comment). I term this the "preference" for a node to distinguish it from probability proper.

Instead of looking at the entropy of this "preference", we instead use it to define the concept of “relevant remaining exploration”, abbreviated 'RRExp(s)', and propose using the decay in this metric to discourage exploration. We define RRExp(s) for a node s as: RRExp(s) = pref(s) unexploredPref(s) ...where pref(s) is the preference of node s ...where unexploredPref(s) is the sum of the preference of all (immediate) child states of s corresponding to unexpanded actions at s as a fraction of the total summed preference of the child states of s Intuitively, 'pref(s)' indicates the relevance of exploring at node s to the game, while the unexploredPref(s) indicates the "remaining exploration" to be done at this node s (in terms of expanding direct child nodes).

We now aim to encourage maximizing the expected value of RRExp at the node where we eventually decide to expand (so our expansion is actually useful), modelling the decision of what node to expand at as a random variable. That is, we want to maximize: E[Obtained RRExp(a,s)]= =Expected value of obtained relevant remaining exploration upon selecting action 'a' =\sum_{s' in subtree induced by a} probExpand(s') RRExp(s') ...where probExpand(s') is the probability that the MCTS algorithm chooses to expand at node s'.

I’m not sure how to estimate the probability probExpand(s) of expanding at a state s. As a first attempt one could try: probExpand(s) = (unexploredPref(s))*prod_{s_i: source node -> s} (1-unexploredPref(s_i))
...where the product is conducted along a path from the source node to node s.

In summary, we get the following for the expected value of obtained relevant remaining exploration upon selecting action 'a': E[Obtained RRExp(a,s)]= =\sum{s' in subtree induced by a} probExpand(s') RRExp(s') with maybe probExpand(s) = (unexploredPref(s))*prod{i: source node -> s'} (1-unexploredPref(s'))

The modification to the original equation could then involve changing 1/(1 + N(a,s)) to E[Obtained RRExp(a,s)] or maybe 1/(1 + N(a,s)*(1-E[Obtained RRExp(a,s)])). (Note that The E[Obtained RRExp(a,s)]] becomes large when we want to encourage exploration in the subtree associated with a, which is different than in the entropy case.)

I'm not sure if this really revives the vision of prioritizing visiting states that are obvious followups, and I'm unsure in particular about how well we can estimate probExpand(s).

I do think this avoids falling into the stalemate trap described above, where a stalemating action becomes preferred due to its low subtree entropy. In particular, once the stalemating state is visited, its unexploredPref should become zero (as there are no states following this one that are unexplored), and so further exploring this stalemating state will be discouraged.

kk2796 commented 5 years ago

@dEgolf perhaps another way of framing this is: what depth will the next visit to this node reach? Thinking of scenario with two actions, a and b, with "all things being equal from existing Q and U vantage point" , but a has just 1 or 2 reasonable followups and b has lots of followups. In this case, visits to b will sort of get "hung up" at shallow depth while visits to a should get to deeper levels faster.
The "zero" justification here is that the deeper the move, the closer it is to a game endstate, and thus the more information about the game endstate a visit to a deeper node gives.

Maybe a close-enough idea to what entropy is trying to capture, probably much easier to compute, and definitely avoids stalemate issues.

dEgolf commented 5 years ago

@kk2796 I think that is a way of viewing it; that the aim is to try and get as much information about how the game is likely to go with each node expansion. I would note, however, that the estimated information obtained is not really a function of depth (at least not the way I framed things above). Instead, it corresponds more broadly to how "relevant" we think the expanded node will be, which is basically how likely we think that expanded node is to occur in the game. This does encourage exploring more forcing lines to greater depth as a consequence, though, as the "relevancy" of the nodes in such a line decays more slowly with depth (due to lower branching).

The philosophical connection to entropy I had in mind is that we try to construct a measure of how much information will be obtained by exploring down the subtree associated with action 'a', and use this to weight exploration into this subtree. (Noting that entropy can be viewed as a measure of information).

mooskagh commented 5 years ago

Isn't it just preferring to examine positions with less number of "plausible" moves? Those positions are complex and need hard think (wide tree of variants with similar priors), but it's not the reason to avoid it.

Note that in the end we want to find just a single best move, not maximize confidence of our eval for variety of variations.

If move A is promising but "hard to think" and move B is less promising but almost forcing, I'm not convinced it makes sense to allocate more time into B just to become even more sure that it's less promising.

Am I missing something?

dEgolf commented 5 years ago

@mooskagh The goal in the idea I expressed a few posts above was to maximize the 'relevant information gain' with each node expansion. This doesn't quite correspond to investigating moves that lead to more forcing lines.

We instead aim to expand at nodes that are:

estimated as probable to occur on the board ('relevant')
contain unexplored potential ('information gain') [see the earlier post for proposed equations for these]

As a consequence, if two totally unexplored nodes - say node A and node B - have moves that look equally promising, but we estimate the probability of node B occurring on the board to be higher, then this approach will prefer to perform a node expansion at node B. Notice that this doesn't strictly encourage exploring forcing lines. Once a very forcing line branches sufficiently so that its node probabilities become low enough, then exploring new (less forcing) lines at shallower depths will be preferred.

It should also be noted that the 'information gain' aspect discourages exploring a well explored node further. So, if exploring move A would involve expanding a node with really promising unexplored moves and move B would lead to expanding a node without really promising unexplored moves (with this node having the same estimated probability as the unexplored node following move A), then this will tend to result in further exploration of move A being preferred.

In summary - we're not trying to maximize confidence of our eval for some subset of variations, but instead trying to obtain as much relevant information as possible about the whole set of variations with each node expansion.

Hope this helps clarify things. Thoughts and criticism are very welcome!

wizard2132 commented 5 years ago

Any update on this? I have some ideas I'd like to contribute.

oscardssmith commented 5 years ago

what are they?

mooskagh commented 5 years ago

Size of subtree is node->GetN().

Naphthalin commented 4 years ago

this is definitely an interesting idea, and should be compared to the general properties the tree search should have, see #1231. Also, did anyone ever try to implement this?

LeelaChessZero / lc0

Entropy-guided exploration for better tactical capabilities #780