LeelaChessZero / lc0

The rewritten engine, originally for tensorflow. Now all other backends have been ported here.

Extracting parts of Lc0's search into classes would help future development #1734

Open Naphthalin opened 2 years ago

Naphthalin commented 2 years ago

As of now, Lc0's search still uses AlphaZero's original PUCT, with a few modifications in WDL + draw score, some certainty propagation, and the general batching strategy. This works decently well, but part of the lack of progress in search can be attributed to how difficult it is not only to test search changes (especially if they affect the scaling behaviour), but to implement them in the first place. I would like to change the latter part with this proposal.

Currently, an experiment in the search code involves:

This effectively renders casual experiments infeasible, and in addition requires a major effort to keep branches mergeable with master, which increases the difficulty of contributing to Leela's search.

I have identified the following aspects of Lc0's search code which IMO would profit from being extracted into their own classes, similar to how timemgr is handled. Experiments in the respective fields would then only require creating the associated class file and registering the new class in the respective file, without touching any of the above files or creating merge complications.

  1. Selection of move to play. Currently, we're using two strategies: most visits, and with temperature. There are other strategies of interest, like sampling according to the NN policy, which are yet to be implemented.
  2. Selection of move to explore. Currently, this uses PUCT, with the addition of drawscore, and some batching logic in multigather and collision scaling. Here, a lot of alternatives are possible (and explored in the literature).
  3. Result propagation. Currently, this follows the general MCTS principle of averaging all outcomes, and calculating these averages in an incremental way. This would need some adaptation for the MCGS implementation at least, and in general when moving away from pure averaging.
  4. Node value recalculation. This doesn't yet exist, but when moving away from averaging (which most MCTS variants, PUCT and MCGS included, rely on) and instead calculating the value of a node from the values of its child nodes, e.g. as RENTS or my own #963 do, this would need to be added, and multiple strategies are possible.

I will create a post for each of the 4 points, (a) collecting the necessary functions for such classes, (b) spelling them out at least in pseudo-code (or copying from the code where available), and (c) listing all parts of the code which need replacement.
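As a rough illustration of the shape this could take (all names here, MoveSelector, MoveSelectorFactory, RootStats, are placeholders and not existing lc0 code), each aspect would get an abstract interface plus a small registry, so that a new strategy lives in one file and one Register() call:

```cpp
#include <functional>
#include <map>
#include <memory>
#include <stdexcept>
#include <string>
#include <vector>

// Placeholder for the per-move information a strategy might need.
struct RootStats {
  int visits = 0;
  float q = 0.0f;       // Averaged value.
  float policy = 0.0f;  // Raw NN policy prior.
};

// Abstract strategy: a new experiment implements this in its own file.
class MoveSelector {
 public:
  virtual ~MoveSelector() = default;
  // Returns the index of the move to play at root.
  virtual int Select(const std::vector<RootStats>& moves) const = 0;
};

// Name -> constructor registry; search code only ever sees the base class.
class MoveSelectorFactory {
 public:
  using Creator = std::function<std::unique_ptr<MoveSelector>()>;
  static MoveSelectorFactory& Get() {
    static MoveSelectorFactory instance;
    return instance;
  }
  void Register(const std::string& name, Creator creator) {
    creators_[name] = std::move(creator);
  }
  std::unique_ptr<MoveSelector> Create(const std::string& name) const {
    auto it = creators_.find(name);
    if (it == creators_.end())
      throw std::runtime_error("unknown selector: " + name);
    return it->second();
  }

 private:
  std::map<std::string, Creator> creators_;
};
```

The same interface-plus-registry pattern would apply to all four points above; only the virtual functions differ.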

Naphthalin commented 2 years ago

1. Selection of move to play.

Current strategies:

Additional strategies:

The class(es) would need to employ the following functions:

This could be done either by looping over the root's child nodes, or by passing an array with all relevant information. A few parameters would need to be handed over from search:
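Whatever the exact parameter list, the two current strategies are simple enough to sketch (illustrative code only, reusing the placeholder RootStats shape from the sketch above):

```cpp
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

struct RootStats {  // Placeholder, as in the earlier sketch.
  int visits = 0;
  float policy = 0.0f;
};

// Most visits: the current default for match play.
std::size_t SelectMostVisits(const std::vector<RootStats>& moves) {
  std::size_t best = 0;
  for (std::size_t i = 1; i < moves.size(); ++i)
    if (moves[i].visits > moves[best].visits) best = i;
  return best;
}

// Temperature: sample proportionally to visits^(1/T). T -> 0 recovers
// most-visits; T = 1 samples proportionally to the visit counts.
std::size_t SelectWithTemperature(const std::vector<RootStats>& moves,
                                  double temperature, std::mt19937& rng) {
  std::vector<double> weights;
  weights.reserve(moves.size());
  for (const auto& m : moves)
    weights.push_back(std::pow(static_cast<double>(m.visits),
                               1.0 / temperature));
  std::discrete_distribution<std::size_t> dist(weights.begin(), weights.end());
  return dist(rng);
}
```

Sampling according to the NN policy would be the same discrete_distribution, just fed with m.policy instead of the visit counts.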

Naphthalin commented 2 years ago

2. Selection of move to explore.

This is probably the hardest of them, but also the most interesting one for future development.
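For orientation, the rule to be replaced is AlphaZero's PUCT with lc0's draw score folded into Q, i.e. score = Q + cpuct * P * sqrt(N_parent) / (1 + N_child). A simplified sketch, ignoring FPU, multigather and collision scaling (struct and parameter names are made up):

```cpp
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

struct ChildStats {  // Placeholder per-edge statistics.
  int n = 0;         // Visit count N(s,a).
  float wl = 0.0f;   // Expected win-minus-loss.
  float d = 0.0f;    // Expected draw probability.
  float p = 0.0f;    // NN policy prior P(s,a).
};

std::size_t SelectChildPuct(const std::vector<ChildStats>& children,
                            int parent_n, float cpuct, float draw_score) {
  const float sqrt_parent = std::sqrt(static_cast<float>(parent_n));
  std::size_t best = 0;
  float best_score = -std::numeric_limits<float>::infinity();
  for (std::size_t i = 0; i < children.size(); ++i) {
    const ChildStats& c = children[i];
    // Draw score folds into the value term: Q = WL + draw_score * D.
    const float q = c.wl + draw_score * c.d;
    // AlphaZero exploration term: U = cpuct * P * sqrt(N_parent) / (1 + N).
    const float u = cpuct * c.p * sqrt_parent / (1.0f + c.n);
    if (q + u > best_score) {
      best_score = q + u;
      best = i;
    }
  }
  return best;
}
```

Any alternative from the literature would then just be another implementation of the same selection interface.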

Current strategies:

Additional strategies:

General remarks:

The class(es) would need to employ the following functions:

Naphthalin commented 2 years ago

3. Result propagation.

Current strategy:

Other strategies:

The class(es) would need to employ the following functions:
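For reference, the incremental averaging described in the opening post amounts to the standard running-mean update along the leaf-to-root path; roughly (node fields illustrative, the real code also tracks M and uses atomics):

```cpp
#include <vector>

struct Node {       // Placeholder; the real node tracks more fields.
  int n = 0;
  float wl = 0.0f;  // Running mean of win-minus-loss, own perspective.
  float d = 0.0f;   // Running mean of draw probability.
};

// Back a fresh evaluation (v, draw) up a leaf-to-root path using the
// standard incremental mean: avg += (x - avg) / n.
void BackupAveraging(const std::vector<Node*>& path_leaf_to_root,
                     float v, float draw) {
  for (Node* node : path_leaf_to_root) {
    node->n += 1;
    node->wl += (v - node->wl) / node->n;
    node->d += (draw - node->d) / node->n;
    v = -v;  // Perspective flips at every ply.
  }
}
```

An MCGS version would have to cope with transpositions, i.e. a node reachable along several paths, which is where purely path-based incremental updates break down.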

Naphthalin commented 2 years ago

4. Node value recalculation.

This doesn't yet exist, with the small exception of reverting twofold draw visits to restore balance to the tree, though there are known side effects in fast TCs (#1466, fix attempt in #1513).

A node recalculation would be necessary in all strategies which can't be fully represented as incremental updates. In general, it means calculating Q/D/M of a node from its children, which effectively allows pruning of branches which turned out bad, and hence also allows exploration strategies which don't converge on mostly exploring the best moves. #963 and the mentioned betamcts-rents branch explore a version of "soft minimax recalculation", which interpolates between pure averaging at low node counts and pure minimax in the limit of infinite nodes, and allows nearly arbitrary forced exploration, as happens in forward/backward analysis.

However, a big caveat is that this can lead to non-incremental changes in Q, which interacts badly with PUCT, as the latter relies on incremental updates for the PUCT score to (semi-)reliably spread visits across moves. This means that neither a recalculation loop nor a replacement for PUCT is likely to yield an improvement without the other; they need to be tested together.
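As a generic illustration of the "averaging at low visits, minimax in the limit" idea (the actual interpolation in #963 is defined there; the blend weight and its schedule below are invented purely for illustration):

```cpp
#include <algorithm>
#include <vector>

struct ChildVal {   // Placeholder child summary.
  int n = 0;
  float q = 0.0f;   // Child's value from the child's own perspective.
};

// Recompute a node's Q from its children: blend the visit-weighted average
// with the best child's (sign-flipped) value. beta = 0 is pure averaging,
// beta = 1 pure minimax; a schedule like beta = n / (n + k) moves smoothly
// from one to the other as the subtree grows.
float RecalculateQ(const std::vector<ChildVal>& children, float beta) {
  long long total_n = 0;
  float weighted_sum = 0.0f;
  float best = -1.0f;
  for (const ChildVal& c : children) {
    if (c.n == 0) continue;
    const float from_parent = -c.q;  // Child's value is the opponent's view.
    total_n += c.n;
    weighted_sum += c.n * from_parent;
    best = std::max(best, from_parent);
  }
  if (total_n == 0) return 0.0f;
  const float average = weighted_sum / total_n;
  return (1.0f - beta) * average + beta * best;
}
```

The caveat above stands regardless of the exact formula: once Q can jump non-incrementally, the selection rule has to be adapted alongside it.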

The class(es) would need to employ the following functions:

danegraphics commented 2 years ago

There are definitely a large number of different paradigms that would be nice to explore and test, and making the code more modular in this way would make doing so much easier.

More specifically, being able to experiment more easily with the search, value calculation, and the integration of the two would help massively.

Currently, by the nature of MCTS, the search is somewhat bound to the assumption that any move worth more exploration is also more worth playing, and vice versa, even though this isn't necessarily true, especially in chess. So the ability to experiment more easily with functions whose aims are akin to those of quiescence, pruning, minimax, and more, without needing to overwrite code for existing strategies in the process, is definitely something we should aim for.

I very much support this proposal.