LeelaChessZero / lc0

The rewritten engine, originally for TensorFlow. All other backends have now been ported here.
GNU General Public License v3.0

Discussion of GoNN features #345

Open · gonzalezjo opened this issue 6 years ago

gonzalezjo commented 6 years ago

GoNN is a sandbox for agents similar to the Leela projects. Notable ideas tested in GoNN, at the time of creating this issue, are the following:

"Cosmetic" Performance
Training with opponent strength plane, against opponents of various strengths. Batch norm gammas tweaking(?)
Removing history information from 10% of training games, to improve analysis utility. Adding specially optimized residual blocks, tailored toward detecting ladders in Go.
Empty Training on ladders
Empty Global pooled properties
Empty Parametric ReLUs
Empty Dilated convolutions
Empty Chain pooling (Replaced by dilated convolutions)
Empty Increasing learning rates on center weights.
Empty "Wide Low-Rank Residual Blocks"
Empty Experiments with different loss functions

Note that some of these ideas were unsuccessful. Ideas that users in the Discord have found worth considering for Leela Chess Zero have been bolded. If there is anything here that you would like bolded or added to the list, please reply; I will make changes ASAP.

gonzalezjo commented 6 years ago

To kick off discussion:

Training with an opponent strength plane doesn't seem like something we can do easily, since we're not doing supervised learning. However, it enables opponent modeling.

Removing history information from a certain percentage of games could enable analysis from FENs, which is something a lot of people have requested on Discord. I don't know if this could have adverse effects, but it seems like it could be a safe option, based on GoNN's results.

I don't understand the thing about batch norm gammas, so I'm going to avoid making a fool of myself by commenting on it ;)

Adding specially optimized residual blocks might hold promise, if someone finds a way to apply them to chess. I can't come up with anything. :man_shrugging:

Training on special positions, like ladders, has been discussed. Crem brought up the idea of training specifically on tactics. The idea failed in Go. However, adding an input feature to detect ladders was somewhat successful. I'm not sure what the implications are for chess.

Global pooled properties were found interesting by Cyanogenoid, Exa, Hastur the Unspeakable, myself, and probably others, for a variety of reasons. They seem to address the issue of letting Leela learn highly chess-specific knowledge while still remaining zero.
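For readers unfamiliar with the technique, here is a minimal PyTorch sketch of the general idea (not GoNN's actual implementation; all names and sizes are illustrative): a subset of channels is pooled over the whole board, and the pooled values bias the remaining channels, so global facts become visible to every square in a single layer.

```python
import torch
import torch.nn as nn

class GlobalPoolingBias(nn.Module):
    """Pool some channels over the whole board and use the result to bias
    the remaining channels, making global facts visible to every square."""
    def __init__(self, channels: int, pool_channels: int = 32):
        super().__init__()
        self.pool_channels = pool_channels
        # mean- and max-pooling each pooled channel doubles the feature size
        self.fc = nn.Linear(2 * pool_channels, channels - pool_channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, C, 8, 8)
        g, rest = x[:, :self.pool_channels], x[:, self.pool_channels:]
        pooled = torch.cat([g.mean(dim=(2, 3)), g.amax(dim=(2, 3))], dim=1)
        return rest + self.fc(pooled)[:, :, None, None]  # broadcast over board
```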

Parametric ReLUs were found notable by Exa. Matthew Lai, of Giraffe and AlphaZero fame, stated that they "slightly improve final accuracy on some problems, but make training much slower." I'm not sure if GoNN has any data about either claim, but the performance of activation functions definitely seems to depend a lot on the context, so their results may not cross over to chess anyway. Notable alternatives are PELUs, CReLUs, and SELUs.
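For concreteness, a PReLU differs from a ReLU only in having a learned negative slope; in PyTorch terms (illustrative, not Leela's code):

```python
import torch.nn as nn

# ReLU tower block vs. a PReLU one; PReLU learns a per-channel negative
# slope a, so f(x) = x for x > 0 and f(x) = a * x otherwise.
relu_block  = nn.Sequential(nn.Conv2d(64, 64, 3, padding=1), nn.ReLU())
prelu_block = nn.Sequential(nn.Conv2d(64, 64, 3, padding=1),
                            nn.PReLU(num_parameters=64))
```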

Dilated convolutions might be useful for dealing with x-ray attacks, particularly discovered checks. They help propagate information from further across the board, addressing a potential shortcoming of 3x3 filters. (Noted by Hastur the Unspeakable and me on Discord.)
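As a sketch of why this helps with long-range effects: dilating a 3x3 convolution spreads its nine taps further apart, widening the receptive field of a single layer without adding weights (illustrative PyTorch, not lc0 code):

```python
import torch.nn as nn

# Both layers have nine weights per channel pair, but the dilated one
# samples squares two apart, giving a 5x5 receptive field from one layer;
# padding is chosen so both preserve the 8x8 board shape.
local_conv   = nn.Conv2d(64, 64, kernel_size=3, padding=1)
dilated_conv = nn.Conv2d(64, 64, kernel_size=3, padding=2, dilation=2)
```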

Chain pooling was found to be effective, but dilated convolutions were found to be even more effective, so no comment here.

Increased learning rates on center weights is something I am incapable of discussing.

Likewise, I am unable to discuss "wide low-rank residual blocks."

Two loss functions were tested: cross-entropy and L2 loss. We use cross-entropy loss, I believe. GoNN found very little performance boost from switching loss functions. What GoNN did find, however, was this: "by manual inspection the cross-entropy neural nets seem to be much more willing to give more 'confident' predictions." In other words, the winrate estimates from the cross-entropy networks were substantially more confident. This is mostly cosmetic, it seems, but I found it interesting, given Leela's fame for giving freakishly high evals. I have no idea how that works, though :stuck_out_tongue:
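To make the comparison concrete, here is a hedged sketch of the two value-head losses, assuming a tanh-style value output in (-1, 1) and game results mapped to the same range (an illustration, not either project's training code):

```python
import torch
import torch.nn.functional as F

def l2_value_loss(z: torch.Tensor, r: torch.Tensor) -> torch.Tensor:
    """L2 loss directly on the tanh value output z and result r in [-1, 1]."""
    return F.mse_loss(z, r)

def ce_value_loss(z: torch.Tensor, r: torch.Tensor) -> torch.Tensor:
    """Cross-entropy on win probability: rescale both to [0, 1] first."""
    p = ((z + 1) / 2).clamp(1e-7, 1 - 1e-7)  # predicted win probability
    q = (r + 1) / 2                           # result as a soft target
    return F.binary_cross_entropy(p, q)
```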

lightvector commented 6 years ago

Just a quick comment about a few things:

PReLUs - I stopped using them. I accidentally left this out of my most recent update, but the mention of it here reminded me. I updated the page now: https://github.com/lightvector/GoNN#update---parametric-relu-instability-in-value-head-jul-2018.

I'm not sure why and have not dug further into it, but I noticed that once I added a value head, parametric ReLUs seemed to be contributing to some instability in the value head training, causing the epoch-to-epoch performance of the value head on the validation set to bounce around quite a lot more; the improvement, if any, was also much less. So for now, I've stopped using PReLUs.

Dilated convolutions - These have always been a toss-up: in all test cases they seemed to help the neural net with long-range interactions in Go, but they also did not noticeably improve (or harm) the policy loss. Presumably there is some tradeoff where the neural net becomes slightly worse at some local patterns due to having fewer non-dilated channels. I have not tested whether they improve playing strength in Go overall, but for now I'm still using them anyway.

Global pooling I'm still very happy with.

gonzalezjo commented 6 years ago

Wow, I didn’t know you looked at lc0! Thanks a lot for replying. I guess the PReLU information matches up with what Matthew Lai said.

Regarding dilated convolutions, I'm curious what you think their effect would be in chess. I'm ambivalent, since Leela did seem to have issues with discovered checks and x-rays, but then, those problems mostly went away with more filters. So I'm left wondering what their benefits could be in chess, with its smaller board. Intuition suggests that Go, with its board size, benefits much more from dilated convolutions.

Thanks.

lightvector commented 6 years ago

No idea if it would help things. I don't see why you can't just try them, though. Take a fixed window of games, then freshly train two or three neural nets on that window with the current architecture and two or three with some of the channels converted to dilated convolutions. (Two or three of each reduces the chance that a single bad training run due to noise skews the comparison; training both sides fresh, instead of reusing the existing series of improving nets, keeps the comparison apples-to-apples.) Compare their average policy and value loss, and also run a bunch of matches between the two sides to see if one side is detectably stronger. You can also manually run some test and tactics positions and see by hand whether the policy head better understands discovered attacks.
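In pseudo-harness form, the protocol described above might look like this (a sketch; all function names are hypothetical):

```python
import statistics

def average_losses(train_fn, window, n_runs=3):
    """Train n_runs fresh nets on the same fixed window of games and average
    their validation losses, so one noisy run can't decide the comparison."""
    runs = [train_fn(window) for _ in range(n_runs)]
    return (statistics.mean(r["policy_loss"] for r in runs),
            statistics.mean(r["value_loss"] for r in runs))

# Hypothetical usage, where train_baseline / train_dilated each train a
# fresh net and return its validation losses as a dict:
#   baseline = average_losses(train_baseline, window)
#   dilated  = average_losses(train_dilated, window)
# ...then also run head-to-head matches before drawing conclusions.
```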

Now that I have MCTS working, it's on my TODO list to go back, drop dilated convolutions, and see what the effect is on the actual strength of play in Go; I'll be doing that in the coming month.

gonzalezjo commented 6 years ago

Great to hear! Let us know how it goes.

Naphthalin commented 4 years ago

@lightvector can you maybe give an update on this issue? Is this still relevant for KataGo, and do you think it applies to any challenges Lc0 is still facing today?

lightvector commented 4 years ago

Unsurprisingly for experimental research, a lot of these ideas didn't survive for KataGo, but a few of them did.

**Global pooling** Still a massive benefit. I have no doubt that it's one of the things that helps KataGo perform well in positions like https://lifein19x19.com/viewtopic.php?f=15&t=17448 where a single global pooling structure is sufficient to inform the entire board that ko threats of a given size are important, and to globally count up how many there are, or that a particular kind of urgent situation is happening so that the relative value of sente vs gote sequences is changed, or to compute that one player is ahead or behind by a certain amount so that moves should either be more safe or seek to make more trouble, etc. All of these things would normally take a minimum of "boardsize"- or "boardsize*2"-many layers, and would be harder to make behave uniformly/linearly; now they can happen in roughly 2 or 3 layers. Squeeze-excite has a similar role, but I have been unable to write an implementation of it that is as computationally efficient as the current global pooling - SE adds a lot of sequential overhead due to fitting between blocks and having its little mini network, which I've found hard to optimize and which slows down the network, whereas global pooling blocks I've been able to include with barely any overhead.
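For comparison, here is a textbook squeeze-excite block in PyTorch (a standard version, not KataGo's implementation); the small two-layer MLP squeezed between blocks is the sequential overhead described above:

```python
import torch
import torch.nn as nn

class SqueezeExcite(nn.Module):
    """Textbook SE block: global-average-pool ("squeeze"), a two-layer
    bottleneck MLP, then a sigmoid gate rescaling each channel ("excite")."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // reduction)
        self.fc2 = nn.Linear(channels // reduction, channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        s = x.mean(dim=(2, 3))                           # (B, C): squeeze
        s = torch.sigmoid(self.fc2(torch.relu(self.fc1(s))))
        return x * s[:, :, None, None]                   # per-channel gate
```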

**History masking** KataGo still trains with a few percent of samples randomly masking history, so that good play is assured for SGFs that set an initial position directly without history, or for the JSON analysis engine, which is also capable of specifying the same. This is taken advantage of in GTP as well for conservativePass, an option to make KataGo only pass when there are no "useful" moves left, even if passing would win on the spot under Tromp-Taylor rules. This is sometimes necessary for online servers, because when the server doesn't use Tromp-Taylor, passing in response to an opponent pass can be a game-losing move even if under Tromp-Taylor it would win. In general, having a guarantee of network sanity without history adds useful flexibility.
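A hedged sketch of what such masking could look like in a training data pipeline (the plane layout and probability here are assumptions, not KataGo's actual code):

```python
import torch

HISTORY_PLANES = slice(4, 20)  # hypothetical: which input planes hold history

def maybe_mask_history(planes: torch.Tensor, p: float = 0.05) -> torch.Tensor:
    """With probability p, zero out the move-history planes of one training
    sample, so the trained net also plays sanely from bare positions."""
    if torch.rand(()).item() < p:
        planes = planes.clone()
        planes[HISTORY_PLANES] = 0.0
    return planes
```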

**PDA** Training with an "opponent strength plane" has sort of found its way back in as KataGo's "playoutDoublingAdvantage" training, where a small proportion of games are played with asymmetric playouts on each side, and the network is told the "playoutDoublingAdvantage" it has, defined as log_2(own playouts / opponent playouts). The network learns specifically what kinds of whole-board positions and what kinds of moves are more or less risky given a large differential in the strength of the two players. Feeding the network a positive playoutDoublingAdvantage gives an enormous strength boost against weaker players in handicap games, and a negative value gives an enormous strength boost against stronger players in handicap games, compared to not having it. And as far as I can tell, it doesn't harm the even-game strength to any measurable degree.
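Concretely, the extra input could be as simple as a constant plane (a sketch; the actual encoding in KataGo may differ):

```python
import math
import torch

def pda_plane(own_playouts: int, opp_playouts: int, board_size: int = 19):
    """A constant input plane carrying playoutDoublingAdvantage =
    log2(own playouts / opponent playouts)."""
    pda = math.log2(own_playouts / opp_playouts)
    return torch.full((1, board_size, board_size), pda)

# e.g. 1600 own vs 400 opponent playouts -> a plane filled with +2.0
```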

This method is also highly overfitting-resistant. Unlike training against any fixed weaker opponent, there should be no risk in equilibrium that the network finds a fragile way to exploit that opponent that doesn't generalize. The opponent is itself with fewer playouts, so if it ever does find a reliable exploit, then becoming good enough at that exploit will also cause itself-with-fewer-playouts to learn that it is losing and avoid it, and then the exploit will stop being reliable. There is still room to do even better, but I feel that this simple and robust method is already good enough to cover a good chunk of what's possible in improving handicap play without actually doing any opponent modeling or more intricate heuristics.

**Batch norm** Well, actually, this isn't an idea that survived - it's an idea that notably didn't survive. KataGo no longer uses batch norm at all, so any discussion of batch norm gammas, the need for batch renorm to avoid differences between training and inference statistics, etc. is all irrelevant now - these problems are avoided because there isn't batch norm in the first place.

The key paper that enabled this is https://arxiv.org/abs/1901.09321 - using this paper's initialization scheme plus a minor bit of gradient clipping to avoid network blowup enabled stable and efficient training without batch norm, and backprop is actually faster and cheaper too, which is a nice bonus even if it's not as critical as optimizing the self-play side. The current level of clipping in the ongoing run clips almost nothing (something like 1/50000 of the total magnitude of gradient vectors, so it's probably hitting something like two or three out of every 50000 batches), but is still enough to ensure stability.

I used the new initialization scheme for blocks, plus the recommended multiplier and bias layers (edit: only some of the bias layers - I only added bias layers in the spots where batch norm used to be, fulfilling the role the batch norm beta played). I did not have to add anything like Mixup as in the paper; just the tweaks to the architecture were enough to get good results, surpassing what I had before with batch norm.
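A rough PyTorch sketch of the Fixup-style recipe for a two-conv residual branch (the attribute names are hypothetical, and this is my summary of the paper's scheme, not KataGo's code):

```python
import torch.nn as nn

def fixup_init(blocks, num_blocks: int):
    """Sketch of Fixup-style initialization (arXiv:1901.09321) for residual
    branches with two convs each: scale the first conv's init down by
    num_blocks**-0.5 and zero the last conv, so every residual branch
    starts out as (nearly) the identity."""
    for block in blocks:
        nn.init.kaiming_normal_(block.conv1.weight)
        block.conv1.weight.data.mul_(num_blocks ** -0.5)
        nn.init.zeros_(block.conv2.weight)

# During training, mild gradient clipping guards against rare blowups, e.g.:
#   torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=large_value)
```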

Naphthalin commented 4 years ago

Thanks for the write-up, I really appreciate it :) Right now the moves left head (which is the chess counterpart to your score-augmented utility) is of course the focus for Lc0, but especially the PDA (and hopefully the BN stuff as well...) might be of good use to us in the future.