CuriosAI / sai

SAI: a fork of Leela Zero with variable komi.
GNU General Public License v3.0

Idea: branch from common handicap #50

Open tychota opened 4 years ago

tychota commented 4 years ago

As far as I understand, SAI brings a different concept to dramatically improve the value network in unfair situations.

If we consider an extreme situation like a 9-stone handicap, Leela Zero considers a white move at 1-1 as valuable as one at 6-4, which is insane for a human.

image

The reasoning is simple: since the winrate is 0% for both moves, there is no difference between them. I would say the more moves are played, the worse it gets, since the initial policy prior (e.g. on the 3-3 point) carries less and less weight.

Conversely, in the endgame Leela often plays sloppy moves when she is already far ahead.
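To make the problem concrete, here is a toy illustration with made-up numbers (not real engine output): once every candidate move has 0% winrate, a pure winrate objective cannot rank the moves, while an expected-score signal still can.

```python
# Toy illustration with made-up numbers (not real engine output).
candidates = {
    # move: (win probability, expected score margin for the side to move)
    "1-1": (0.000, -110.0),  # wasted move: still lost, and by even more
    "6-4": (0.000, -100.0),
    "3-3": (0.000, -85.0),   # best practical try: the smallest loss
}

best_by_winrate = max(candidates, key=lambda m: candidates[m][0])
best_by_score = max(candidates, key=lambda m: candidates[m][1])

print(best_by_winrate)  # all winrates are 0, so the choice is arbitrary (here, "1-1")
print(best_by_score)    # "3-3": the move that loses by the least
```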


To solve this issue, SAI (and other improvements on Leela) bring a couple of concepts:

1) the concept of dynamic komi (which requires the winrate to be monotonic in komi): by varying the komi, the actor can trick the engine into a fairer situation and let it make better choices;

2) training not only to win, but to win by the maximum number of points. We can think of KataGo's score-maximization utility (see below), but also of the alpha and beta parameters of SAI (which still look weird to me since, contrary to KataGo, SAI doesn't play every game to the end, and thus I don't see how those parameters are trained, but that is another problem). A rough sketch of how these two ideas connect appears below the figure.

image (Appendix F: Score Maximization of https://arxiv.org/pdf/1902.10565.pdf by @lightvector)
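To show how these two ideas connect, here is a minimal sketch assuming the sigmoid winrate-in-komi model from the SAI paper; the parameter names (alpha, beta) follow that paper, but the code and numbers are illustrative, not SAI's implementation. A fair komi is then just the komi at which the predicted winrate crosses 50%.

```python
import math

def black_winrate(komi_for_white, alpha, beta):
    """Illustrative sigmoid winrate-in-komi curve in the spirit of SAI's
    (alpha, beta) head: alpha ~ predicted score lead for black, beta ~ how
    sharply the result flips around that lead."""
    return 1.0 / (1.0 + math.exp(-beta * (alpha - komi_for_white)))

alpha, beta = 100.0, 0.15      # e.g. 9 handicap stones worth roughly 100 points

print(black_winrate(7.5, alpha, beta))        # ~1.0: black almost surely wins at normal komi
fair_komi = alpha                             # the 50% crossing: komi cancels the lead
print(black_winrate(fair_komi, alpha, beta))  # 0.5: an even evaluation again
```

With such a curve, dynamic komi is just a move along the komi axis until the evaluation becomes informative, and alpha itself is the score estimate that a score-maximizing objective would push on.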

Those improvements have a positive impact on play (endgame, handicap games), but they are also a good way to improve the user experience. For instance, KataGo can output ownership.

image (Figure 3: Visualization of ownership predictions by the trained neural net, from https://arxiv.org/pdf/1902.10565.pdf by @lightvector)


However, that only improves the value head. AFAIK, the policy network is still lost, since those situations are extremely rare in self-play (there is no way white passes 8 times while black plays on the hoshi points).

image (KataGo default policy)

My idea is to take advantage of the branching code to branch into a handicap SGF roughly every 10 games: 1% of games would start with one standard handicap stone, 1% with two handicaps, etc., and the last percent with free placement. The network would then pick an appropriate komi so that the game is fair.
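A hypothetical sketch of what that start sampling could look like (the function and the exact probabilities are mine, not existing SAI code):

```python
import random

def sample_start(rng=random):
    """Roughly one self-play game in ten branches into a handicap start:
    1% of all games per fixed handicap level from 1 to 9, and the last 1%
    with freely placed stones. The compensating komi is left to the engine,
    e.g. the komi at which its predicted winrate is closest to 50%."""
    roll = rng.random()
    if roll >= 0.10:                        # 90%: normal even game
        return {"handicap": 0, "free_placement": False}
    slot = int(roll * 100)                  # 0..9, i.e. 1% of games each
    if slot < 9:
        return {"handicap": slot + 1, "free_placement": False}
    return {"handicap": rng.randint(2, 9), "free_placement": True}

print(sample_start())
```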

I think this may improve the joseki for handicap games.


Note: this may be a crazy idea (we would introduce too much bias), or it may be too risky.

I don't have as much knowledge as you have (I read papers from time to time, looked at the Leela and GoNN code, and did my small experiments on 7x7, but I have no ML background).

Still, it may be interesting :) What do you think?

Vandertic commented 4 years ago

I think the idea is neat. I was starting to think about something similar, but we must consider and discuss it thoroughly.

What does a strong human player do when playing with handicap against a weaker player? It depends on the setting and on the people involved, but basically he should try to keep the game complicated and sometimes trick the opponent. (Strong players, please correct me if you think I am wrong.)

He must not play simple, solid moves, because he is too far behind; he must try something and keep the game uncertain. For example, a 3-3 invasion is generally regarded poorly in handicap games, while it is a good move in even games.

The strong player learned these things while playing with handicap against weaker players. Moreover, he learned that what works and what doesn't depends somewhat on the level and style of the weaker player. He can play on the opponent's weaknesses, and so he should.

And here comes the problem. Suppose we simply start a fraction of the games with handicap stones. SAI can distinguish between moves if we use lambda>0; otherwise it cannot. So suppose we set lambda=0.5. Then it will believe it is 100 points behind (with 9 stones) and will adjust the virtual komi interval to something like [0, 70]. So it will try to win with 0 to 70 komi for white. This will generally select solid moves that settle territory and still lose by points in the end. This way we would train a policy that may not be good for playing with handicap. (It would be good for playing with handicap stones for black and a high komi in favor of white.)
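To make the concern concrete, here is a toy continuation of the winrate-in-komi sketch from earlier in the thread; averaging over the [0, 70] interval is only my illustration of the mechanism, not SAI's exact rule, and the numbers are invented.

```python
import math

def black_winrate(komi_for_white, alpha, beta=0.15):
    # illustrative sigmoid: alpha ~ predicted score lead for black after the move
    return 1.0 / (1.0 + math.exp(-beta * (alpha - komi_for_white)))

candidates = {  # hypothetical predicted black lead after each white move
    "solid move that settles territory": 85.0,
    "complicating, riskier move": 95.0,
}

# Scoring each move over virtual komi 0..70 for white restores an ordering
# between moves that all lose at the true komi -- and it favours the solid,
# territory-settling move, which is exactly the worry above.
for move, alpha in candidates.items():
    white_avg = sum(1.0 - black_winrate(k, alpha) for k in range(0, 71)) / 71
    print(f"{move}: average white winrate over virtual komi 0..70 = {white_avg:.4f}")
```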

Moreover, when playing at full strength (not self-play) with handicap against a weak opponent, SAI would not know that it should try to trick him, because there is a strength difference. It would probably just make the game very complicated.

So in the end, what should we do?

I agree that without handicap self-play games we cannot ever train a policy that knows how to play with handicap.

Maybe we can come up with solutions for reducing the problems I am talking about? For example, what would be the effect if SAI were simulating a weaker agent for its opponent while building the tree? Could this maybe trigger some nice superior plays (and some tricks)? Will it weaken the policy for regular games?

I'd like to hear many ideas and opinions on this matter, but even if no idea is completely convincing, I am willing at some point to do as @tychota suggests. And maybe (I don't know, I haven't thought this through) we could even fork the training and have two different networks.

lightvector commented 4 years ago

Some strong Go players will tell you that the "proper" way to play handicap games, while still trying to keep the game "complicated", does not include trying to trick the opponent, and instead consists only of playing the best normal moves you can think of.

It's of course usually not good to play moves whose only potential value is trickiness and that will lose a lot if the opponent does respond correctly.

But I think that to a partial degree, some of these players are also being deceived by their own brain. The very act of trying to keep the game "complicated" involves some implicit/intuitive model of the likely kinds of mistakes that the opponent may make. Humans do often play moves in handicap games that they instinctively would be far less likely to try against an equal-strength opponent (i.e. that they instinctively would discard if the opponent was equal) - and thereby do trade away slight amounts of "their best understanding of goodness of move" (i.e the understanding that they would use in an even game) in exchange for "greater chance for opponent to mess up". I'd guess it's easy for this to even just be subconscious - your brain suggests the kinds of moves that have given you good results in the past, and for humans a large fraction of their past experience in positions where they were >50 points behind at move 40... is precisely in handicap games against weaker players than them, so the moves that gave good results are the ones that did so against those weaker players. (but also humans are just better than bots in general at modeling opponents that may differ from themselves)

KataGo's empirical results on servers suggest that purely trying to lose by less by playing "score-maximizing" moves with zero effort to exploit the opponent is sufficient to give 1 handicap stone per about 2 ranks. Better dynamic komi and similar tuning (including SAI's methods) could increase that a bit more. But I suspect to get all the way to 1 handicap stone per 1 rank, there will need to be some sort of opponent modeling. I plan to run some self-play test runs containing experiments in this direction soon.

(edit: grammar, fixed one thing that in retrospect I think is actually not true)

lightvector commented 4 years ago

Side note: KataGo training does already play a small percentage of its games on handicap up to 4, with komi compensation to make the game even. Adding this does not prevent early 3-3, at least not purely by itself.

Vandertic commented 4 years ago

Thanks for the insight. I totally agree with you about what handicap play should look like and about the fact that the stronger agent should be able to model the weaknesses of the opponent.

tychota commented 4 years ago

> What does a strong human player do when playing with handicap against a weaker player? It depends on the setting and on the people involved, but basically he should try to keep the game complicated and sometimes trick the opponent. (Strong players, please correct me if you think I am wrong.)

> Some strong Go players will tell you that the "proper" way to play handicap games, while still trying to keep the game "complicated", does not include trying to trick the opponent, and instead consists only of playing the best normal moves you can think of.

(Answered on Discord, copying here for completeness.) I'm not a strong player, but I agree with @lightvector that handicap play is not about trick moves (too risky: if they don't fall for the trick, you lose a lot), but:

For what it's worth, https://senseis.xmp.net/?Handicap#toc9 has a lot of subpages with advice for white, handicap fuseki, and so on.

Hersmunch commented 4 years ago

There were several LZ threads on handicap training that might be worth referring to, e.g. https://github.com/leela-zero/leela-zero/issues/1313#.

Here's a copy of something I put on there that might be of interest:

I stumbled across this AI Safety Gridworlds paper from DeepMind that I think briefly discusses two topics related to what we're interested in, in these sections. With respect to handicap:

> 2.2.2 Distributional Shift. Deep reinforcement learning algorithms are insensitive to risk and usually do not cope well with distributional shifts (Mnih et al., 2015, 2016). A first and most direct approach to remedy this situation consists in adapting methods from the feedback and robust control literature (Whittle, 1996; Zhou and Doyle, 1997) to the reinforcement learning case (see e.g. Yun et al., 2014 - A Unified Framework for Risk-sensitive Markov Control Processes). Another promising avenue lies in the use of entropy-regularized control laws which are known to be risk sensitive (van den Broek et al., 2012; Grau-Moya et al., 2016 - Planning with Information-Processing Constraints and Model Uncertainty in Markov Decision Processes [also see Decision-Making under Bounded Rationality and Model Uncertainty: an Information-Theoretic Approach]). Finally, agents based on deep architectures could benefit from the incorporation of better uncertainty estimates in neural networks (Gal, 2016; Fortunato et al., 2017).

With respect to weaker (or stronger) opponents:

> 2.2.3 Robustness to Adversaries. The detection and exploitation of the environmental intentions has only recently drawn the attention of the machine learning community. For instance, in the context of multi-armed bandits, there has been an effort in developing unified algorithms that can perform well in both stochastic and adversarial bandits (Bubeck and Slivkins, 2012; Seldin and Slivkins, 2014; Auer and Chao-Kai, 2016); and algorithms that can cope with a continuum between cooperative and adversarial bandits (Ortega et al., 2015). These methods have currently no counterparts in the general reinforcement learning case. Another line of research that stands out is the literature on adversarial examples. Recent research has shown that several machine learning methods exhibit a remarkable fragility to inputs with adversarial perturbations (Szegedy et al., 2013; Goodfellow et al., 2014); these adversarial attacks also affect neural network policies in reinforcement learning (Huang et al., 2017a).

So I've had a quick skim of the first few papers mentioned above. A few takeaway thoughts: use a risk estimate/measure during search to adapt the policy based on the observed opponent's behaviour, i.e. the differences between the moves they actually make and the moves we predicted they would make. The opponent's behaviour might be better modelled with an uncertainty parameter. This could allow risk-averse, pessimistic behaviour when more certainly ahead and risk-seeking, optimistic behaviour when behind. I have to say that most of the papers are beyond what I can understand, so please see what you can make of them too.
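A purely hypothetical sketch of that adaptation idea, not taken from any of the papers above (the class, thresholds, and numbers are all invented): keep a running estimate of how strong the opponent looks by comparing each of their moves with the probability our own policy assigned to it, then combine that estimate with the current winrate to set a risk appetite.

```python
class OpponentModel:
    """Invented illustration: track opponent strength from policy agreement
    and map it, together with the current winrate, to a risk appetite."""

    def __init__(self, decay=0.9):
        self.decay = decay
        self.estimate = 1.0          # 1.0 ~ plays like us, 0.0 ~ much weaker

    def observe(self, prob_of_their_move, prob_of_our_best_move):
        # ratio near 1: they chose (near-)top moves; near 0: big mistakes
        ratio = prob_of_their_move / max(prob_of_our_best_move, 1e-9)
        self.estimate = self.decay * self.estimate + (1 - self.decay) * min(ratio, 1.0)

    def risk_appetite(self, winrate):
        # risk-averse when comfortably ahead, risk-seeking when behind,
        # and more willing to gamble against an opponent who looks weaker
        behind = max(0.0, 0.5 - winrate)
        return behind * (1.0 - self.estimate)

model = OpponentModel()
model.observe(prob_of_their_move=0.02, prob_of_our_best_move=0.60)
print(model.estimate, model.risk_appetite(winrate=0.10))
```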

tychota commented 4 years ago

Super interesting @hersmunch

barrtgt commented 4 years ago

How well do supervised networks imitate weaker players? What if you train a supervised network on their games and have a strong net try to maximize points against it while playing with various handicaps?

Vandertic commented 4 years ago

@Hersmunch thank you very much for your references. I happen to know the person who should, and will, read into those papers. Do you agree, @parton69? ;-)

@barrtgt that would be a nice idea. For completeness, I would also use old networks with a noise-perturbed policy and low visits.