Zeta36 / chess-alpha-zero

Chess reinforcement learning by AlphaGo Zero methods.

Next step #14

Open brianprichardson opened 6 years ago

brianprichardson commented 6 years ago

As you now have a powerful M60 GPU (about 2-3x my GTX 1070), I am wondering what would be the most helpful next step so that I can still contribute.

Zeta36 commented 6 years ago

@brianprichardson, have you read this great work? https://arxiv.org/pdf/1712.01815.pdf

Two planes seem not to be enough information. Also, I'm afraid the MCTS is too slow because it is written in Python; we can't reach enough speed even with a good GPU.

In chess, AlphaZero (the new DeepMind approach in the paper above) outperformed Stockfish after just 4 hours (300k steps). Wow.

apollo-time commented 6 years ago

AlphaZero is great work! Are you implementing AlphaZero now?

Zeta36 commented 6 years ago

@apollo-time, I'm afraid Python is not a suitable language for implementing a fast enough MCTS.

apollo-time commented 6 years ago

What is the main difference between AlphaGo Zero and AlphaZero? Is the MCTS architecture the same?

Zeta36 commented 6 years ago

They used thousands of TPU cards. I'm afraid we would need thousands of years to replicate this work with a single GPU.

brianprichardson commented 6 years ago

@Zeta36 Thanks for the link, which I will study shortly.

There are estimates that the "typical" CPU/GPU compute time for AlphaGo Zero on a regular PC would be about 1,700 years (training that cost Google about $25M, which I suppose is relatively about a buck three-eighty for me).

LeelaZero is using a distributed approach (à la Fishtest) and making slow but steady progress.

BTW, here is another NN chess link: https://github.com/mr-press/DeepChess

I got it to work, but it is slow. Also, the paper is a bit vague about which version of Crafty was used, and the Falcon source has not been provided in this or other papers by these authors. Still, it is a very interesting and promising-looking approach.

Finally, Giraffe has a very nice NN for eval, and Zeta chess (https://github.com/smatovic/Zeta) runs on the GPU, so a combination would also be most interesting.

prusswan commented 6 years ago

Stockfish was playing without an opening book, so it was below its normal strength. But there is still a good chance that AlphaZero would have won anyway, just with less skewed results.

benediamond commented 6 years ago

@Zeta36 @brianprichardson, as you rightly point out, this new paper changes things. My goal will now be to replicate it (the input and output structures of the NN need to be changed).

Nonetheless, even if one could successfully replicate it, would training be feasible on commodity hardware? It looks like perhaps not.

But distributed training will be different in this case... Notice how AlphaZero (see the paper) does not use an evaluation step, so there is no such thing as a "best" model; or rather, the best model is automatically and continuously updated with no evaluation. So the standard method (i.e., read/write from a remote weight configuration) would make less sense here.

(More precisely, it seems like the primary saving of @Zeta36 / LeelaZero's distributed approach is that of evaluating failed models. That is, if "most" candidate models fail, then establishing these failures concurrently will reduce the latency of new generations. But the work of self-play and training must still be duplicated across all clients.)

Perhaps a new approach to distributed training will be required, under which, say, games of self-play (as well as new models) are uploaded as soon as they are completed, and a recent model is downloaded before every new game. Thus the generation of games, and not the training/testing of a successful model, will be the primary bottleneck. We should think about how to update the distributed training regime as we move forward in replicating the contents of this latest breakthrough by Silver, Hassabis, et al.
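As a very rough sketch of what such a client loop might look like (the server address, endpoints, and helper functions below are purely hypothetical, just to illustrate the idea):

```python
# Hypothetical client loop: fetch the latest model, play one self-play game,
# upload it immediately. Endpoint names and helpers are illustrative only.
import requests

SERVER = "http://example.org"  # placeholder address

def client_loop(load_weights, play_one_game):
    while True:
        # 1. Fetch the most recent model before every game (no "best model" gating).
        weights = requests.get(f"{SERVER}/model/latest").content
        model = load_weights(weights)

        # 2. Generate one self-play game locally; this is the expensive step.
        moves, search_policies, result = play_one_game(model)

        # 3. Upload the finished game at once so the server can train on it.
        requests.post(f"{SERVER}/games", json={
            "moves": moves,              # e.g. a list of UCI move strings
            "policies": search_policies, # MCTS visit-count distributions per move
            "result": result,            # +1 / 0 / -1 from the first player's view
        })
```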

I will keep you posted on the efforts to modify the neural network structure following the new paper.

brianprichardson commented 6 years ago

Training on commodity hardware is simply not practical. Even the distributed "crowd" resources of LeelaZero are nowhere close to the Google TPU (note, not GPU) hardware used. See http://zero.sjeng.org/ for instance; it runs with several hundred clients contributing.

The training is not distributed yet, AFAIK; the generation of self-play games is. For LeelaZero, this is what I think happens: the "best model" is downloaded from the server for each game. Many clients play against this model and upload the game results. I have several systems with GPUs, and each move takes between 1 and 4 seconds, so entire games take several minutes. From time to time a batch of games is used for optimization training, and then the new model is evaluated. These last two steps run only on the server, I think, but may become distributed in the future. There is a lot of community activity with improvements and debugging happening, and GCP does new releases pretty often.
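Roughly, as I understand it, those last two server-side steps amount to something like the following (the function names and gating threshold here are placeholders, not actual LeelaZero code):

```python
# Sketch of the server-side "optimize then evaluate" step described above.
# train() and evaluate() are placeholder callbacks, not real LeelaZero APIs.

GATING_THRESHOLD = 0.55  # promote the candidate only if it scores ~55% (assumed value)

def server_step(best_model, game_batch, train, evaluate):
    candidate = train(best_model, game_batch)   # optimization on a batch of self-play games
    score = evaluate(candidate, best_model)     # e.g. the candidate's score over a test match
    if score >= GATING_THRESHOLD:
        return candidate                        # new "best model" is served to clients
    return best_model                           # otherwise keep distributing the old one
```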

I am frankly still trying to understand the overall process. I was able to get a very basic "hello world" chess MCTS working with python-chess and http://mcts.ai/code/python.html. It is very sensitive to the itermax variable. I also have some understanding of basic NNs, but I am still trying to understand putting MCTS and NNs together.
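For reference, my basic setup looks roughly like the following: a minimal plain-UCT search with python-chess and random rollouts (no NN yet). The exploration constant and itermax values are just illustrative, and random rollouts make it painfully slow:

```python
# Minimal plain-UCT MCTS with python-chess and random rollouts (no NN).
import math, random
import chess

class Node:
    def __init__(self, board, parent=None, move=None):
        self.board, self.parent, self.move = board, parent, move
        self.children, self.visits, self.wins = [], 0, 0.0
        self.untried = list(board.legal_moves)

    def ucb1(self, c=1.4):
        return self.wins / self.visits + c * math.sqrt(math.log(self.parent.visits) / self.visits)

def rollout(board):
    # Random playout to the end of the game; returns the result as White's score.
    board = board.copy()
    while not board.is_game_over():
        board.push(random.choice(list(board.legal_moves)))
    return {"1-0": 1.0, "0-1": 0.0}.get(board.result(), 0.5)

def uct_search(root_board, itermax=200):
    root = Node(root_board.copy())
    for _ in range(itermax):
        node = root
        # Selection: descend while the node is fully expanded.
        while not node.untried and node.children:
            node = max(node.children, key=lambda n: n.ucb1())
        # Expansion: add one untried child, if any remain.
        if node.untried:
            move = node.untried.pop()
            next_board = node.board.copy()
            next_board.push(move)
            child = Node(next_board, parent=node, move=move)
            node.children.append(child)
            node = child
        # Simulation + backpropagation, crediting the player who just moved.
        score = rollout(node.board)
        while node is not None:
            node.visits += 1
            node.wins += score if node.board.turn == chess.BLACK else 1.0 - score
            node = node.parent
    # The final move choice is the most-visited child of the root.
    return max(root.children, key=lambda n: n.visits).move

# Example: pick a move from the starting position (slow with random rollouts).
print(uct_search(chess.Board(), itermax=50))
```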

benediamond commented 6 years ago

Ah, so training will happen on the server, and only self-play will occur client-side. Nice. @Zeta36, do you think this would be easy to implement?

@brianprichardson I think the key here is that the NN "guides" the MCTS by attaching to each position:

  1. a policy, that is, a vector of "scores" or "guesses" for each of its respective children, which is used to decide which children to explore, and
  2. a value, that is, an ultimate "score" for the position itself, which is propagated up the tree when that position is encountered as a leaf node.

As the MCTS progresses through its iterations, and more and more values accumulate higher up the tree, they begin to shift the balance, so that exploration behavior that originally depended on the policy "guess" alone now depends more and more on those accumulated values. Finally, a move is chosen based on which child got the most visits (which depends on policy but even more on accumulated values).
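In rough Python, the selection rule and value backup I have in mind look something like this (the node fields and the C_PUCT constant are illustrative, not this repo's actual implementation):

```python
# Sketch of NN-guided selection and backup. Each node stores a prior P from
# the policy head, a visit count N, a total value W, and a mean value Q.
import math
from dataclasses import dataclass, field

C_PUCT = 1.5  # exploration constant (an assumed value, not tuned)

@dataclass
class Node:
    P: float = 1.0            # prior probability from the policy head
    N: int = 0                # visit count
    W: float = 0.0            # accumulated backed-up value
    Q: float = 0.0            # mean value W / N
    parent: "Node" = None
    children: list = field(default_factory=list)

def select_child(node):
    # Early on the P / (1 + N) term dominates, so exploration follows the policy
    # "guess"; as visits accumulate, the Q term takes over.
    sqrt_total = math.sqrt(sum(ch.N for ch in node.children) + 1)  # +1 avoids a zero term
    return max(node.children,
               key=lambda ch: ch.Q + C_PUCT * ch.P * sqrt_total / (1 + ch.N))

def backup(leaf, value):
    # Propagate the value-head estimate from the leaf back to the root,
    # flipping the sign each ply because the players alternate.
    node = leaf
    while node is not None:
        node.N += 1
        node.W += value
        node.Q = node.W / node.N
        value = -value
        node = node.parent
```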

After the game, the a-priori policy guess is bootstrapped by training it against the a-posteriori visit counts, and the value head is trained using the game outcome. Then you generate self-play data again.
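Concretely, the training targets for one finished self-play game could be assembled roughly like this (just a sketch; the names and encodings are made up):

```python
# Build (state, policy target, value target) examples from one finished game:
# the policy target is the normalized root visit counts, the value target is
# the game outcome seen from the side to move at that position.
import numpy as np

def build_training_examples(states, root_visit_counts, game_result):
    """states: encoded board positions, one per ply;
    root_visit_counts: per-ply array of MCTS visit counts over all moves;
    game_result: +1 if the first player won, -1 if lost, 0 for a draw."""
    examples = []
    for ply, (state, counts) in enumerate(zip(states, root_visit_counts)):
        pi = np.asarray(counts, dtype=np.float32)
        pi /= pi.sum()                                      # a-posteriori policy target
        z = game_result if ply % 2 == 0 else -game_result   # outcome from side to move
        examples.append((state, pi, z))
    return examples
```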

In any case, I will keep y'all posted about the NN changes.

Akababa commented 6 years ago

We could just post all our training data publicly, so anyone is free to experiment with different architectures on their own hardware.