Zeta36 / chess-alpha-zero

Chess reinforcement learning by AlphaGo Zero methods.
MIT License

Let's make it practical #32

Open simsim314 opened 6 years ago

simsim314 commented 6 years ago

I'm trying to use your code to train a model, but there are several "show-stopper" issues:

  1. You've already mentioned that two planes are not enough, but is that a show-stopper? I think it's critical.
  2. I saw you're using a two-layer CNN model, while DeepMind uses a very deep network. How complex does the model need to be to be capable of learning chess?
  3. Generating self-play games takes too long. I have a good GPU (GTX 970) and it takes about a minute per game. We need tens of millions of games, but generating 2K games takes 24 hours, so generating all the data would take about 30 years.
  4. Even supervised learning has limitations. The biggest one is loading all the games into memory before GPU optimization. With 8 GB of RAM I'm limited to about 3K games. How about loading 1K games, running about 5 epochs, and then loading new games (see the sketch after this list)? That would allow training on tens of thousands of games.
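
A minimal sketch of that chunked approach, assuming the play data sits in a directory of files and using a hypothetical `load_games(paths)` helper that returns the network inputs and the policy/value targets as numpy arrays (the names, file pattern, and loader are placeholders, not the repo's actual API):

    import glob

    def train_in_chunks(model, data_dir, files_per_chunk=100, epochs_per_chunk=5, batch_size=256):
        """Train on the play data one chunk of files at a time to keep RAM usage bounded."""
        files = sorted(glob.glob(data_dir + "/play_*.json"))
        for start in range(0, len(files), files_per_chunk):
            chunk = files[start:start + files_per_chunk]
            x, policy, value = load_games(chunk)  # hypothetical loader returning numpy arrays
            model.fit(x, [policy, value], epochs=epochs_per_chunk, batch_size=batch_size)
            del x, policy, value  # free memory before loading the next chunk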
Zeta36 commented 6 years ago

Hello, @simsim314 .

  1. @benediamond, @Akababa and I have tried to enlarge the input planes, but so far we have failed to make the model converge with the 177 (or so) planes DeepMind uses. You are welcome to try this yourself; we are still working on it.
  2. We could certainly try different model structures. Maybe this two-layer CNN is even the reason we cannot get convergence with a richer input encoding.
  3. Yes. This is a general issue that DeepMind resolved by using 1,000 TPU cards. A distributed version is already available in our project, but we have not yet started to make use of it.
  4. You could easily adapt the supervised learning to work that way. If you get good results, I will merge your pull request ;).

Regards.

Akababa commented 6 years ago

Hi @simsim314, thanks for sharing your thoughts.

This project is still under active development; just yesterday I wrote a much faster MCTS, and I'll optimize and test it further today. As @Zeta36 said, you're always welcome to contribute and to ask questions here if you need any help.

benediamond commented 6 years ago

@simsim314, in a fork I have implemented:

1) The DeepMind-style 119 input planes (see here).
2) The DeepMind-style NN architecture, with 19 residual layers (see here).
3) ...see Akababa's comment...!
4) Loading only the n most recent play-data files into memory during optimization, on a "rolling" basis, to ease memory consumption (see here).

Unfortunately, in this setup I have failed to achieve convergence, even during supervised learning. Please feel free to help investigate the reason why.

Zeta36 commented 6 years ago

@benediamond, could you try a quick check?

Why don't you add (as @simsim314 suggested) a few more CNN layers to the model, like this:

        # extra Conv2D -> BatchNorm -> ReLU block, channels-first, assuming the Keras imports
        # already used in the model builder (Conv2D, BatchNormalization, Activation, l2)
        x = Conv2D(filters=mc.cnn_filter_num, kernel_size=mc.cnn_filter_size, padding="same",
                   data_format="channels_first", kernel_regularizer=l2(mc.l2_reg))(x)
        x = BatchNormalization(axis=1)(x)
        x = Activation("relu")(x)

before applying the residual blocks?

Maybe the lack of convergence in your NN with so many input planes is due to the current limit of two Conv2D layers in our model configuration.

benediamond commented 6 years ago

@Zeta36 I'm not sure I understand. As it stands, following DeepMind, we already have a residual tower consisting of

1) a convolutional layer containing two filters, followed by 2) 19 residual blocks, each of which contains two convolutional layers of 256 filters.

There are then further convolutions in the policy and value heads.
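
For reference, a single residual block of this kind looks roughly like the following Keras sketch (channels-first data format, 256 filters; this is a generic illustration of the AlphaZero-style block, not necessarily the exact code in the fork):

    from keras.layers import Conv2D, BatchNormalization, Activation, Add
    from keras.regularizers import l2

    def residual_block(x, filters=256, l2_reg=1e-4):
        """Two 3x3 convolutions with batch norm and a skip connection, as in the AlphaZero paper."""
        shortcut = x
        x = Conv2D(filters, 3, padding="same", data_format="channels_first",
                   kernel_regularizer=l2(l2_reg))(x)
        x = BatchNormalization(axis=1)(x)
        x = Activation("relu")(x)
        x = Conv2D(filters, 3, padding="same", data_format="channels_first",
                   kernel_regularizer=l2(l2_reg))(x)
        x = BatchNormalization(axis=1)(x)
        x = Add()([x, shortcut])
        return Activation("relu")(x)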

Let me know if you still think something should be changed.

Zeta36 commented 6 years ago

You are right, @benediamond. We already have two convolutional layers in each residual block. My mistake. I really don't know why neither your model nor mine is able to converge when we introduce new input planes :(.

I could not even get a 14-plane input to converge with your one-hot piece encoding work (??). I wonder if we could at least converge a model with some additional scalar planes (without one-hot encoding), such as current player color, move number, etc., while leaving the piece planes as ord() integer values.

I don't know why, but I have the feeling the problem comes from the one-hot planes.
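
For concreteness, here is a rough sketch of the two encodings under discussion, using python-chess (illustrative only, not the repo's actual converter): the one-hot variant uses 12 binary planes, one per piece type and color, while the ordinal variant packs all pieces into a single plane of small integers.

    import numpy as np
    import chess

    PIECE_ORDER = "PNBRQKpnbrqk"  # 12 piece symbols: white then black

    def one_hot_planes(board):
        """12 binary planes, one per piece type and color."""
        planes = np.zeros((12, 8, 8), dtype=np.float32)
        for square, piece in board.piece_map().items():
            planes[PIECE_ORDER.index(piece.symbol()), square // 8, square % 8] = 1.0
        return planes

    def ordinal_plane(board):
        """A single plane of integer codes (0 = empty square), similar to the ord()-style encoding."""
        plane = np.zeros((8, 8), dtype=np.float32)
        for square, piece in board.piece_map().items():
            plane[square // 8, square % 8] = 1 + PIECE_ORDER.index(piece.symbol())
        return plane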

Akababa commented 6 years ago

@benediamond Have you tried playing with your model yourself to see if it's qualitatively "getting better"? I'm worried about us falling into the trap of mixing validation and training data.

Also, I don't know if the concept of the loss converging to 0 is a sound one, because a) you have regularization, and b) the loss can't go lower than the "Shannon entropy" of the training data, if that makes sense.
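
To make the second point concrete: with soft policy targets, the cross-entropy loss decomposes as H(p, q) = H(p) + KL(p || q), so even a model that predicts the targets perfectly is left with a loss equal to the entropy of the targets, plus the regularization term. A tiny numpy illustration with made-up numbers:

    import numpy as np

    p = np.array([0.7, 0.2, 0.1])    # example policy target (e.g. MCTS visit fractions)
    q = p.copy()                     # a "perfect" prediction

    loss = -np.sum(p * np.log(q))    # cross-entropy equals the entropy of p here, about 0.80 nats
    print(loss)                      # this is the floor; L2 regularization only adds to it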

simsim314 commented 6 years ago

OK, I see everything is there, except that it doesn't work well. From a practical standpoint, I think we should first reach a point where alpha-zero is not giving away its queen or other pieces for free.

How about using an engine to train alpha-zero on its blunders in specific positions instead of whole games? That would reduce the training noise significantly. Once we reach a point where it starts to play reasonably well, we can switch to self-play to improve.
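
A rough sketch of what that could look like with python-chess and a local UCI engine (the engine path, search depth, and centipawn threshold are placeholder assumptions, not anything already in the repo):

    import chess
    import chess.engine
    import chess.pgn

    def blunder_positions(pgn_path, engine_path="stockfish", depth=12, threshold_cp=200):
        """Yield (board, engine_best_move) for positions where the played move lost >= threshold_cp."""
        engine = chess.engine.SimpleEngine.popen_uci(engine_path)
        with open(pgn_path) as f:
            while (game := chess.pgn.read_game(f)) is not None:
                board = game.board()
                for move in game.mainline_moves():
                    best = engine.analyse(board, chess.engine.Limit(depth=depth))
                    best_cp = best["score"].relative.score(mate_score=10000)
                    after = board.copy()
                    after.push(move)
                    reply = engine.analyse(after, chess.engine.Limit(depth=depth))
                    played_cp = -reply["score"].relative.score(mate_score=10000)
                    if best_cp - played_cp >= threshold_cp:
                        yield board.copy(), best.get("pv", [None])[0]
                    board.push(move)
        engine.quit()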

benediamond commented 6 years ago

@simsim314 see the comments on this thread. I'm working on a new version that addresses the "policy flipping" issue; I think Akababa might already have one.

Akababa commented 6 years ago

@simsim314 Good ideas. I'm currently adjudicating games based on material, but as you say it's probably faster to train the bare network on Stockfish outputs or something similar. If you're doing any self-play or evaluation, though, I have a multithreaded MCTS implementation which is much faster. (I will rewrite it in C++ when I get the chance, or maybe someone can help me with this.)

I started a wiki page on supervised methods so we can organize our thoughts

simsim314 commented 6 years ago

I think we should be realistic about our access to good hardware. Google used 5,000 TPUs to generate self-play; I think we can safely assume we will not get anything like that soon. So we should focus on using existing games, and even there make the best of them, because training on 40 million games with a single GPU would currently take years. We probably need to analyze blunders in positions and teach our network to avoid them.

Another point: the MCTS AlphaZero uses runs 80K positions per second on 4 TPUs. This is equivalent to 720 TFLOPS, or about 100-200 strong GPUs. On my GPU it runs ~800 positions per second, and the question is whether it's possible, with such a low simulation count, to get something that plays well, aiming for the strength of decent engines (above 2500).

The alternative would be not to run each line to the end, but only to a point where the evaluation is certain, which might add certainty about the score to our policy.

Akababa commented 6 years ago

I think ideally we want to follow a path similar to Leela Zero: find an architecture that shows good potential (which is arguably applying human heuristics lol), write a robust implementation, and start a distributed effort. It's really quite straightforward, but at the moment we're still trying to fix bugs, validate the models, and wait for more contributors... Tomorrow I'll start looking into a C++ implementation of this project; you're welcome to join in if you want to expedite the process.

lucasart commented 6 years ago

@Zeta36: Regarding speed, I expect the bottleneck is in the gameplay, which is written in Python. I am happy to help you with a minimal C implementation for that. Let me know if you're interested. I think your code is beautiful, and rewriting all of it in C++ is a bad idea, but a C portion just for the speed-critical gameplay seems appropriate.

Zeta36 commented 6 years ago

Yes!! Of course your collaboration is welcome. Please check out the code, and as soon as you get stable results, open a pull request :).