simsim314 opened this issue 6 years ago
Hello, @simsim314.
Regards.
Hi @simsim314, thanks for sharing your thoughts.
This project is still under active construction, just yesterday I wrote a much faster MCTS, and I'll optimize and test it more today. As @Zeta36 said, you're always welcome to contribute and ask questions here if you need any help.
@simsim314, in a fork I have implemented:
1) The DeepMind-style 119 planes of input (see here).
2) The DeepMind-style NN architecture, with 19 residual layers (see here).
3) ...see Akababa's comment...!
4) Loading only the n most recent play-data files into memory during optimization, on a "rolling" basis, to ease memory consumption (see here; sketch below).
Unfortunately, in this setup I have failed to achieve convergence, even during supervised learning. Please feel free to help investigate the reason why.
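For point 4, the "rolling" loading amounts to something like this sketch (the file layout, extension, and function name are illustrative assumptions, not the fork's exact code):

```python
import os
import json

def load_recent_play_data(data_dir, n):
    # keep only the n most recently written play-data files in memory ("rolling" window)
    files = sorted(
        (os.path.join(data_dir, f) for f in os.listdir(data_dir) if f.endswith(".json")),
        key=os.path.getmtime,
    )
    positions = []
    for path in files[-n:]:
        with open(path) as fh:
            positions.extend(json.load(fh))
    return positions
```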
@benediamond, could you try a quick check?
Why don't you add to the model (as @simsim314 said) some more CNN layers like this:
```python
from keras.layers import Conv2D, BatchNormalization, Activation
from keras.regularizers import l2

# an extra convolutional block (Conv2D -> BatchNorm -> ReLU); mc is the model config
x = Conv2D(filters=mc.cnn_filter_num, kernel_size=mc.cnn_filter_size, padding="same",
           data_format="channels_first", kernel_regularizer=l2(mc.l2_reg))(x)
x = BatchNormalization(axis=1)(x)
x = Activation("relu")(x)
```
before applying the residual blocks?
Maybe the lack of convergence in your NN with so many feed planes is due to the current limitation to two Conv2D layers in the configuration of our models.
@Zeta36 I'm not sure I understand. As it stands, following DeepMind, we already have a residual tower consisting of
1) a convolutional layer containing two filters, and
2) 19 residual blocks, each of which contains two convolutional layers of 256 filters each.
There are then further convolutions in the policy and value heads.
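For reference, one residual block of this kind looks roughly like the following Keras sketch (the filter count and regularization constant are DeepMind-style defaults, not values copied from the repo):

```python
from keras.layers import Conv2D, BatchNormalization, Activation, Add
from keras.regularizers import l2

def residual_block(x, filters=256, l2_reg=1e-4):
    # two 3x3 convolutions with batch norm and a skip connection
    shortcut = x
    x = Conv2D(filters, 3, padding="same", data_format="channels_first",
               kernel_regularizer=l2(l2_reg))(x)
    x = BatchNormalization(axis=1)(x)
    x = Activation("relu")(x)
    x = Conv2D(filters, 3, padding="same", data_format="channels_first",
               kernel_regularizer=l2(l2_reg))(x)
    x = BatchNormalization(axis=1)(x)
    x = Add()([shortcut, x])
    return Activation("relu")(x)
```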
Let me know if you still think something should be changed.
You are right, @benediamond. We already have two convolutional layers in each residual block. My mistake. I don't really know why neither your model nor mine is able to converge when we introduce new feed planes :(.
I could not even get a 14-plane input to converge with your one-hot piece encoding (??). I wonder if we could at least converge a model with some more linear planes (without one-hot encoding), like current player color, move number, etc., while leaving the piece planes as ord() integer values.
I don't know why, but I've got the feeling the problem comes from the one-hot planes.
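Just to make the two encodings concrete, something along these lines (the piece ordering and board representation are made up for illustration):

```python
import numpy as np

PIECES = "PNBRQKpnbrqk"  # assumed ordering of the 12 piece types

def onehot_planes(board):
    # board: 8x8 list of single-character piece codes, '' for empty squares
    planes = np.zeros((12, 8, 8), dtype=np.float32)
    for r in range(8):
        for c in range(8):
            if board[r][c]:
                planes[PIECES.index(board[r][c]), r, c] = 1.0
    return planes

def ord_plane(board):
    # a single plane of raw ord() integer values, as in the current model
    plane = np.zeros((1, 8, 8), dtype=np.float32)
    for r in range(8):
        for c in range(8):
            if board[r][c]:
                plane[0, r, c] = ord(board[r][c])
    return plane
```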
@benediamond Have you tried playing with your model yourself to see if it's qualitatively "getting better"? I'm worried about us falling into the trap of mixing validation and training data.
Also I don't know if the concept of losses converging to 0 is a sound one, because a) you have regularization and b) it can't be lower than the "shannon entropy" of the training data, if that makes any sense.
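Writing out the generic AlphaZero-style objective (not the exact loss code in this repo; π and z are the policy/value targets, p_θ and v_θ the network outputs, c the L2 coefficient):

$$\mathcal{L}(\theta) = \underbrace{\mathrm{CE}(\pi, p_\theta)}_{\ge\, H(\pi)} + \mathrm{MSE}(z, v_\theta) + c\,\lVert\theta\rVert_2^2$$

so even a perfectly fitted network bottoms out at roughly the entropy of the policy targets plus the regularization term, not at 0.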
OK, I see everything's there, except that it doesn't work well. I think from a practical standpoint we should reach a point where alpha-zero is not giving away its queen or pieces for free.
How about using some engine to train alpha-zero on its blunders in specific positions instead of on whole games? This would reduce the training noise significantly. Once we reach a point where it starts to play reasonably well, we can use self-play to improve.
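A minimal sketch of that blunder-mining idea, assuming python-chess and a local Stockfish binary (the engine path, search depth, and centipawn threshold are placeholders):

```python
import chess
import chess.engine

def is_blunder(engine, board, move, threshold_cp=150, depth=12):
    # evaluate the position before and after the move, from the mover's point of view;
    # a large drop in score flags the move as a blunder worth training against
    limit = chess.engine.Limit(depth=depth)
    before = engine.analyse(board, limit)["score"].relative.score(mate_score=10000)
    board.push(move)
    after = -engine.analyse(board, limit)["score"].relative.score(mate_score=10000)
    board.pop()
    return before - after > threshold_cp

engine = chess.engine.SimpleEngine.popen_uci("/usr/bin/stockfish")  # placeholder path
```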
@simsim314 see the comments on this thread. I'm working on a new version that addresses the "policy flipping" issue; I think Akababa might already have one.
@simsim314 Good ideas, I'm currently adjudicating games based on material but as you say it's probably faster to train the naked network on stockfish outputs or something. If you're doing any self play or evaluation though I have a multithreaded MCTS implementation which is much faster. (I will rewrite in C++ when I get the chance, or maybe someone can help me with this)
I started a wiki page on supervised methods so we can organize our thoughts
I think we should be realistic about our access to good hardware. Google used 5,000 TPUs to generate self-play; I think we can safely assume we will not get anything like that soon. So we should focus on using existing games, and even there make the best of them, because running on 40 million games with a single GPU would currently take years. So we probably need to analyze blunders in positions and teach our network to avoid making them.
Another point: the MCTS that AlphaZero uses runs 80K positions per second on 4 TPUs. This is equivalent to about 720 TFLOPS, or roughly 100-200 strong GPUs. On my GPU it runs ~800 positions per second, and the question is whether it's possible, with such a low simulation count, to get something that plays well, aiming to play as well as some engines (above 2500).
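As a rough back-of-the-envelope check of those numbers (the per-device throughput figures below are assumptions, not measurements):

```python
tpu_tflops = 180   # assumed per-TPU figure implied by the 720 TFLOPS estimate above
num_tpus = 4
total_tflops = tpu_tflops * num_tpus   # 720 TFLOPS

gpu_tflops = 5     # assumed throughput of one strong consumer GPU
print(total_tflops / gpu_tflops)       # ~144, i.e. in the "100-200 GPUs" range
```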
The alternative to relying on so few simulations would be to run each line not to the end, but only to a point where the evaluation is certain, thus perhaps adding certainty about the score to our policy.
I think ideally we want to follow a path similar to Leela Zero: find an architecture that shows good potential (which is arguably applying human heuristics lol), write a robust implementation, and start a distributed effort. It's really quite straightforward, but at the moment we're still trying to fix bugs, validate the models, and wait for more contributors... Tomorrow I'll start looking into a C++ implementation of this project; you're welcome to join in if you want to expedite the process.
@Zeta36: Regarding speed, I expect the bottleneck is in the gameplay, which is written in Python. I am happy to help you with a minimal C implementation for that. Let me know if you're interested. I think your code is beautiful, and rewriting all of it in C++ is a bad idea, but a C portion just for the speed-critical gameplay seems appropriate.
Yes!! Of course your collaboration is welcome. Please check out the code, and as soon as you get stable results, open a pull request :).
I tried to use your code to train a model, but there are several show-stopping issues: