Zeta36 / chess-alpha-zero

Chess reinforcement learning by AlphaGo Zero methods.
MIT License

Data format? #26

Open Akababa opened 6 years ago

Akababa commented 6 years ago

Could someone write a quick documentation of the input planes? Here's what I think it is:

- The last 8 board positions, each one 8x8x12
- Current state, also 8x8x12
- Side to move, 8x8 constant
- Move number, 8x8 constant
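For concreteness, here's a minimal sketch (not the repo's actual code) of how one such frame could be built with python-chess and numpy; the function name `board_to_planes`, the plane ordering and the scaling are assumptions for illustration, and the full input would stack 8 of these frames:

```python
import chess
import numpy as np

PIECE_ORDER = [chess.PAWN, chess.KNIGHT, chess.BISHOP,
               chess.ROOK, chess.QUEEN, chess.KING]

def board_to_planes(board: chess.Board) -> np.ndarray:
    """Encode a single position as 12 one-hot piece planes plus 2 constant planes."""
    planes = np.zeros((14, 8, 8), dtype=np.float32)
    for color_offset, color in enumerate((chess.WHITE, chess.BLACK)):
        for i, piece_type in enumerate(PIECE_ORDER):
            for sq in board.pieces(piece_type, color):
                planes[6 * color_offset + i,
                       chess.square_rank(sq), chess.square_file(sq)] = 1.0
    planes[12, :, :] = 1.0 if board.turn == chess.WHITE else 0.0  # side to move
    planes[13, :, :] = board.halfmove_clock / 50.0                # no-progress counter (scaling is arbitrary)
    return planes
```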

I think the move number is a typo: it's using the halfmove clock instead of the 50-move-rule counter (which I assume is the intention). We also don't strictly need the side to move, because we can flip the board and invert the colors so that the side to move is always on the bottom with its king on the right. Alternatively, we could augment the dataset 2x by applying this transformation, but I think the dimensionality reduction with a 2x learning rate is at least equivalent (and probably better). (It doesn't work for Go because of the 7.5 komi rule.)
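As a sketch of that color-flipping idea (assuming python-chess, not this repo's code): `Board.mirror()` flips the position vertically and swaps the colors, castling rights, en passant square and side to move, so a "side to move always on the bottom" representation can be as simple as the helper below. The move/policy targets would of course have to be mirrored with the same transformation.

```python
import chess

def normalize_side_to_move(board: chess.Board) -> chess.Board:
    """Return an equivalent position in which it is always White to move.

    If it is Black to move, mirror the board vertically and swap colors;
    Board.mirror() also swaps castling rights, the en passant square and
    the turn, so the result is the same game state modulo color.
    """
    return board if board.turn == chess.WHITE else board.mirror()
```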

I think we're also missing castling.

Another idea: shuffle training data to avoid overfitting to one game

How is the policy vector represented?

evalon32 commented 6 years ago

Regarding the move number, I can't speak to the intention, but for what it's worth, the AlphaZero paper lists both the move number and the "no progress counter" as inputs (although I can't imagine why the move number helps: the only way it plays any role is in tournament time controls, and times aren't inputs anyway).

In addition to castling availability (4 bits), I think we're missing en passant availability (16 bits, or just 8 bits with the color-flipping dimensionality reduction). In theory, en passant availability can be derived from the previous board state, but I suspect it's better to have it as a direct input.
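As a rough sketch of what those extra inputs could look like (python-chess + numpy again, an assumed encoding rather than anything from this repo): four constant planes for castling rights plus one plane marking the en passant target square, which is a per-square alternative to the 16/8-bit encoding above.

```python
import chess
import numpy as np

def auxiliary_planes(board: chess.Board) -> np.ndarray:
    """4 constant castling-rights planes + 1 plane marking the en passant target square."""
    planes = np.zeros((5, 8, 8), dtype=np.float32)
    planes[0, :, :] = float(board.has_kingside_castling_rights(chess.WHITE))
    planes[1, :, :] = float(board.has_queenside_castling_rights(chess.WHITE))
    planes[2, :, :] = float(board.has_kingside_castling_rights(chess.BLACK))
    planes[3, :, :] = float(board.has_queenside_castling_rights(chess.BLACK))
    if board.ep_square is not None:  # target square for a possible en passant capture
        planes[4, chess.square_rank(board.ep_square),
               chess.square_file(board.ep_square)] = 1.0
    return planes
```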

Lastly, I've been trying to understand the motivation for using previous states as input. It seems like a huge additional cost for a dubious benefit. Let me make a strawman suggestion: Network A is trained on all the inputs that AlphaZero used. Network B is trained on the current board state plus castling/en passant availability and no other inputs. Can anyone explain what the advantage of A over B is?

Akababa commented 6 years ago

I don't know what goes on in NNs between the first and last layer (and I think very few people do...), so it's good to try everything. That said, I think the history gives the network a sort of "history heuristic" in case the inputs display some temporal locality; the last 8 positions also help to encode en passant and threefold repetition (most of the time). I believe heuristics like this make training faster at the beginning, but the network should rule them out asymptotically (no difference in the long run). In my opinion the same goes for any feature with positive correlation but no causation. There are a few people trying out different inputs; for example, I'm using a "side to move always on bottom" representation, and I think @benediamond is emulating the most recent AZ as closely as possible.

Also: I like to keep in mind how many weights are in each layer. I was surprised the first time I actually counted and found out 90% of them are in the last layer! Empirically I guess this means it rarely hurts to add more inputs because there aren't too many weights up front anyway.
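A quick way to check that on any Keras model (a generic sketch, not code from this repo) is to print each layer's parameter count and its share of the total, which should show the final dense layers dominating as described above:

```python
def print_weight_distribution(model):
    """Print each Keras layer's parameter count and its share of the model total."""
    total = model.count_params()
    for layer in model.layers:
        params = layer.count_params()
        if params > 0:
            print(f"{layer.name:30s} {params:10d}  {100.0 * params / total:5.1f}%")
```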

benediamond commented 6 years ago

Yes, I think the main idea is detecting draws by threefold repetition, though this too is strange, because each frame in the 8-state history also includes a count of how many times that position has repeated. I sympathize with the point.