AlphaZeroIncubator / AlphaZero

Our implementation of AlphaZero for simple games such as Tic-Tac-Toe and Connect4.

Initial implementation of training/testing loop #28

Closed guidopetri closed 4 years ago

guidopetri commented 4 years ago

Closes #3. Depends on #9 and #23, since it uses an API that I'm not sure is going to look exactly like this.

This introduces a dataset class that is not used yet - it's an initial implementation (which, frankly, I'm not sure works perfectly) for datasets that are written to files.
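
As a rough sketch of that idea (the class name, file format, and field names below are hypothetical, not necessarily what this PR defines), a file-backed dataset could look roughly like this:

```python
import torch
from torch.utils.data import Dataset


class SelfPlayFileDataset(Dataset):
    """Hypothetical sketch: each file holds one (board, policy, value) sample saved with torch.save."""

    def __init__(self, paths):
        self.paths = list(paths)

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        # Assumes each file stores a dict with "board", "policy", and "value" tensors.
        sample = torch.load(self.paths[idx])
        return sample["board"], sample["policy"], sample["value"]
```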

The training loop is essentially: generate self-play games with MCTS, then train the network on the resulting (state, policy, value) samples; a sketch of this follows.
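
A minimal sketch of what such a loop might look like, with hypothetical names (`model`, `self_play`, `batches`) standing in for the MCTS/model APIs that are still in flux:

```python
import torch
import torch.nn.functional as F


def train_step(model, optimizer, states, target_policies, target_values):
    """One gradient step on a batch of self-play samples.
    Assumed (hypothetical) API: model(states) -> (policy probabilities, value)."""
    policies, values = model(states)                          # (N, H*W), (N, 1)
    value_loss = F.mse_loss(values.squeeze(1), target_values)
    policy_loss = -(target_policies * torch.log(policies + 1e-8)).sum(dim=1).mean()
    loss = value_loss + policy_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


# Outer loop, roughly: alternate self-play and training.
# for iteration in range(n_iterations):
#     samples = self_play(model, n_games)   # MCTS-guided games -> (state, policy, value)
#     for states, target_policies, target_values in batches(samples):
#         train_step(model, optimizer, states, target_policies, target_values)
```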

Since I'm not confident about how the MCTS/model APIs are going to work exactly, this will have to be revisited. I also haven't added any tests yet, for the same reason.

One thing I have to look into is whether the loss calling convention is usually (output, target) or (target, output). In sklearn, it's usually the latter; in PyTorch it seems to be the former.
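
For reference, the two conventions side by side (standard `torch.nn` and `sklearn.metrics` calls, nothing project-specific):

```python
import torch
import torch.nn as nn
from sklearn.metrics import mean_squared_error

output = torch.randn(4, 1)   # model predictions
target = torch.zeros(4, 1)   # ground truth

# PyTorch losses take (input, target), i.e. (output, target)
loss = nn.MSELoss()(output, target)

# sklearn metrics take (y_true, y_pred), i.e. (target, output)
mse = mean_squared_error(target.numpy(), output.numpy())
```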

guidopetri commented 4 years ago

@abhon can you elaborate on the neural net API here? what does it take as input and what does it output?

abhon commented 4 years ago

The main network takes a tensor of the following dimensions:

number of samples × board state (previous moves + player positions) × game length × game height.

In the AlphaGo Zero paper it's game length × game height × board state, resulting in 19 × 19 × 17, but to work with PyTorch's conv2d I had to change it a little bit.

It then splits into the PolicyHead, which returns a vector of length game length × game height of move probabilities, and the ValueHead, which just returns a single number (a float between -1 and 1).
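
To make the shapes concrete, here is a toy stand-in for that API using a Connect 4 sized board (6 × 7); the class, layer sizes, and channel count are illustrative only, not the actual network:

```python
import torch
import torch.nn as nn


class ToyPolicyValueNet(nn.Module):
    """Toy stand-in for the described API: (N, C, H, W) in, (policy, value) out."""

    def __init__(self, channels: int, height: int, width: int):
        super().__init__()
        self.body = nn.Sequential(nn.Conv2d(channels, 16, 3, padding=1), nn.ReLU())
        self.policy_head = nn.Linear(16 * height * width, height * width)
        self.value_head = nn.Linear(16 * height * width, 1)

    def forward(self, x):
        h = self.body(x).flatten(1)
        policy = torch.softmax(self.policy_head(h), dim=1)  # (N, H*W) move probabilities
        value = torch.tanh(self.value_head(h))              # (N, 1), float in [-1, 1]
        return policy, value


net = ToyPolicyValueNet(channels=3, height=6, width=7)      # e.g. Connect 4 with 3 state planes
boards = torch.randn(8, 3, 6, 7)                            # conv2d expects (N, C, H, W)
policy, value = net(boards)
assert policy.shape == (8, 42) and value.shape == (8, 1)
```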

PhilipEkfeldt commented 4 years ago

> It then splits into the PolicyHead, which returns a vector of length game length × game height of move probabilities, and the ValueHead, which just returns a single number (a float between -1 and 1).

I need to go back to the paper, but is the PolicyHead output always a 1D vector, or is it sometimes a 2D/3D tensor (excluding the batch dimension) to account for spatial dimensions and different pieces? I'm just thinking from an API standpoint, where we want to transform the policy to better manage the state change based on the action.

abhon commented 4 years ago

> I need to go back to the paper, but is the PolicyHead output always a 1D vector, or is it sometimes a 2D/3D tensor (excluding the batch dimension) to account for spatial dimensions and different pieces? I'm just thinking from an API standpoint, where we want to transform the policy to better manage the state change based on the action.

So the paper describes the policy as a 1D vector, but I think it would be easy to rearrange it into the board shape with a reshape? It might be easier to store each policy as a flat vector, though, and use some arithmetic to find the probability at a given position; see the sketch below.
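
Both options in a quick sketch (the board size and indices are arbitrary, and the uniform policy is just a placeholder):

```python
import torch

height, width = 6, 7                      # e.g. Connect 4
policy = torch.rand(height * width)
policy = policy / policy.sum()            # stand-in for a policy vector from the net

row, col = 2, 3

# Option 1: reshape the flat vector back into board shape and index spatially.
board_policy = policy.view(height, width)
p_spatial = board_policy[row, col]

# Option 2: keep it flat and index with arithmetic.
p_flat = policy[row * width + col]

assert torch.isclose(p_spatial, p_flat)
```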