need help understanding the network

this is more of a request for help than an issue.

i see that the policy output of the network has n_labels entries, which evaluates to about 1968, and that during training you mark the played move with the player's elo. but i don't get why you designed the policy head this way.
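for context, 1968 is exactly the number you get when you enumerate every move string a piece could ever make in UCI notation: all queen-like and knight-like from-to pairs, plus the pawn promotions. below is a minimal sketch of that count, assuming this is how n_labels is built. the function name `enumerate_uci_moves` is hypothetical and not taken from the repo.

```python
def enumerate_uci_moves():
    """List every from-to move a chess piece could ever make, as UCI strings."""
    files = "abcdefgh"
    ranks = "12345678"
    moves = []
    for f1 in range(8):
        for r1 in range(8):
            for f2 in range(8):
                for r2 in range(8):
                    df, dr = abs(f1 - f2), abs(r1 - r2)
                    if df == 0 and dr == 0:
                        continue  # a move must change square
                    # rook, bishop, queen, king and pawn moves all lie on a
                    # straight line or a diagonal (castling is e.g. "e1g1")
                    queen_like = df == 0 or dr == 0 or df == dr
                    knight_like = {df, dr} == {1, 2}
                    if queen_like or knight_like:
                        moves.append(files[f1] + ranks[r1] + files[f2] + ranks[r2])
    # promotions carry a fifth character for the promoted piece
    for src_rank, dst_rank in (("7", "8"), ("2", "1")):
        for f1 in range(8):
            for f2 in (f1 - 1, f1, f1 + 1):  # capture left, push, capture right
                if 0 <= f2 < 8:
                    for piece in "qrbn":
                        moves.append(files[f1] + src_rank + files[f2] + dst_rank + piece)
    return moves


labels = enumerate_uci_moves()
print(len(labels))  # 1968
```

under this assumption each policy output corresponds to one complete move (source square, destination square, and promotion piece if any), not to one square.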

i am also trying to implement the paper (my whole project is much simpler than this repo), except that i'm using the same supervised learning approach instead of self-play, for various reasons. the policy head in my network outputs just 64 values (a move probability for each square). i trained overnight on about 100k games and it works, but it's nowhere near your model, which has only been trained on 10k games. the main conceptual difference between our implementations is the output of the policy head, so i suspect that move probabilities over the 64 squares alone are not enough. so i have a few questions:
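to make the suspicion concrete, here is a small sketch of one way a 64-value head can lose information. it assumes the 64 outputs index the destination square only, which may not match the actual setup. all function names are hypothetical, and the from-to index below uses a plain 64 x 64 layout for simplicity rather than a compact label list.

```python
FILES = "abcdefgh"


def square_index(square):
    """Map a square name like 'f3' to 0..63 (a1 = 0, h8 = 63)."""
    return (int(square[1]) - 1) * 8 + FILES.index(square[0])


def dest_only_target(uci_move):
    """Hypothetical 64-way target: only the destination square is encoded."""
    return square_index(uci_move[2:4])


def from_to_target(uci_move):
    """Hypothetical per-move target: source and destination are both encoded."""
    return square_index(uci_move[0:2]) * 64 + square_index(uci_move[2:4])


# two different moves that land on the same square
knight_move, queen_move = "g1f3", "d1f3"

print(dest_only_target(knight_move) == dest_only_target(queen_move))  # True
print(from_to_target(knight_move) == from_to_target(queen_move))      # False
```

with the destination-only encoding both moves share one output slot, so the network cannot say which piece it prefers to move there. a per-move encoding keeps them apart.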

why didn't you choose a 64-output policy head?
any thoughts on why your policy head is better than mine?
why mark the move with the player's elo rather than a simple 1?

i'm new to reinforcement learning (and deep learning in general), so i hope you don't mind helping me. ty

Zeta36 / chess-alpha-zero