I see that the policy output of the network has `n_labels` entries, which evaluates to 1968, and that during training you mark the played move with the player's Elo. But I don't understand why you designed the policy head this way.
I am also trying to implement the paper (my project is much simpler than this repo), except that I'm using the same kind of supervised learning instead of self-play, for various reasons. The policy head in my network outputs just 64 values (a move probability for each cell). I trained it overnight on about 100k games and it works, but it's nowhere near your model, which has only been trained on 10k games. The main conceptual difference between our implementations is the output of the policy head, so I suspect that move probabilities per cell alone are not enough. So I have a few questions:
Why didn't you choose a 64-output policy head?
Any thoughts on why your policy head works better than mine?
Why mark the move with the player's Elo rather than a simple 1?
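
To make the difference between the two output spaces concrete, here is a minimal Python sketch. It assumes `n_labels` counts every UCI move string that is geometrically possible on an empty board (queen-like moves, knight jumps, and pawn promotions for both colours). That is my guess at the encoding, not something I've confirmed from the repo, and `build_move_labels` is a hypothetical name.

```python
def build_move_labels():
    """Enumerate UCI move strings that are geometrically possible on an empty board.

    Hypothetical helper: an assumption about how a move-label space can be built,
    not necessarily how this repo builds n_labels.
    """
    files = "abcdefgh"
    ranks = "12345678"
    knight_jumps = [(1, 2), (2, 1), (2, -1), (1, -2),
                    (-1, -2), (-2, -1), (-2, 1), (-1, 2)]
    labels = []
    for f in range(8):
        for r in range(8):
            targets = set()
            for t in range(8):
                targets.add((t, r))          # along the rank
                targets.add((f, t))          # along the file
                targets.add((f + t, r + t))  # the four diagonal directions
                targets.add((f - t, r - t))
                targets.add((f + t, r - t))
                targets.add((f - t, r + t))
            for df, dr in knight_jumps:
                targets.add((f + df, r + dr))
            targets.discard((f, r))          # a move must change square
            for tf, tr in sorted(targets):
                if 0 <= tf < 8 and 0 <= tr < 8:  # drop off-board targets
                    labels.append(files[f] + ranks[r] + files[tf] + ranks[tr])
    # Promotions: straight or diagonal onto the last rank, for both colours.
    for f in range(8):
        for tf in (f - 1, f, f + 1):
            if 0 <= tf < 8:
                for piece in "qrbn":
                    labels.append(f"{files[f]}7{files[tf]}8{piece}")
                    labels.append(f"{files[f]}2{files[tf]}1{piece}")
    return labels


labels = build_move_labels()
# Every label that ends on d4 would collapse onto the same cell of a 64-output head.
moves_to_d4 = [m for m in labels if m[2:4] == "d4"]
print(len(labels), len(moves_to_d4))
```

If that assumption holds, the list has 1968 entries, which matches the size I see. Many different moves also share one destination square, so a 64-value head cannot say which piece should move there.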
I'm new to reinforcement learning (and to deep learning in general), so I hope you don't mind helping me. ty
This is more of a request for help than an issue.