According to the original paper, the loss function is a combination of MSE for the Q value and cross-entropy for the move probability distribution, together with L2 regularization on the parameter vector theta. There is a parameter c that controls the strength of the regularization, but the MSE and CE losses are weighted equally.
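Written out, and assuming the paper referred to here is the AlphaGo Zero paper, the per-position loss would look roughly like this, where z is the game outcome, v the predicted value, pi the search move probabilities, and p the network's move probabilities (symbol names follow that paper and are an assumption as far as this thread goes):

$$
l = (z - v)^2 - \pi^{\top} \log p + c \, \lVert \theta \rVert^2
$$

Note that the first two terms carry no relative weight of their own; only the regularization term is scaled, by c.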
We should probably start with something like this, maybe with an extra parameter to control the ratio between the MSE and CE terms?
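As a starting point, here is a minimal NumPy sketch of such a loss with one extra knob for the MSE/CE ratio. The function name `combined_loss`, the `value_weight` parameter, and the default for `c` are all hypothetical choices for illustration, not something taken from the paper or from any existing code.

```python
import numpy as np

def combined_loss(value_pred, value_target, policy_logits, policy_target,
                  params, c=1e-4, value_weight=1.0):
    """Sketch of an MSE + cross-entropy + L2 loss (all names are assumptions).

    value_pred, value_target: shape (batch,)
    policy_logits, policy_target: shape (batch, n_moves)
    params: iterable of weight arrays to regularize
    value_weight: hypothetical knob scaling MSE relative to CE (1.0 = equal)
    """
    # Mean squared error on the value head.
    mse = np.mean((value_target - value_pred) ** 2)

    # Numerically stable log-softmax over the move logits.
    shifted = policy_logits - policy_logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

    # Cross-entropy between the target move distribution and the prediction.
    ce = -np.mean(np.sum(policy_target * log_probs, axis=1))

    # L2 penalty: sum of squared weights over all parameter arrays.
    l2 = sum(np.sum(p ** 2) for p in params)

    return value_weight * mse + ce + c * l2
```

With `value_weight=1.0` this reduces to the equally weighted form described above; a value below 1 would de-emphasize the MSE term relative to the cross-entropy. Whether a single ratio parameter is the right interface (versus one weight per term) is an open design question.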