It seems that increasing the hidden layer sizes from (32, 32) to (400, 300) is what causes convergence. Also changed beta & alpha -> 0.1, batch_size -> 100, both lr -> 0.0001, but I suspect the disparity in network capacity is the main reason divergence occurred. Still need to fully track the stats to confirm.
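A minimal sketch of the network-size change described above, assuming a DDPG-style actor/critic setup in PyTorch. The state/action dimensions, the `mlp` helper, and the meaning of alpha/beta are all assumptions for illustration, not taken from the log:

```python
# Hypothetical sketch: swap hidden sizes (32, 32) -> (400, 300).
# STATE_DIM / ACTION_DIM are placeholder values, not from the log.
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM = 8, 2

def mlp(in_dim, hidden, out_dim, out_act=None):
    """Simple feed-forward net; `hidden` was (32, 32) before, (400, 300) after."""
    layers, d = [], in_dim
    for h in hidden:
        layers += [nn.Linear(d, h), nn.ReLU()]
        d = h
    layers.append(nn.Linear(d, out_dim))
    if out_act is not None:
        layers.append(out_act)
    return nn.Sequential(*layers)

# Actor maps state -> bounded action; critic maps (state, action) -> Q-value.
actor = mlp(STATE_DIM, (400, 300), ACTION_DIM, nn.Tanh())
critic = mlp(STATE_DIM + ACTION_DIM, (400, 300), 1)

# Hyperparameters noted above; lr = 0.0001 for both networks.
# (What alpha/beta refer to is ambiguous in the log, so they are omitted here.)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-4)
BATCH_SIZE = 100
```

A quick forward pass on a batch of 100 states confirms the shapes line up before plugging the nets into the training loop.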
-> Policy diverges quickly. Since the gradients have been fixed (hopefully), the main suspects are probably one of these (or a combination):