In my tests, I found that switching the weight initialization from random_normal to Xavier initialization speeds up training considerably.
On CPU only, the original code takes about 3.5K episodes to reach a reward of ~22, which is roughly the maximum I was able to obtain when reproducing the code.
With Xavier initialization, the code converges to the same result by episode 1K, which takes < 30 minutes on my MacBook Pro, again on CPU only.
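
For anyone curious what the change amounts to, here is a minimal, library-free sketch of Xavier (Glorot) uniform initialization next to a plain normal initializer. This is not the original code: the function names `xavier_uniform` and `plain_normal`, the `stddev` default, and the layer sizes are all made up for illustration. In practice you would use your framework's built-in Xavier initializer.

```python
import math
import random


def xavier_uniform(fan_in, fan_out, rng=None):
    """Sketch of Xavier/Glorot uniform init (hypothetical helper).

    Returns a fan_in x fan_out matrix (list of rows) drawn from
    U(-limit, limit), where limit = sqrt(6 / (fan_in + fan_out)).
    """
    rng = rng or random.Random()
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-limit, limit) for _ in range(fan_out)]
            for _ in range(fan_in)]


def plain_normal(fan_in, fan_out, stddev=1.0, rng=None):
    """Fixed-stddev normal init for comparison (stddev is an assumed value)."""
    rng = rng or random.Random()
    return [[rng.gauss(0.0, stddev) for _ in range(fan_out)]
            for _ in range(fan_in)]


if __name__ == "__main__":
    # Hypothetical layer sizes, chosen only for the demo.
    w = xavier_uniform(80, 20, random.Random(0))
    print(len(w), len(w[0]), max(abs(x) for row in w for x in row))
```

The point is that the Xavier range shrinks as the layer gets wider, so the initial activations stay in a reasonable range, whereas a fixed standard deviation does not adapt to the layer size.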