Adding Soft Actor-Critic method to the fixBroken branch

Added the Soft Actor-Critic (SAC) method from Haarnoja et al. (2018) to ExaRL as an alternative to TD3 and DDPG. I have tested it on Pendulum, where it worked very well. The version compatible with newer gym/tensorflow has also been tested on the MuJoCo Humanoid and Hopper environments, where its behavior was consistent with Haarnoja et al. (2018). That suggests the implementation is at least broadly correct.

I would really love to see someone throw this at the 39-bus powergrid example and see if it works at all. A reasonable starting point is probably the following flags: --horizon 1 --actor_lr 0.0002 --critic_lr 0.0004 --sac_alpha 0.05
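For anyone tuning --sac_alpha, here is a minimal sketch of where the temperature enters the critic target in the usual SAC formulation. It assumes that --sac_alpha is the entropy temperature alpha from the paper, which I have not spelled out above. The function name, the discount factor of 0.99, and the example numbers are purely illustrative and are not taken from the ExaRL code.

```python
def soft_td_target(reward, done, q1_next, q2_next, log_prob_next,
                   alpha=0.05, gamma=0.99):
    """Soft Bellman target used to train the SAC critics (illustrative only).

    target = r + gamma * (1 - done) * (min(Q1', Q2') - alpha * log pi(a'|s'))

    alpha : entropy temperature (assumed to correspond to --sac_alpha)
    gamma : discount factor (0.99 is an assumed, commonly used value)
    """
    # Take the smaller of the two critic estimates to limit overestimation.
    min_q_next = min(q1_next, q2_next)
    # Subtracting alpha * log pi adds an entropy bonus: actions that are
    # unlikely under the policy (very negative log-prob) raise the target.
    soft_value = min_q_next - alpha * log_prob_next
    # Terminal transitions do not bootstrap from the next state.
    not_done = 0.0 if done else 1.0
    return reward + gamma * not_done * soft_value


if __name__ == "__main__":
    # Hypothetical numbers, only to show how alpha changes the target.
    for alpha in (0.0, 0.05, 0.2):
        target = soft_td_target(1.0, False, 2.0, 3.0, -1.0, alpha=alpha)
        print(f"alpha={alpha}: target={target:.4f}")
```

With alpha set to 0 the target reduces to an ordinary clipped double-Q target, so a small value such as 0.05 keeps the entropy bonus mild relative to the reward.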

exalearn / EXARL, pull request #265