Adding Soft Actor-Critic method to the version of fixBroken that's been updated for newer gym/tensorflow

Added the Soft Actor-Critic (SAC) method from Haarnoja et al. (2018) to ExaRL as an alternative to TD3 and DDPG. I tested it on Pendulum, where it worked very well. I also tested it on the MuJoCo Humanoid and Hopper environments, where it showed behavior consistent with Haarnoja et al. (2018), which suggests the implementation is working as intended, at least on these tasks.

There are two versions of the SAC agent. v1 is the version from Haarnoja et al. (2018): it draws samples from an unbounded normal and squashes them with a tanh. v0 instead handles the action-space bounds by sampling from a truncated normal distribution. Both showed similar performance and behavior, so I'm keeping both available.
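
To make the difference between the two sampling schemes concrete, here is a minimal standalone sketch for a single scalar action. It is not the code in this PR: the function names (`sample_tanh_squashed`, `sample_truncated_normal`) are hypothetical, it uses only the Python standard library rather than TensorFlow, and the truncated normal is drawn by simple rejection sampling, which is an assumption made for illustration.

```python
import math
import random


def normal_logpdf(x, mu, sigma):
    """Log-density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)


def normal_cdf(x, mu, sigma):
    """CDF of N(mu, sigma^2) at x."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def sample_tanh_squashed(mu, sigma, low, high, rng):
    """v1-style sketch: sample an unbounded normal, squash with tanh,
    then rescale from [-1, 1] to [low, high]."""
    u = rng.gauss(mu, sigma)
    t = math.tanh(u)
    action = low + 0.5 * (t + 1.0) * (high - low)
    # Change of variables: log p(a) = log N(u) - log|da/du|,
    # with da/du = 0.5 * (high - low) * (1 - tanh(u)^2).
    # The small epsilon guards against log(0) when tanh saturates.
    log_det = math.log(0.5 * (high - low)) + math.log(1.0 - t * t + 1e-6)
    return action, normal_logpdf(u, mu, sigma) - log_det


def sample_truncated_normal(mu, sigma, low, high, rng, max_tries=10000):
    """v0-style sketch: sample a normal restricted to [low, high]
    (rejection sampling is an illustrative choice, not the PR's method)."""
    # Probability mass of the untruncated normal inside the bounds.
    mass = normal_cdf(high, mu, sigma) - normal_cdf(low, mu, sigma)
    for _ in range(max_tries):
        a = rng.gauss(mu, sigma)
        if low <= a <= high:
            # Truncated density = normal density renormalized by that mass.
            return a, normal_logpdf(a, mu, sigma) - math.log(mass)
    raise RuntimeError("no sample fell inside the action bounds")


if __name__ == "__main__":
    rng = random.Random(0)
    # Bounds of [-2, 2] are chosen here only as an example.
    print(sample_tanh_squashed(0.3, 1.0, -2.0, 2.0, rng))
    print(sample_truncated_normal(0.3, 1.0, -2.0, 2.0, rng))
```

In both cases the returned action is guaranteed to lie within the bounds; what differs is the log-probability that feeds SAC's entropy term. The tanh version needs the Jacobian correction, while the truncated version only renormalizes the normal density.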


Adding Soft Actor-Critic method to the version of fixBroken that's been updated for newer gym/tensorflow #266