Q-learning updates the policy during a trajectory, but the current trainer framework abstracts episode generation without any way to train during the episode.
We can modify the framework so that it calls `log_reward` with the return. The Q-learning trainer can then override both `log_reward` and `compute_action` so that it stores the full SARS tuple (state, action, reward, next state). The backpropagation step (the Q update itself) can be implemented in either `log_reward` or `compute_action`, as in the sketch below.
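A minimal sketch of this first approach, assuming the framework's episode loop calls `compute_action(state)` and then `log_reward(reward)` once per step. Only the `compute_action` and `log_reward` names come from the discussion above; the base trainer interface, the `end_episode` hook, the tabular Q representation, and the hyperparameters are illustrative assumptions.

```python
import random
from collections import defaultdict


class QLearningTrainer:  # would subclass the framework's trainer in practice
    def __init__(self, actions, alpha=0.1, gamma=0.99, epsilon=0.1):
        self.q = defaultdict(float)      # Q[(state, action)] -> value
        self.actions = list(actions)
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon
        self.pending = None              # (s, a, r) waiting for the next state

    def compute_action(self, state):
        # The next state s' is only known when the framework asks for the
        # next action, so the pending (s, a, r) transition is updated here.
        if self.pending is not None:
            s, a, r = self.pending
            self._update(s, a, r, state, terminal=False)
        # Epsilon-greedy action selection over the current Q estimates.
        if random.random() < self.epsilon:
            action = random.choice(self.actions)
        else:
            action = max(self.actions, key=lambda a: self.q[(state, a)])
        self.pending = (state, action, None)
        return action

    def log_reward(self, reward):
        # Attach the reward to the stored (s, a); the update itself runs on
        # the next compute_action call (or in end_episode for the last step).
        s, a, _ = self.pending
        self.pending = (s, a, reward)

    def end_episode(self, final_state):
        # Hypothetical end-of-episode hook to flush the final transition.
        if self.pending is not None:
            s, a, r = self.pending
            self._update(s, a, r, final_state, terminal=True)
            self.pending = None

    def _update(self, s, a, r, s_next, terminal):
        # One-step Q-learning update: Q(s,a) += alpha * (target - Q(s,a)).
        best_next = 0.0 if terminal else max(
            self.q[(s_next, a2)] for a2 in self.actions)
        target = r + self.gamma * best_next
        self.q[(s, a)] += self.alpha * (target - self.q[(s, a)])
```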
Alternatively, we can have the Q trainer override the generate episode function and do the training at the same time, as sketched after this paragraph.
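A sketch of this second approach, extending the class above. The `generate_episode` spelling and the env interface (`reset` / `step` returning next state, reward, done) are assumptions, not part of the existing framework; the point is only that the Q update runs inside the generation loop itself.

```python
class InlineQLearningTrainer(QLearningTrainer):
    def generate_episode(self, env):
        # Generate the episode and train at the same time: each transition
        # is applied to the Q table as soon as it is observed.
        state = env.reset()
        done = False
        while not done:
            # Select the action inline (bypassing the pending-transition
            # bookkeeping used by the first approach).
            if random.random() < self.epsilon:
                action = random.choice(self.actions)
            else:
                action = max(self.actions, key=lambda a: self.q[(state, a)])
            next_state, reward, done = env.step(action)
            self._update(state, action, reward, next_state, terminal=done)
            state = next_state
```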
The first approach is more modular, and I think we'll want to explore both.