Q-learning updates the policy during a trajectory, but the current trainer framework abstracts episode generation without any way to train during the episode.
We can modify the framework so that it calls `log_reward` with the return. The Q-learning trainer can then override both `log_reward` and `compute_action` so that it stores the full SARS tuple (state, action, reward, next state). The backpropagation step (the Q update itself) can be implemented in either `log_reward` or `compute_action`, as in the sketch below.
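A minimal sketch of this first approach, assuming the framework's episode loop calls `compute_action(state)` and then `log_reward(reward)` once per step. Only the `compute_action` and `log_reward` names come from the discussion above; the base trainer interface, the `end_episode` hook, the tabular Q representation, and the hyperparameters are illustrative assumptions.

```python
import random
from collections import defaultdict


class QLearningTrainer:  # would subclass the framework's trainer in practice
    def __init__(self, actions, alpha=0.1, gamma=0.99, epsilon=0.1):
        self.q = defaultdict(float)      # Q[(state, action)] -> value
        self.actions = list(actions)
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon
        self.pending = None              # (s, a, r) waiting for the next state

    def compute_action(self, state):
        # The next state s' is only known when the framework asks for the
        # next action, so the pending (s, a, r) transition is updated here.
        if self.pending is not None:
            s, a, r = self.pending
            self._update(s, a, r, state, terminal=False)
        # Epsilon-greedy action selection over the current Q estimates.
        if random.random() < self.epsilon:
            action = random.choice(self.actions)
        else:
            action = max(self.actions, key=lambda a: self.q[(state, a)])
        self.pending = (state, action, None)
        return action

    def log_reward(self, reward):
        # Attach the reward to the stored (s, a); the update itself runs on
        # the next compute_action call (or in end_episode for the last step).
        s, a, _ = self.pending
        self.pending = (s, a, reward)

    def end_episode(self, final_state):
        # Hypothetical end-of-episode hook to flush the final transition.
        if self.pending is not None:
            s, a, r = self.pending
            self._update(s, a, r, final_state, terminal=True)
            self.pending = None

    def _update(self, s, a, r, s_next, terminal):
        # One-step Q-learning update: Q(s,a) += alpha * (target - Q(s,a)).
        best_next = 0.0 if terminal else max(
            self.q[(s_next, a2)] for a2 in self.actions)
        target = r + self.gamma * best_next
        self.q[(s, a)] += self.alpha * (target - self.q[(s, a)])
```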
Alternatively, we can have the Q trainer override the generate episode function and do the training at the same time, as sketched after this paragraph.
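A sketch of this second approach, extending the class above. The `generate_episode` spelling and the env interface (`reset` / `step` returning next state, reward, done) are assumptions, not part of the existing framework; the point is only that the Q update runs inside the generation loop itself.

```python
class InlineQLearningTrainer(QLearningTrainer):
    def generate_episode(self, env):
        # Generate the episode and train at the same time: each transition
        # is applied to the Q table as soon as it is observed.
        state = env.reset()
        done = False
        while not done:
            # Select the action inline (bypassing the pending-transition
            # bookkeeping used by the first approach).
            if random.random() < self.epsilon:
                action = random.choice(self.actions)
            else:
                action = max(self.actions, key=lambda a: self.q[(state, a)])
            next_state, reward, done = env.step(action)
            self._update(state, action, reward, next_state, terminal=done)
            state = next_state
```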
The first approach is more modular, and I think we'll want to explore both.