prioritizing past experiences based on temporal-difference (TD) error
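As a reminder of how that prioritization works, here is a minimal sketch of proportional TD-error prioritization in the style of prioritized experience replay; the class name, alpha, and eps values are illustrative assumptions, not this paper's implementation.

```python
import numpy as np

class PrioritizedReplay:
    """Minimal proportional prioritization: P(i) ~ (|td_error_i| + eps)^alpha."""

    def __init__(self, capacity, alpha=0.6, eps=1e-6):
        self.capacity = capacity
        self.alpha = alpha
        self.eps = eps
        self.data = []        # stored transitions
        self.priorities = []  # one priority per stored transition

    def add(self, transition, td_error):
        if len(self.data) >= self.capacity:   # drop the oldest when full
            self.data.pop(0)
            self.priorities.pop(0)
        self.data.append(transition)
        self.priorities.append((abs(td_error) + self.eps) ** self.alpha)

    def sample(self, batch_size):
        probs = np.array(self.priorities)
        probs /= probs.sum()
        idx = np.random.choice(len(self.data), size=batch_size, p=probs)
        return [self.data[i] for i in idx], idx

    def update_priorities(self, idx, td_errors):
        # refresh priorities with the latest TD errors after a learning step
        for i, err in zip(idx, td_errors):
            self.priorities[i] = (abs(err) + self.eps) ** self.alpha
```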
Optimality tightening (He et al., 2017) is similar to this paper
Experience replay for actor-critic
the actor-critic framework can also utilize experience replay
difference between off-policy and on-policy (see StackOverflow)
off-policy evaluation involves importance sampling (e.g., ACER and Reactor use Retrace for evaluation), which may not benefit much from past experience if the past policy is very different from the current policy
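For concreteness, a small sketch of the truncated importance weights used by Retrace-style off-policy evaluation, c_t = lambda * min(1, pi(a_t|s_t) / mu(a_t|s_t)); the function and argument names below are illustrative assumptions, not ACER/Reactor code.

```python
import numpy as np

def retrace_targets(rewards, q_sa, v_next_pi, pi_probs, mu_probs,
                    gamma=0.99, lam=1.0):
    """Retrace-style targets for one trajectory generated by behavior policy mu.

    rewards[t]   : r_t
    q_sa[t]      : Q(s_t, a_t) under the current estimate
    v_next_pi[t] : E_pi[Q(s_{t+1}, .)], expected next value under the target policy
    pi_probs[t]  : pi(a_t | s_t)  (target policy)
    mu_probs[t]  : mu(a_t | s_t)  (behavior policy that generated the data)
    """
    T = len(rewards)
    # truncated importance weights keep the off-policy correction low-variance
    c = lam * np.minimum(1.0, np.asarray(pi_probs) / np.asarray(mu_probs))
    # one-step TD errors evaluated under the target policy
    delta = np.asarray(rewards) + gamma * np.asarray(v_next_pi) - np.asarray(q_sa)

    targets = np.zeros(T)
    acc = 0.0
    for t in reversed(range(T)):  # backward recursion: acc_t = delta_t + gamma * c_{t+1} * acc_{t+1}
        acc = delta[t] + (gamma * c[t + 1] * acc if t + 1 < T else 0.0)
        targets[t] = q_sa[t] + acc
    return targets
```

When pi and mu diverge, the min(1, pi/mu) truncation pushes c toward zero, so old experience contributes little to the targets, which is exactly the limitation noted above.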
this paper does not involve importance sampling and is applicable to both discrete and continuous control
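The SIL objective in the paper imitates only the agent's own past transitions whose return exceeds the current value estimate, via the clipped advantage (R - V)_+, so no importance ratio is needed. A minimal PyTorch-style sketch, where the function name, batching, beta default, and detach placement are my assumptions, not the authors' code:

```python
import torch

def sil_loss(log_prob_a, value, returns, beta=0.01):
    """Self-imitation losses over a batch of transitions from the replay buffer.

    log_prob_a : log pi_theta(a | s) for the stored actions       (shape [B])
    value      : V_theta(s) from the critic                       (shape [B])
    returns    : discounted cumulative return R stored in buffer  (shape [B])
    """
    clipped_adv = torch.clamp(returns - value, min=0.0)           # (R - V)_+
    # policy term: imitate only actions whose return beat the value estimate
    policy_loss = -(log_prob_a * clipped_adv.detach()).mean()
    # value term: push V up toward returns that exceeded it
    value_loss = 0.5 * (clipped_adv ** 2).mean()
    return policy_loss + beta * value_loss
```

Because the loss only needs log pi(a|s), the same objective applies to discrete and continuous action spaces, matching the note above.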
https://arxiv.org/abs/1806.05635
Abstract
SIL (Self-Imitation Learning) aims to verify that exploiting past good experiences can indirectly drive deep exploration.
1. Introduction
2. Related work
3. Self-Imitation Learning
4. Theoretical Justification
5. Experiment
6. Conclusion