falcondai opened this issue 4 years ago
It is not that we have a no-op button; rather, the decision-making routine can fire off an option that consumes the transit time, essentially teleporting itself to the next interesting moment.
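A minimal sketch of the idea, assuming a toy environment where the ball is in transit and actions have no effect for a few steps; `wait_option` and the `interesting` predicate are hypothetical names, not from any real codebase:

```python
def make_toy_env(transit_time=5):
    """Toy env: the ball is in transit for `transit_time` steps,
    during which the state just counts down regardless of action."""
    state = {"t": transit_time}
    def step(action):
        state["t"] = max(0, state["t"] - 1)
        interesting = state["t"] == 0  # the ball has arrived
        return state["t"], interesting
    return step

NOOP = 0

def wait_option(step, max_len=100):
    """Fixed routine (reactive policy): keep issuing NOOP until the
    next interesting moment, then return control to the learner."""
    for k in range(1, max_len + 1):
        s, interesting = step(NOOP)
        if interesting:
            return s, k  # k is the elapsed (SMDP) transit time
    return s, max_len

step = make_toy_env(transit_time=5)
s, elapsed = wait_option(step)
print(elapsed)  # the decision maker lands 5 steps ahead
```

The learner only makes a decision at the moments where `interesting` fires; everything in between is absorbed into one SMDP transition of duration `elapsed`.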
But where is the attention (or even consciousness) here? Isn't executing the option just as attentive as executing a policy? This may only make sense in an online setting: an online learning algorithm is being executed, while the option is a fixed routine (a reactive policy). The difference is that the learning algorithm can change its decision in the same state on a different visit, whereas a fixed policy cannot.
Relates to Bengio's talk at NeurIPS'19
Write a blog post about the visualization of A2C playing Atari Pong. It seems that the actions are valued about the same most of the time (the effective horizon is limited by gamma), and only rarely is a specific action actually intended. This can motivate the study of semi-Markov decision processes: we can decide to take no action in anticipation of the transit of the ball.
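One way to quantify "actions are about the same most of the time" is to flag frames where the policy's action distribution is near-uniform (entropy close to its maximum). A sketch with fabricated per-frame probabilities standing in for a real 6-action Pong policy's outputs:

```python
import math
import random

def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def near_uniform(probs, tol=0.05):
    """True when the policy barely prefers any action, i.e. its
    entropy is within `tol` nats of the uniform maximum."""
    return math.log(len(probs)) - entropy(probs) < tol

# Fake per-frame action distributions: mostly indifferent frames,
# with an occasional decisive one (as the visualization suggests).
random.seed(0)
frames = []
for t in range(100):
    if t % 10 == 0:  # decisive frame: one action dominates
        probs = [0.9] + [0.1 / 5] * 5
    else:            # indifferent frame: near-uniform probabilities
        raw = [1.0 + random.uniform(-0.01, 0.01) for _ in range(6)]
        z = sum(raw)
        probs = [r / z for r in raw]
    frames.append(probs)

indifferent = sum(near_uniform(p) for p in frames)
print(f"{indifferent}/100 frames look action-indifferent")
```

The same test run on a real A2C policy's softmax outputs would mark the long stretches where the ball is in transit, i.e. exactly the spans an SMDP option could absorb.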