avivt / VIN

Value Iteration Networks

Does VIN naturally work with reinforcement learning? #11

Open xinleipan opened 6 years ago

xinleipan commented 6 years ago

From my reading of the paper, the examples in the main paper are mainly aimed at supervised learning (imitation learning), though there are a few examples using reinforcement learning. So the question is: does VIN naturally work with RL? In addition, almost all of the examples involve extracting some high-level grid-world representation of the state space, so it is not clear how this model could be applied to a more realistic domain where representing all states may be infeasible.

avivt commented 6 years ago

Hi,

The VIN is a mapping from an observation to a probability distribution over actions, so it can be used directly as a policy representation in either supervised learning or RL algorithms. For policy-gradient-type algorithms, this is immediate. For Q-learning, you can think of the VIN's output as approximating a Q-value.
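A minimal sketch of what that means in practice, not taken from the paper or this repo: assume a hypothetical `vin_model` that maps an observation to a vector of per-action scores; those scores can be read either as policy logits (for REINFORCE-style updates) or as approximate Q-values (for a one-step TD target).

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over action scores.
    z = np.exp(x - x.max())
    return z / z.sum()

def sample_action_policy_gradient(vin_scores, rng=np.random):
    """Treat VIN outputs as policy logits and sample an action (policy-gradient view)."""
    probs = softmax(vin_scores)
    action = rng.choice(len(probs), p=probs)
    return action, probs  # probs feed into the log-likelihood term of the gradient

def q_learning_target(vin_scores_next, reward, gamma=0.99, done=False):
    """Treat VIN outputs as Q-values and form the one-step TD target."""
    bootstrap = 0.0 if done else gamma * np.max(vin_scores_next)
    return reward + bootstrap
```

Here `vin_scores` stands in for the output of a VIN forward pass on the current observation; the surrounding training loop (experience collection, gradient updates) is whatever RL algorithm you choose.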

Indeed, the VIN formulation is most suitable for problems where the underlying planning computation can be represented as a finite (and small) MDP. Many problems have this property - see for example the continuous control domain in the paper, where, although the problem was continuous, the essential planning computation could be done on a grid. However, there are many problems where this does not hold. For example, in many Atari games the planning problem is not naturally represented on a grid/graph (at least not in a trivial way). Extending the idea of deep networks that perform a planning computation to such domains is still an active research area. Recent papers along this direction include Value Prediction Networks by Oh et al., and Imagination-Augmented Agents from DeepMind.

Aviv
