flrngel commented 6 years ago

Abstract

Learn skills without a reward function by maximizing an information-theoretic objective with a maximum entropy policy
After unsupervised learning, train standard reinforcement learning starting from the best skill

1. Introduction

A skill is just a policy
The key idea is the discriminability of skills
- Skills have to be distinguishable
- Skills have to be as diverse as possible

Three important distinctions of the paper
1. Uses maximum entropy policies to force skills to be diverse
2. Fixes the skill distribution p(z) instead of learning it
3. The discriminator looks at every state, not only the final one

The paper says that maximizing diversity works better than a specific reward for learning complex behaviors

H[a|s] - MI(a,z|s) = H[a|s,z] (for a continuous action space)

F(Θ) = H[a|s,z] + H[z] - H[z|s]
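
For reference, the objective above can be expanded step by step. This is my reading of the paper's objective, written with capital letters for the random variables (S for states, A for actions, Z for skills) and θ for the policy parameters; treat it as a sketch rather than a quotation.

```latex
\begin{aligned}
F(\theta) &= I(S; Z) + H[A \mid S] - I(A; Z \mid S) \\
          &= \bigl(H[Z] - H[Z \mid S]\bigr) + H[A \mid S] - \bigl(H[A \mid S] - H[A \mid S, Z]\bigr) \\
          &= H[Z] - H[Z \mid S] + H[A \mid S, Z]
\end{aligned}
```

The first term makes skills distinguishable from the states they visit, and the H[A|S,Z] term keeps each skill as random as possible.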

(alpha = 0.01 gives the most discriminable illustration)

ben-eysenbach commented 6 years ago

Hi @flrngel ,

If you'd like to play around with the code, here is a public implementation: https://github.com/haarnoja/sac/blob/master/DIAYN.md

Here are answers to your questions:

The relevant component of our algorithm is the discriminator, which attempts to tell skills apart. While the architecture for the discriminator in our experiments is a neural network, you definitely could try using a random forest instead (see L218 in the launch script). I expect a random forest may actually work better for tasks where observations have a small number of dimensions. It may also be useful in cases where we want to discriminate on only certain dimensions of the observation, perhaps corresponding to the agent's XY position.
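
As a rough illustration of how any classifier could stand in for the discriminator (this is not taken from the linked implementation), the sketch below turns class probabilities into the pseudo-reward log q(z|s) - log p(z) from the paper, assuming a uniform prior over skills. The names `select_dims` and `diayn_reward` are hypothetical helpers, and `skill_probs` is assumed to come from whatever model is used, for example the `predict_proba` output of a scikit-learn `RandomForestClassifier`.

```python
import math


def select_dims(observation, dims):
    """Keep only the chosen observation dimensions (e.g. the agent's XY position)."""
    return [observation[i] for i in dims]


def diayn_reward(skill_probs, skill, num_skills):
    """Pseudo-reward log q(z|s) - log p(z), assuming a uniform prior p(z) = 1/num_skills.

    skill_probs: discriminator probabilities q(z|s) for one state, one entry per skill.
    skill: index of the skill that actually generated the state.
    """
    eps = 1e-8  # guards against log(0) when the discriminator is overconfident
    return math.log(skill_probs[skill] + eps) - math.log(1.0 / num_skills)


if __name__ == "__main__":
    xy = select_dims([0.3, -1.2, 0.05, 0.9], dims=[0, 1])
    print(xy)  # the discriminator would be trained on these features only
    print(diayn_reward([0.7, 0.1, 0.1, 0.1], skill=0, num_skills=4))
```

The reward is positive when the discriminator identifies the skill better than chance and negative when it does worse, which is what pushes skills toward distinguishable states.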
The critic network is a part of the actor-critic algorithm we use in our implementation. The idea in the paper is not specific to actor-critic algorithms, and can be applied on top of any RL algorithm (e.g., DQN, PPO, ARS, ES), including algorithms that don't use a critic.
In the imitation learning task, each skill visits some distribution over states, and the expert also visits some distribution over states. We take the most straightforward approach: we compute the distance between each skill and the expert, and take the closest skill. The slightly tricky part is computing the distance between distributions over states. If we use the KL-Divergence as our distance metric, then our approach is called an M-Projection. This article has more details on M-Projections and I-Projections.
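
A minimal sketch of the "closest skill" selection, under assumptions that are mine rather than the implementation's: states are discretized into hashable bins, each state distribution is estimated as a smoothed histogram, and the distance is KL(expert || skill), which is the direction usually called the M-Projection. The function names and the smoothing constant are hypothetical.

```python
import math
from collections import Counter


def state_distribution(states, support, smoothing=1e-3):
    """Smoothed histogram over `support`, so every state has nonzero probability."""
    counts = Counter(states)
    total = len(states) + smoothing * len(support)
    return {s: (counts[s] + smoothing) / total for s in support}


def kl_divergence(p, q):
    """KL(p || q) for two distributions defined over the same support."""
    return sum(p[s] * math.log(p[s] / q[s]) for s in p if p[s] > 0)


def closest_skill(expert_states, skill_states):
    """Return the skill whose state distribution minimizes KL(expert || skill).

    expert_states: list of discretized states visited by the expert.
    skill_states: dict mapping skill id -> list of discretized states it visited.
    """
    support = set(expert_states)
    for states in skill_states.values():
        support.update(states)
    expert = state_distribution(expert_states, support)
    divergences = {
        z: kl_divergence(expert, state_distribution(states, support))
        for z, states in skill_states.items()
    }
    best = min(divergences, key=divergences.get)
    return best, divergences


if __name__ == "__main__":
    expert = [0, 0, 1, 1, 2]
    skills = {"a": [0, 0, 1, 1, 2, 2], "b": [5, 5, 6, 6]}
    print(closest_skill(expert, skills))
```

Swapping the arguments of `kl_divergence` would give the I-Projection direction instead. For continuous states, the histogram would need to be replaced by some density estimate.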

flrngel commented 6 years ago

@ben-eysenbach I never expected the author would find this and comment on my question. Thank you!