flrngel / understanding-ai

personal repository

Diversity Is All You Need: Learning Skills without a Reward Function #7

Open flrngel opened 6 years ago

flrngel commented 6 years ago

https://arxiv.org/abs/1802.06070

Abstract

1. Introduction

2. Related Work

The paper argues that maximizing skill diversity can work better than a hand-designed, task-specific reward for learning complex behaviors.

3. Diversity is all you need


3.1. How it works

DIAYN maximizes the mutual information between states and skills while keeping the policy as random as possible; actions are drawn from a continuous action space:

F(Θ) = I(s;z) + H[a|s] − I(a;z|s)

Expanding I(s;z) = H[z] − H[z|s], and noting that H[a|s] − I(a;z|s) = H[a|s,z], gives:

F(Θ) = H[a|s,z] + H[z] − H[z|s]
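In practice the H[z] − H[z|s] term is estimated with a learned discriminator q_φ(z|s), which gives the per-step pseudo-reward r_z(s) = log q_φ(z|s) − log p(z) used in the paper. Below is a minimal sketch; the network shape and interface are my assumptions, not the authors' code.

```python
import numpy as np
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """q_phi(z | s): predicts which skill z produced state s.
    The architecture is a placeholder; the paper uses a small MLP."""
    def __init__(self, obs_dim: int, num_skills: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_skills),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)  # logits over skills

def diayn_pseudo_reward(disc, obs: torch.Tensor, z: int, num_skills: int):
    """r_z(s) = log q_phi(z|s) - log p(z), with p(z) uniform over skills."""
    log_q = torch.log_softmax(disc(obs), dim=-1)[..., z]
    log_p = -np.log(num_skills)  # log-probability of the uniform prior
    return log_q - log_p
```

The H[a|s,z] term is not part of this reward; it is handled by optimizing the policy with a maximum-entropy RL algorithm (SAC in the paper).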

3.2. Implementation

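A hedged skeleton of the training loop the paper describes: sample a skill, roll it out with the task reward discarded, reward the policy with the discriminator's log-probability, and train the discriminator to predict the skill from visited states. The `env`, `policy`, and `disc` interfaces are placeholder assumptions, and `diayn_pseudo_reward` is the helper sketched in 3.1.

```python
import numpy as np

# Skeleton of DIAYN training; `env` follows the classic gym step() API,
# and `policy.act/update`, `disc.update` are hypothetical interfaces.
def train_diayn(env, policy, disc, num_skills, num_episodes):
    for _ in range(num_episodes):
        z = np.random.randint(num_skills)  # sample a skill from the uniform prior p(z)
        obs = env.reset()
        done = False
        while not done:
            action = policy.act(obs, z)              # the policy is conditioned on z
            next_obs, _, done, _ = env.step(action)  # the task reward is discarded
            # numpy-to-tensor conversion omitted for brevity
            r = diayn_pseudo_reward(disc, next_obs, z, num_skills)
            policy.update(obs, z, action, r, next_obs)  # max-entropy RL step (SAC in the paper)
            disc.update(next_obs, z)  # supervised step: maximize log q_phi(z | s)
            obs = next_obs
```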

4. What skills are learned?

Figure from the paper: skills learned under different entropy coefficients; α = 0.01 gives the most discriminable skills.

Question

ben-eysenbach commented 6 years ago

Hi @flrngel ,

If you'd like to play around with code, here is a public implementation: https://github.com/haarnoja/sac/blob/master/DIAYN.md

Here are answers to your questions:

  1. The relevant component of our algorithm is the discriminator, which attempts to tell skills apart. While the discriminator in our experiments is a neural network, you definitely could try using a random forest instead (see L218 in the launch script, and the first sketch after this list). I expect a random forest may actually work better for tasks where observations have a small number of dimensions. It may also be useful in cases where we want to discriminate on only certain dimensions of the observation, perhaps corresponding to the agent's XY position.
  2. The critic network is part of the actor-critic algorithm we use in our implementation. The idea in the paper is not specific to actor-critic algorithms and can be applied on top of any RL algorithm (e.g., DQN, PPO, ARS, ES), including algorithms that don't use a critic.
  3. In the imitation learning task, each skill visits some distribution over states, and the expert also visits some distribution over states. We take the most straightforward approach: compute the distance between each skill's state distribution and the expert's, and pick the closest skill (see the second sketch after this list). The slightly tricky part is computing the distance between distributions over states. If we use the KL divergence as our distance metric, then our approach is called an M-projection. This article has more details on M-projections and I-projections.
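Regarding (1), a minimal sketch of what a random-forest discriminator could look like, using the scikit-learn interface. The buffer, the periodic refit, and the `dims` restriction to (say) XY coordinates are my assumptions for illustration, not the authors' implementation.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

class ForestDiscriminator:
    """Approximates q(z | s) with a random forest refit periodically
    on a buffer of (state, skill) pairs."""
    def __init__(self, num_skills, dims=None, n_estimators=100):
        self.num_skills = num_skills
        self.dims = dims  # e.g. [0, 1] to discriminate on XY position only
        self.clf = RandomForestClassifier(n_estimators=n_estimators)
        self.states, self.skills = [], []

    def record(self, state, z):
        self.states.append(state if self.dims is None else state[self.dims])
        self.skills.append(z)

    def refit(self):
        self.clf.fit(np.asarray(self.states), np.asarray(self.skills))

    def log_q(self, state, z):
        s = state if self.dims is None else state[self.dims]
        probs = self.clf.predict_proba(np.asarray(s).reshape(1, -1))[0]
        classes = list(self.clf.classes_)
        # predict_proba only covers skills seen so far; floor for a finite log
        p = probs[classes.index(z)] if z in classes else 0.0
        return np.log(max(p, 1e-6))
```

Unlike a neural discriminator, the forest can't take per-step gradient updates, hence the refit-on-a-buffer design.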
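And for (3), a sketch of picking the skill closest to the expert under the KL metric (the M-projection): build empirical state distributions over a shared discretization and take the arg-min. The binning scheme is an illustrative assumption; `bins` must be the same per-dimension edges for every histogram so the distributions are comparable.

```python
import numpy as np

def empirical_dist(states, bins, eps=1e-6):
    """Empirical distribution over visited states on a fixed discretization.
    `bins` is a list of bin-edge arrays, one per state dimension."""
    hist, _ = np.histogramdd(states, bins=bins)
    p = hist.flatten() + eps  # smooth so the KL stays finite
    return p / p.sum()

def closest_skill(expert_states, skill_states_list, bins):
    """Return the index of the skill minimizing KL(expert || skill)."""
    p = empirical_dist(expert_states, bins)
    kls = []
    for skill_states in skill_states_list:
        q = empirical_dist(skill_states, bins)
        kls.append(np.sum(p * np.log(p / q)))  # KL(expert || skill)
    return int(np.argmin(kls))

# Hypothetical usage: 2-D states binned into a 20x20 grid on [-1, 1]^2
# idx = closest_skill(expert_states, per_skill_states,
#                     bins=[np.linspace(-1, 1, 21)] * 2)
```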
flrngel commented 6 years ago

@ben-eysenbach I never expected the author would find this and answer my questions. Thank you!