facebookresearch / mtrl

Multi Task RL Baselines
MIT License

The centroid and star-shaped structure of DisTraL #8

Closed c4cld closed 3 years ago

c4cld commented 3 years ago

Description

In Section 2.3, "Policy Gradient and a Better Parameterization", of "Distral: Robust Multitask Reinforcement Learning", the authors argue that the centroid and star-shaped structure of DisTraL is good for learning a better distilled policy. However, the explanation is very brief. Could you explain the advantages of the centroid and star-shaped structure in detail? I have tried to contact the authors of the paper but have not received a reply, so I am asking here.

shagunsodhani commented 3 years ago

Hi! Thank you for the question. The paper mentions that Distral learns a distilled policy in the space of policies, which is better than learning in the space of parameters. The basic idea is that it should be easier to interpolate in the space of functions/policies than in the parameter space. Two networks can make very similar predictions on any given input and still have very different weights. Averaging their predictions would give meaningful predictions, while averaging their weights could result in a model that is worse than either of the original two. This is also related to how we perform ensembling: we average the predictions of multiple models, not their weights.
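As a toy illustration of the point about averaging predictions versus averaging weights, here is a minimal sketch. It is not taken from the Distral paper or from this repository: the tiny `mlp` function and all weight values are hypothetical, chosen only so that two networks compute the same function with different weights (the second is the first with its two hidden units swapped).

```python
import math

def mlp(x, w1, w2):
    """Hypothetical one-hidden-layer network: y = sum_j w2[j] * tanh(w1[j] * x)."""
    return sum(b * math.tanh(a * x) for a, b in zip(w1, w2))

def average(u, v):
    """Element-wise mean of two equally sized weight lists."""
    return [(p + q) / 2 for p, q in zip(u, v)]

# Network A, and network B which is A with its two hidden units swapped.
# They represent the same function but have different weight vectors.
w1_a, w2_a = [1.0, -2.0], [3.0, 0.5]
w1_b, w2_b = [-2.0, 1.0], [0.5, 3.0]

x = 0.7
pred_a = mlp(x, w1_a, w2_a)
pred_b = mlp(x, w1_b, w2_b)

# Averaging in function space: mean of the two predictions.
pred_avg = (pred_a + pred_b) / 2

# Averaging in parameter space: one network built from the mean weights.
weight_avg_pred = mlp(x, average(w1_a, w1_b), average(w2_a, w2_b))

print(pred_a, pred_b, pred_avg, weight_avg_pred)
```

In this example the averaged prediction matches both original networks, while the weight-averaged network gives a clearly different output, because the mean of the two weight vectors does not correspond to either network's function.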

The paper does not comment about star/centroid being better. If you think otherwise, could you please point me to the relevant line in the paper?

c4cld commented 3 years ago

@shagunsodhani Thank you for your selfless help. The paper mentions star/centroid in Section 2.3, "Policy Gradient and a Better Parameterization". Could you explain the advantages of the centroid and star-shaped structure in detail?

shagunsodhani commented 3 years ago

One advantage could be reduced computation: every model distills with a single central model, so the number of distillation operations is linear in the number of models. With, say, a fully-connected topology, the number of operations would be quadratic. The obvious limitation of the star topology is that information exchange is bottlenecked on the central model.
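The linear-versus-quadratic count can be sketched as follows. This is only a counting illustration under the assumption that one "distillation operation" corresponds to one pair of models that exchange information; the function names and the `"central"` / `"task_i"` labels are hypothetical and do not come from the paper or this repository.

```python
from itertools import combinations

def star_pairs(num_tasks):
    """Star topology: each task model is paired only with the central model."""
    return [("central", f"task_{i}") for i in range(num_tasks)]

def fully_connected_pairs(num_tasks):
    """Fully-connected topology: every unordered pair of task models."""
    tasks = [f"task_{i}" for i in range(num_tasks)]
    return list(combinations(tasks, 2))

for n in (5, 10, 50):
    # Star grows as n; fully connected grows as n * (n - 1) / 2.
    print(n, len(star_pairs(n)), len(fully_connected_pairs(n)))
```

Note that in the star topology every pair includes `"central"`, which is the bottleneck mentioned above: two task models only influence each other through that shared model.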

c4cld commented 3 years ago

@shagunsodhani Thank you very much!

shagunsodhani commented 3 years ago

Cool - closing the task - feel free to reopen if needed :)