

Mixtures of Experts Unlock Parameter Scaling for Deep RL #11

Closed chufanchen closed 8 months ago

chufanchen commented 9 months ago

https://arxiv.org/abs/2402.08609

chufanchen commented 8 months ago

Introduction

Supervised learning (SL) has shown that larger networks yield improved performance (e.g., language models). In contrast, scaling up networks in RL is challenging and requires sophisticated techniques to stabilize learning, such as supervised auxiliary losses, distillation, and pre-training. Furthermore, deep RL networks have been shown to under-utilize their parameters.

We demonstrate that incorporating Soft MoEs strongly improves the performance of various deep RL agents, and performance improvements scale with the number of experts used.

chufanchen commented 8 months ago

Preliminaries

RL

MDP, DQN (CNN/IMPALA encoders, replay buffer), Rainbow

Mixtures of Experts (MoEs)

A set of $n$ "expert" sub-networks activated by a gating network (typically learned and referred to as the router), which routes each incoming token to $k$ experts.
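
A minimal numpy sketch of this top-$k$ routing (not from the paper; the softmax gating over router logits and the `experts` callables are illustrative assumptions):

```python
import numpy as np

def topk_route(x, router_w, experts, k=1):
    # x: (m, d) input tokens; router_w: (d, n) router parameters;
    # experts: list of n callables, each mapping a (d,) token to a (d,) output.
    logits = x @ router_w                          # (m, n) router logits
    gates = np.exp(logits - logits.max(-1, keepdims=True))
    gates /= gates.sum(-1, keepdims=True)          # softmax over experts
    out = np.zeros_like(x)
    for i, token in enumerate(x):
        for e in np.argsort(gates[i])[-k:]:        # indices of the k largest gates
            out[i] += gates[i, e] * experts[e](token)
    return out
```

With $k=1$ this corresponds to the Top1-MoE flavour used later in the paper (some implementations additionally renormalize the gate values over the selected experts).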

Soft MoE uses a fully differentiable soft assignment of tokens to experts, replacing router-based hard token assignments. Let us define the input tokens as $\mathbf{X} \in \mathbb{R}^{m \times d}$. A Soft MoE layer applies a set of $n$ experts on individual tokens, $\{f_i: \mathbb{R}^d \rightarrow \mathbb{R}^d\}_{1:n}$. Each expert has $p$ input- and output-slots, represented respectively by a $d$-dimensional vector of parameters. We denote these parameters by $\boldsymbol{\Phi} \in \mathbb{R}^{d \times (n \cdot p)}$.

The input-slots $\tilde{\mathbf{X}} \in \mathbb{R}^{(n \cdot p) \times d}$ correspond to a weighted average of all tokens: $\tilde{\mathbf{X}}=\mathbf{D}^{\top} \mathbf{X}$, where

$$ \mathbf{D}_{i j}=\frac{\exp \left((\mathbf{X} \boldsymbol{\Phi})_{i j}\right)}{\sum_{i^{\prime}=1}^m \exp \left((\mathbf{X} \boldsymbol{\Phi})_{i^{\prime} j}\right)} . $$

$\mathbf{D}$ is typically referred to as the dispatch weights. We then denote the expert outputs as $\tilde{\mathbf{Y}}_i=f_{\lfloor i / p \rfloor}(\tilde{\mathbf{X}}_i)$. The output of the Soft MoE layer, $\mathbf{Y}$, is the combination of $\tilde{\mathbf{Y}}$ with the combine weights $\mathbf{C}$, according to $\mathbf{Y}=\mathbf{C} \tilde{\mathbf{Y}}$, where

$$ \mathbf{C}_{i j}=\frac{\exp \left((\mathbf{X} \boldsymbol{\Phi})_{i j}\right)}{\sum_{j^{\prime}=1}^{n \cdot p} \exp \left((\mathbf{X} \boldsymbol{\Phi})_{i j^{\prime}}\right)} $$
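
Both $\mathbf{D}$ and $\mathbf{C}$ are softmaxes over the same logits $\mathbf{X} \boldsymbol{\Phi}$, normalized over tokens and over slots respectively. A minimal numpy sketch of the full dispatch/expert/combine computation (shapes follow the notation above; the expert callables and the function signature are illustrative assumptions, not the paper's code):

```python
import numpy as np

def softmax(z, axis):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def soft_moe(X, Phi, experts, p):
    # X: (m, d) tokens; Phi: (d, n*p) slot parameters;
    # experts: list of n callables mapping (p, d) slots to (p, d) outputs.
    logits = X @ Phi                      # (m, n*p), one logit per token/slot pair
    D = softmax(logits, axis=0)           # dispatch weights: normalize over tokens
    C = softmax(logits, axis=1)           # combine weights: normalize over slots
    X_tilde = D.T @ X                     # (n*p, d) input slots (weighted token averages)
    Y_tilde = np.concatenate(
        [f(X_tilde[i * p:(i + 1) * p]) for i, f in enumerate(experts)], axis=0)
    return C @ Y_tilde                    # (m, d) output tokens
```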

chufanchen commented 8 months ago

Method

RL agents: DQN and Rainbow with the Impala-style ResNet encoder

Environments: 20 games from the ALE

Where to place the MoEs? Penultimate layer

What is a token? Denoting by $C \in \mathbb{R}^{h \times w \times d}$ the output of the convolutional encoder, the tokens are defined as the $d$-dimensional slices of this output, i.e., one token per spatial position, for $h \times w$ tokens in total (PerConv).
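
A tiny sketch of what PerConv tokenization amounts to (the concrete sizes below are made up): each of the $h \times w$ spatial positions of the encoder output becomes one $d$-dimensional token.

```python
import numpy as np

# Hypothetical conv-encoder output: h x w spatial positions with d channels each.
h, w, d = 11, 11, 32
conv_out = np.random.randn(h, w, d)

# PerConv tokenization: one token per spatial position -> (h * w, d) tokens.
tokens = conv_out.reshape(h * w, d)
```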

What flavour of MoE to use? Top1-MoE and Soft MoE.

Codebase: https://github.com/google/dopamine
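
Putting the pieces together, a rough sketch of the placement (reusing the `soft_moe` and PerConv sketches above; the flatten-then-linear Q-head and all shapes are assumptions, not the Dopamine implementation): the MoE replaces the penultimate dense layer, and its output tokens are flattened before projecting to Q-values.

```python
import numpy as np

def q_network(conv_features, Phi, experts, p, W_q):
    # conv_features: (h, w, d) encoder output; W_q: (h * w * d, num_actions).
    h, w, d = conv_features.shape
    tokens = conv_features.reshape(h * w, d)      # PerConv tokens
    moe_out = soft_moe(tokens, Phi, experts, p)   # soft_moe from the sketch above
    return moe_out.reshape(-1) @ W_q              # flatten, then linear Q-value head
```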

chufanchen commented 8 months ago

Conclusion

As recent research has shown (and our results confirm), naïvely scaling up network parameters does not result in improved performance. Our work shows empirically that MoEs have a beneficial effect on the performance of value-based agents across a diverse set of training regimes.

Mixtures of Experts induce a form of structured sparsity in neural networks, prompting the question of whether the benefits we observe are simply a consequence of this sparsity rather than the MoE modules themselves. Our results suggest that it is likely a combination of both.

chufanchen commented 8 months ago

Future Work

  1. What role can sparsity play in training deep RL networks, especially for parameter scalability? (See *The State of Sparse Training in Deep Reinforcement Learning*.)
  2. Different values of $k$ for Top1-MoEs, different tokenization choices, and different learning rates (and perhaps optimizers) for the routers.
  3. Going beyond the ALE could provide more comprehensive results and insights, e.g., at lower computational expense (see *Revisiting Rainbow: Promoting more Insightful and Inclusive Deep Reinforcement Learning Research*).