Supervised learning (SL) has shown that larger networks yield improved performance (e.g., language models). In contrast, scaling networks in RL is challenging and requires sophisticated techniques to stabilize learning, such as supervised auxiliary losses, distillation, and pre-training. Furthermore, deep RL networks under-utilize their parameters.
We demonstrate that incorporating Soft MoEs strongly improves the performance of various deep RL agents, and performance improvements scale with the number of experts used.
MDP, DQN (CNN/IMPALA encoder, replay buffer), Rainbow
A set of $n$ "expert" sub-networks activated by a gating network (typically learned and referred to as the router), which routes each incoming token to $k$ experts.
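A minimal numpy sketch of hard routing with $k=1$ (the Top1-MoE flavour used later), for intuition; the function name `top1_moe_forward`, the linear gate `W_gate`, and the shapes are illustrative assumptions, not taken from the paper's code.

```python
import numpy as np

def top1_moe_forward(X, W_gate, experts):
    """Hard (top-1) routing sketch: each token goes to its single highest-scoring expert.

    X:       (m, d) input tokens
    W_gate:  (d, n) linear gating (router) parameters
    experts: list of n callables, each mapping a (d,) vector to a (d,) vector
    """
    logits = X @ W_gate                                        # (m, n) router scores
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)                  # softmax over experts
    chosen = probs.argmax(axis=1)                              # hard assignment: one expert per token
    # Each token's output is its chosen expert's output, weighted by the gate probability.
    return np.stack([probs[t, chosen[t]] * experts[chosen[t]](X[t])
                     for t in range(X.shape[0])])
```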
Soft MoE uses a fully differentiable soft assignment of tokens to experts, replacing router-based hard token assignments. Let us define the input tokens as $\mathbf{X} \in \mathbb{R}^{m \times d}$. A Soft MoE layer applies a set of $n$ experts on individual tokens, $\{f_i:\mathbb{R}^d \rightarrow \mathbb{R}^d\}_{1:n}$. Each expert has $p$ input- and output-slots, represented respectively by a $d$-dimensional vector of parameters. We denote these parameters by $\boldsymbol{\Phi} \in \mathbb{R}^{d \times (n \cdot p)}$.
The input-slots $\tilde{\mathbf{X}} \in \mathbb{R}^{(n \cdot p) \times d}$ correspond to a weighted average of all tokens: $\tilde{\mathbf{X}}=\mathbf{D}^{\top} \mathbf{X}$, where
$$ \mathbf{D}_{ij}=\frac{\exp ((\mathbf{X} \boldsymbol{\Phi})_{ij})}{\sum_{i^{\prime}=1}^m \exp ((\mathbf{X} \boldsymbol{\Phi})_{i^{\prime} j})} . $$
$\mathbf{D}$ is typically referred to as the dispatch weights. We then denote the expert outputs as $\tilde{\mathbf{Y}}_i=f_{\lfloor i / p \rfloor}(\tilde{\mathbf{X}}_i)$. The output of the Soft MoE layer $\mathbf{Y}$ is the combination of $\tilde{\mathbf{Y}}$ with the combine weights $\mathbf{C}$ according to $\mathbf{Y}=\mathbf{C} \tilde{\mathbf{Y}}$, where
$$ \mathbf{C}_{ij}=\frac{\exp ((\mathbf{X} \boldsymbol{\Phi})_{ij})}{\sum_{j^{\prime}=1}^{n \cdot p} \exp ((\mathbf{X} \boldsymbol{\Phi})_{i j^{\prime}})} $$
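The dispatch/combine equations above translate directly into a few lines of code. A minimal numpy sketch, assuming `experts` is a list of $n$ per-token functions and `Phi` holds the $n \cdot p$ slot parameter vectors as columns (names are illustrative, not from the paper's codebase):

```python
import numpy as np

def soft_moe_forward(X, Phi, experts):
    """Soft MoE forward pass following the equations above.

    X:       (m, d)     input tokens
    Phi:     (d, n * p) one d-dimensional parameter vector per slot
    experts: list of n callables, each mapping a (d,) vector to a (d,) vector
    """
    logits = X @ Phi                                       # (m, n*p)
    # Dispatch weights D: softmax over tokens (normalise each column over i).
    D = np.exp(logits) / np.exp(logits).sum(axis=0, keepdims=True)
    # Combine weights C: softmax over slots (normalise each row over j).
    C = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

    X_tilde = D.T @ X                                      # (n*p, d) input slots
    p = X_tilde.shape[0] // len(experts)
    # Slot i is processed by expert floor(i / p).
    Y_tilde = np.stack([experts[i // p](X_tilde[i]) for i in range(X_tilde.shape[0])])
    return C @ Y_tilde                                     # (m, d) output tokens
```

For example, with $n=4$ experts, $p=1$ slot per expert, and each expert a small MLP, the layer maps $m$ tokens of dimension $d$ back to $m$ tokens of dimension $d$; because the assignment is soft, every expert (and every slot parameter) receives gradient on every step.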
RL agents: DQN and Rainbow with the ResNet (Impala-based) encoder
Environment: 20 games from the ALE
Where to place the MoEs? Penultimate layer
What is a token? Denoting by $C \in \mathbb{R}^{h \times w \times d}$ the output of the convolutional encoder, the tokens are defined as $d$-dimensional slices of this output, yielding $h \cdot w$ tokens (PerConv tokenization; see the sketch below).
What flavour of MoE to use? Top1-MoE and Soft MoE.
Codebase: https://github.com/google/dopamine
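A minimal sketch of the PerConv tokenization described above: the encoder output of shape $(h, w, d)$ is flattened into $h \cdot w$ tokens of dimension $d$, which are then fed to the MoE placed at the penultimate layer. The shapes below are illustrative, not taken from the Dopamine configs.

```python
import numpy as np

h, w, d = 11, 11, 64                     # illustrative encoder output shape
C = np.random.randn(h, w, d)             # output of the convolutional encoder
tokens = C.reshape(h * w, d)             # PerConv: one d-dimensional token per spatial location
# `tokens` would then be passed through the MoE layer (e.g. soft_moe_forward above)
# in place of the penultimate dense layer, and its outputs flattened for the Q-value head.
print(tokens.shape)                      # (121, 64)
```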
As recent research has shown, and as our results confirm, naïvely scaling up network parameters does not improve performance. Our work shows empirically that MoEs have a beneficial effect on the performance of value-based agents across a diverse set of training regimes.
Mixtures of Experts induce a form of structured sparsity in neural networks, prompting the question of whether the benefits we observe are simply a consequence of this sparsity rather than the MoE modules themselves. Our results suggest that it is likely a combination of both.
https://arxiv.org/abs/2402.08609