ChufanSuki / read-paper-and-code


AAAI 2018 | Rainbow: Combining Improvements in Deep Reinforcement Learning #181

Closed ChufanSuki closed 2 months ago

ChufanSuki commented 2 months ago

https://arxiv.org/abs/1710.02298

https://github.com/google/dopamine

ChufanSuki commented 2 months ago

DQN

\begin{equation}
\left(R_{t+1}+\gamma_{t+1} \max _{a^{\prime}} q_{\bar{\theta}}\left(S_{t+1}, a^{\prime}\right)-q_\theta\left(S_t, A_t\right)\right)^2
\end{equation}

where $t$ is a time step randomly picked from the replay buffer.
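As a concrete reference, here is a minimal PyTorch sketch of this loss on a sampled minibatch. The `online_net`/`target_net` objects and the `batch` layout (observations, actions, rewards, next observations, done flags) are assumptions made for illustration, not part of the paper.

```python
import torch
import torch.nn.functional as F

def dqn_loss(online_net, target_net, batch, gamma):
    """Squared TD error of DQN on a sampled minibatch (illustrative helper)."""
    obs, action, reward, next_obs, done = batch
    # q_theta(S_t, A_t): value of the action actually taken.
    q_sa = online_net(obs).gather(1, action.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # max_a' q_theta_bar(S_{t+1}, a'): bootstrap from the target network.
        next_q = target_net(next_obs).max(dim=1).values
        target = reward + gamma * (1.0 - done) * next_q
    return F.mse_loss(q_sa, target)
```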

Double Q-learning

Motivation (problem): overestimation bias caused by the maximization step in the DQN loss above.

Method: decouple the selection of the action from its evaluation

\begin{equation}
\left(R_{t+1}+\gamma_{t+1} q_{\bar{\theta}}\left(S_{t+1}, \underset{a^{\prime}}{argmax} q_\theta\left(S_{t+1}, a^{\prime}\right)\right)-q_\theta\left(S_t, A_t\right)\right)^2 
\end{equation}
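The only change relative to DQN is in the target: the online network selects the action, the target network evaluates it. A minimal sketch, reusing the hypothetical `online_net`/`target_net` pair from above:

```python
import torch

@torch.no_grad()
def double_q_target(online_net, target_net, reward, next_obs, done, gamma):
    # Select the argmax action with the online network ...
    best_action = online_net(next_obs).argmax(dim=1, keepdim=True)
    # ... but evaluate that action with the target network.
    next_q = target_net(next_obs).gather(1, best_action).squeeze(1)
    return reward + gamma * (1.0 - done) * next_q
```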

Prioritized replay

Motivation: sample frequently those transitions from which there is much to learn.

Solution: prioritized experience replay samples transitions with probability $p_t$ proportional to the last encountered absolute TD error.

\begin{equation}
p_t \propto\left|R_{t+1}+\gamma_{t+1} \max _{a^{\prime}} q_{\bar{\theta}}\left(S_{t+1}, a^{\prime}\right)-q_\theta\left(S_t, A_t\right)\right|^\omega,
\end{equation}

where $\omega$ is a hyper-parameter that determines the shape of the distribution. New transitions are inserted into the replay buffer with maximum priority, providing a bias towards recent transitions. Note that stochastic transitions might also be favoured.
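The toy proportional sampler below illustrates the priority exponent $\omega$, max-priority insertion, and priority updates from absolute TD errors. It is a sketch only: real implementations use a sum-tree for $O(\log N)$ sampling and add importance-sampling weights, and all names here are illustrative.

```python
import numpy as np

class ProportionalReplay:
    """Toy proportional prioritized replay (linear scan, no sum-tree)."""

    def __init__(self, capacity, omega=0.5):
        self.capacity, self.omega = capacity, omega
        self.storage, self.priorities = [], []

    def add(self, transition):
        # New transitions get the current maximum priority.
        max_p = max(self.priorities, default=1.0)
        if len(self.storage) >= self.capacity:
            self.storage.pop(0)
            self.priorities.pop(0)
        self.storage.append(transition)
        self.priorities.append(max_p)

    def sample(self, batch_size):
        p = np.asarray(self.priorities) ** self.omega
        p /= p.sum()
        idx = np.random.choice(len(self.storage), batch_size, p=p)
        return idx, [self.storage[i] for i in idx]

    def update_priorities(self, idx, td_errors):
        # Priority is the last encountered absolute TD error.
        for i, err in zip(idx, td_errors):
            self.priorities[i] = abs(err) + 1e-6
```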

Dueling network

\begin{equation}
q_\theta(s, a)=v_\eta\left(f_{\xi}(s)\right)+a_\psi\left(f_{\xi}(s), a\right)-\frac{\sum_{a^{\prime}} a_\psi\left(f_{\xi}(s), a^{\prime}\right)}{N_{\text {actions }}}
\end{equation}

where $\xi$, $\eta$, and $\psi$ are, respectively, the parameters of the shared encoder $f_{\xi}$, of the value stream $v_{\eta}$, and of the advantage stream $a_{\psi}$; and $\theta = \{\xi, \eta, \psi\}$ is their concatenation.
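A minimal PyTorch sketch of such a dueling head on top of a shared encoder $f_{\xi}$; `feature_dim`, `hidden`, and the two-layer streams are illustrative choices, not the exact Rainbow architecture.

```python
import torch
import torch.nn as nn

class DuelingHead(nn.Module):
    """q(s,a) = v(f(s)) + a(f(s),a) - mean_a' a(f(s),a')."""

    def __init__(self, feature_dim, n_actions, hidden=512):
        super().__init__()
        self.value = nn.Sequential(nn.Linear(feature_dim, hidden), nn.ReLU(),
                                   nn.Linear(hidden, 1))
        self.advantage = nn.Sequential(nn.Linear(feature_dim, hidden), nn.ReLU(),
                                       nn.Linear(hidden, n_actions))

    def forward(self, features):                  # features = f_xi(s)
        v = self.value(features)                  # shape (B, 1)
        a = self.advantage(features)              # shape (B, n_actions)
        return v + a - a.mean(dim=1, keepdim=True)
```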

Multi-step learning

We define the truncated $n$-step return from a given state $S_t$ as

\begin{equation}
R_t^{(n)} \equiv \sum_{k=0}^{n-1} \gamma_t^{(k)} R_{t+k+1}
\end{equation}

where $\gamma_t^{(k)} \equiv \prod_{i=1}^{k} \gamma_{t+i}$. A multi-step variant of DQN is then defined by minimizing the alternative loss,

\begin{equation}
\left(R_t^{(n)}+\gamma_t^{(n)} \max _{a^{\prime}} q_{\bar{\theta}}\left(S_{t+n}, a^{\prime}\right)-q_\theta\left(S_t, A_t\right)\right)^2
\end{equation}
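A small helper illustrating the truncated $n$-step return with per-step discounts (zero at episode termination); the names and the returned bootstrap discount $\gamma_t^{(n)}$ are illustrative.

```python
def n_step_return(rewards, gammas):
    """R_t^(n) = sum_k gamma_t^(k) * R_{t+k+1}, with gamma_t^(k) = prod_{i=1..k} gamma_{t+i}.

    rewards = [R_{t+1}, ..., R_{t+n}]
    gammas  = [gamma_{t+1}, ..., gamma_{t+n}]  (zero at episode termination)
    """
    ret, discount = 0.0, 1.0
    for r, g in zip(rewards, gammas):
        ret += discount * r   # R_{t+k+1} is discounted by gamma_t^(k)
        discount *= g
    return ret, discount      # discount == gamma_t^(n), used when bootstrapping
```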

Distributional RL

Model the distribution of returns with probability masses placed on a discrete support $\boldsymbol{z}$, where $\boldsymbol{z}$ is a vector with $N_{\text{atoms}} \in \mathbb{N}^{+}$ atoms, defined by $z^i = v_{\min} + (i-1)\frac{v_{\max}-v_{\min}}{N_{\text{atoms}}-1}$ for $i \in \{1, \dots, N_{\text{atoms}}\}$. The approximating distribution $d_t$ at time $t$ is defined on this support, with the probability mass $p_\theta^i\left(S_t, A_t\right)$ on each atom $i$, such that $d_t = \left(\boldsymbol{z}, \boldsymbol{p}_\theta\left(S_t, A_t\right)\right)$. The goal is to update $\theta$ such that this distribution closely matches the actual distribution of returns.

For a given state $S_t$ and action $A_t$, the distribution of the returns under the optimal policy $\pi^{\star}$ should match a target distribution defined by taking the distribution for the next state $S_{t+1}$ and action $a_{t+1}^\star = \pi^\star\left(S_{t+1}\right)$, contracting it towards zero according to the discount, and shifting it by the reward (or distribution of rewards, in the stochastic case). A distributional variant of Q-learning is then derived by first constructing a new support for the target distribution, and then minimizing the Kullback-Leibler divergence between the distribution $d_t$ and the target distribution

\begin{equation}
d_t^{\prime} \equiv\left(R_{t+1}+\gamma_{t+1} z, \quad p_{\bar{\theta}}\left(S_{t+1}, \bar{a}_{t+1}^\star\right)\right) 
\end{equation}
\begin{equation}
D_{\mathrm{KL}}\left(\Phi_z d_t^{\prime} \| d_t\right)
\end{equation}

Here $\Phi_z$ is an L2-projection of the target distribution onto the fixed support $\boldsymbol{z}$, and $\bar{a}^{\star}_{t+1} = \underset{a}{\operatorname{argmax}}\, q_{\bar{\theta}}\left(S_{t+1}, a\right)$ is the greedy action with respect to the mean action values $q_{\bar{\theta}}\left(S_{t+1}, a\right) = \boldsymbol{z}^{\top} \boldsymbol{p}_{\bar{\theta}}\left(S_{t+1}, a\right)$ in state $S_{t+1}$.

The parametrized distribution can be represented by a neural network, as in DQN, but with $N_{\text{atoms}} \times N_{\text{actions}}$ outputs. A softmax is applied independently for each action dimension of the output to ensure that the distribution for each action is appropriately normalized.
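A NumPy sketch of the projection $\Phi_z$ onto the fixed support (the categorical-DQN style projection); the shapes and names are assumptions for illustration, and the nested loops are kept for readability rather than speed.

```python
import numpy as np

def project_distribution(next_probs, rewards, gammas, z, v_min, v_max):
    """Project (R_{t+1} + gamma_{t+1} * z, p_theta_bar(S_{t+1}, a*)) onto the support z.

    next_probs: (B, N_atoms) probability masses of the target distribution
    rewards, gammas: (B,) per-transition reward and discount
    z: (N_atoms,) fixed support
    """
    n_atoms = z.shape[0]
    delta_z = (v_max - v_min) / (n_atoms - 1)
    # Shift by the reward and contract by the discount, clipping to the support range.
    tz = np.clip(rewards[:, None] + gammas[:, None] * z[None, :], v_min, v_max)
    b = (tz - v_min) / delta_z                      # fractional atom index
    lower, upper = np.floor(b).astype(int), np.ceil(b).astype(int)
    proj = np.zeros_like(next_probs)
    for i in range(next_probs.shape[0]):
        for j in range(n_atoms):
            if lower[i, j] == upper[i, j]:          # lands exactly on an atom
                proj[i, lower[i, j]] += next_probs[i, j]
            else:                                   # split mass between neighbours
                proj[i, lower[i, j]] += next_probs[i, j] * (upper[i, j] - b[i, j])
                proj[i, upper[i, j]] += next_probs[i, j] * (b[i, j] - lower[i, j])
    return proj  # compare to p_theta(S_t, A_t) with a cross-entropy / KL loss
```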

Noisy Nets

\begin{equation}
\boldsymbol{y}=(\boldsymbol{b}+\mathbf{W} \boldsymbol{x})+\left(\boldsymbol{b}_{\text {noisy }} \odot \epsilon^{\boldsymbol{b}}+\left(\mathbf{W}_{\text {noisy }} \odot \epsilon^w\right) \boldsymbol{x}\right),
\end{equation}

where $\epsilon^b$ and $\epsilon^w$ are random variables, and $\odot$ denotes the element-wise product. This transformation can then be used in place of the standard linear $\boldsymbol{y}=\boldsymbol{b}+\mathbf{W} \boldsymbol{x}$.
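A sketch of such a noisy linear layer in PyTorch, using factorised Gaussian noise (the cheaper variant Rainbow adopts to reduce the number of noise variables); the initialisation constants and `sigma0` are illustrative.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyLinear(nn.Module):
    """y = (b + W x) + (b_noisy * eps_b + (W_noisy * eps_W) x), factorised noise."""

    def __init__(self, in_features, out_features, sigma0=0.5):
        super().__init__()
        self.w_mu = nn.Parameter(torch.empty(out_features, in_features))
        self.w_sigma = nn.Parameter(torch.empty(out_features, in_features))
        self.b_mu = nn.Parameter(torch.empty(out_features))
        self.b_sigma = nn.Parameter(torch.empty(out_features))
        bound = 1.0 / math.sqrt(in_features)
        nn.init.uniform_(self.w_mu, -bound, bound)
        nn.init.uniform_(self.b_mu, -bound, bound)
        nn.init.constant_(self.w_sigma, sigma0 * bound)
        nn.init.constant_(self.b_sigma, sigma0 * bound)

    @staticmethod
    def _f(x):                       # f(x) = sign(x) * sqrt(|x|)
        return x.sign() * x.abs().sqrt()

    def forward(self, x):
        # Factorised Gaussian noise: eps_W = f(eps_out) f(eps_in)^T.
        eps_in = self._f(torch.randn(self.w_mu.shape[1], device=x.device))
        eps_out = self._f(torch.randn(self.w_mu.shape[0], device=x.device))
        w = self.w_mu + self.w_sigma * torch.outer(eps_out, eps_in)
        b = self.b_mu + self.b_sigma * eps_out
        return F.linear(x, w, b)
```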