Introduction: Reinforcement learning (RL) enables an agent to learn decision-making through interaction with its environment. The advent of DQN has made RL widely applicable in fields such as autonomous driving, game AI, and robotic control. Deep reinforcement learning can further cope with large-scale, data-intensive environments by using large neural networks. Deep Deterministic Policy Gradient (DDPG) parameterizes the agent's decision-making behavior and relies on the policy gradient method for learning, which allows it to handle high-dimensional continuous action domains.
By storing temporally correlated experience and sampling it with uniform probability, experience replay breaks temporal correlations, better satisfying the independent-and-identically-distributed data assumption and improving training efficiency. Prioritized Experience Replay (PER) showed that sampling with non-uniform probability can further improve data utilization. Building on this idea, Deep Curriculum Reinforcement Learning (DCRL) increases the sampling probability of experiences that are suitable for the current stage of learning. The complexity index function of DCRL consists of two parts: the self-paced prioritized function, which measures the importance of experience transitions, and the coverage penalty, which avoids excessive replay of the same transitions. However, the meaningful domain of the complexity index function is rather narrow in their implementation, which limits how widely DCRL can be applied.
The Upper Confidence Bound (UCB) algorithm has been applied to a variety of agent exploration problems. It addresses limitations of purely exploration-based strategies, such as differing levels of knowledge about each option and suboptimal performance. Depending on the assumed noise distribution, the algorithm takes many different forms. In Monte Carlo tree search, the UCB value is the weighted sum of the node's average value and a term inversely related to the number of times the node has been explored. This form has an effect similar to the complexity index function of DCRL: both are weighted sums of a function that increases with the sample's own value and a function that decreases with the number of times it has been sampled.
In this work, we introduce Upper Confidence Bound applied to Experience Replay (UCER). By combining the Upper Confidence Bound with the complexity index function of DCRL, UCER not only overcomes the limitations of exploration-based strategies but also selects data of appropriate complexity for the agent's current stage of training. To verify the effect of UCER, we first apply it to DQN on Atari game environments and compare its performance with DCRL and PER. We then implement it on Deep Deterministic Policy Gradient (DDPG) and evaluate the efficiency of our model on OpenAI Gym's MuJoCo continuous control environments.
Related Work: The basic principle of experience replay is to store each transition obtained during exploration (in its basic form consisting of the current state $s_{t}$, action $a_{t}$, immediate reward $r_{t}$, and next state $s_{t+1}$) in an experience memory, and then, each time the agent performs a parameter update, to randomly select a batch of transitions for mini-batch gradient descent. The advantage of such training is that the data used in each update can be regarded as temporally independent, satisfying the independent-and-identically-distributed condition required for neural network learning and thereby reducing the over-fitting caused by data correlation.
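As a minimal sketch of this mechanism (the buffer capacity and function names below are illustrative, not taken from the text), a uniform replay buffer can be written as:

```python
import random
from collections import deque

class ReplayBuffer:
    """Minimal uniform experience replay: store (s, a, r, s') transitions
    and sample them with equal probability."""

    def __init__(self, capacity=100000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are evicted first

    def store(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # Uniform sampling breaks the temporal correlation between training samples.
        return random.sample(self.buffer, batch_size)
```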
However, there is still room for improvement in this form of experience replay, because the data collected during reinforcement learning exploration inevitably differ in complexity and learning value. Improving the replay mechanism itself can alleviate these problems and increase the learning efficiency of the reinforcement learning model. PER uses the TD error of each experience transition as its sampling priority, which indicates how far the current value estimate is from its bootstrap target: the larger this distance, the more suitable the transition is for incremental learning, because the resulting gradient is more pronounced.
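The sketch below illustrates PER's proportional sampling rule in a simplified form; the original implementation uses a sum-tree for efficiency, and the exponent `alpha` and the small constant added to $|\delta_i|$ are common PER conventions, assumed here rather than quoted from this text:

```python
import numpy as np

def sample_proportional(td_errors, batch_size, alpha=0.6, eps=1e-6):
    """Simplified proportional prioritized sampling.
    Priority p_i = (|delta_i| + eps)^alpha, sampling probability p_i / sum_j p_j."""
    priorities = (np.abs(td_errors) + eps) ** alpha
    probs = priorities / priorities.sum()
    indices = np.random.choice(len(td_errors), size=batch_size, p=probs)
    return indices, probs[indices]
```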
Curriculum learning aims to improve learning efficiency by ordering training data by complexity. As in human learning, the model's learning process should progress from simple to complex. Most research combining deep reinforcement learning with curriculum learning is related to transfer learning, for example designing and sequencing training environments so that the agent acquires transferable knowledge and eventually masters more complex environments. Another combination, the DCRL method, instead adjusts the sampling probability in experience replay so that the model samples, with higher probability, experiences suited to its current learning progress.
DCRL combines the idea of curriculum learning with deep reinforcement learning. Building on PER, DCRL uses the complexity index function $CI(x_i)$ to define the probability that experience transition $x_i$ is sampled:
\begin{equation}\label{eq:CI} CI(x_i) = SP(\delta_i, \lambda) + \eta CP(cn_i) \end{equation}
Here, $SP(\delta_i, \lambda)$ is the self-paced prioritized function, where $\delta_i$ is the TD error of transition $x_i$ and $\lambda$ is the curriculum factor indicating the age of the model. $SP(\delta_i, \lambda)$ reaches its maximum when $|\delta_i| = \lambda$, and its value decreases as the difference between $|\delta_i|$ and $\lambda$ grows; this makes the agent more inclined to learn experiences suited to its current progress. $CP(cn_i)$ is the coverage penalty function, a monotonically decreasing function of $cn_i$, the number of times transition $x_i$ has been sampled; it increases the diversity of sampling. In DCRL, $SP(\delta_i, \lambda)$ and $CP(cn_i)$ are implemented as follows:
\begin{equation}\label{eq:SP} SP(\delta_i, \lambda)=\left\{ \begin{aligned} &\exp(|\delta_i| - \lambda) & & |\delta_i| \leq \lambda \\ &\frac{1}{\log(1-\lambda)}\log(|\delta_i| - 2\lambda + 1) & & \lambda < |\delta_i| < 2\lambda \\ &0 & & |\delta_i| \geq 2\lambda \end{aligned} \right. \end{equation}
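The snippet below sketches how Equations \eqref{eq:CI} and \eqref{eq:SP} could be computed. Since the exact coverage-penalty implementation is not reproduced in this excerpt, the $1/(cn_i+1)$ term is only an illustrative monotonically decreasing placeholder, not DCRL's actual $CP$:

```python
import numpy as np

def self_paced_priority(delta, lam):
    """Self-paced prioritized function SP(delta, lambda):
    peaks at |delta| = lambda and vanishes once |delta| >= 2*lambda.
    Assumes the curriculum factor lam lies in (0, 1)."""
    d = abs(delta)
    if d <= lam:
        return np.exp(d - lam)
    elif d < 2 * lam:
        return np.log(d - 2 * lam + 1) / np.log(1 - lam)
    return 0.0

def complexity_index(delta, cn, lam, eta=0.1):
    """Complexity index CI = SP + eta * CP.
    The coverage penalty 1/(cn + 1) below is only an illustrative
    monotonically decreasing choice."""
    return self_paced_priority(delta, lam) + eta / (cn + 1.0)
```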
The UCB algorithm is a method for agent exploration. Facing uncertainty, UCB follows the principle of optimism and assumes that the unknown bonus of each option is as large as plausibly possible. UCB therefore gives higher exploitation probability to known branches with large returns, while also giving higher exploration probability to branches that have been explored relatively few times.
The original upper confidence bound is a bandit algorithm. When making a selection, UCB assumes that the value of each branch is as high as plausibly possible. UCB takes different forms in different applications (in contrast to a single fixed rule such as the $\epsilon$-greedy algorithm). We first select the form used in UCT (Upper Confidence Bound applied to Trees)~\cite{Kroese2015Why} as our reference:
\begin{equation}\label{eq:UCB_in_UCT} UCT(x_t) = V(x_t) + \sqrt{\frac{\log N}{n_t}} \end{equation}
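A minimal sketch of selecting a child node with Equation \eqref{eq:UCB_in_UCT} is shown below; the unvisited-node rule and the data layout are assumptions made for illustration:

```python
import math

def uct_value(mean_value, total_visits, node_visits):
    """UCT score: estimated value plus an exploration bonus that
    shrinks as the node is visited more often."""
    if node_visits == 0:
        return float("inf")  # unvisited nodes are tried first
    return mean_value + math.sqrt(math.log(total_visits) / node_visits)

def select_child(children):
    """children: list of (mean_value, node_visits) pairs.
    Returns the index of the child maximizing the UCT score."""
    total = sum(n for _, n in children)
    scores = [uct_value(v, total, n) for v, n in children]
    return scores.index(max(scores))
```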
\begin{figure}[t] \abovecaptionskip = 3pt \includegraphics[width=\linewidth]{ACML2019-PMLR-template/Algorithm_1.png} \caption{DCRL in DDPG} \label{fig:algorithm_1} \end{figure}
Here, $V(x_t)$ represents the expected value of $x_t$, $N$ is the total number of explorations, and $n_t$ is the number of times node $x_t$ has been explored. Equation~\eqref{eq:UCB_in_UCT} is positively related to the node value and negatively related to the number of times the node has been explored. If we regard the self-paced prioritized function as the 'node value', Equation~\eqref{eq:UCB_in_UCT} takes a form similar to Equation~\eqref{eq:CI}.
The reinforcement learning model we consider is based on \textit{Markov decision processes} (MDP). An agent interacts with a single environment $E$: it receives state $s_t$ from the environment and returns action $a_t$; it then receives a new state $s_{t+1}$ and a reward $r_t$, where $s_{t+1}$ is determined by the state transition function $\mathcal{P}(s_{t+1}|s_t,a_t)$ and $r_t$ is determined by $\mathcal{R}(s_t,a_t)$. We consider a deterministic policy $\pi:\mathcal{S}\mapsto\mathcal{A}$, which maps the observed state to the action the agent outputs. The goal of reinforcement learning is to maximize the expected cumulative return $R_t=\sum_{i=t}^{T}\gamma^{i-t}\mathcal{R}(s_i,a_i)$, where $\gamma\in\left(0,1\right]$ is the discount factor.
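For concreteness, the discounted return $R_t$ can be computed backwards over an episode, as in the following sketch:

```python
def discounted_return(rewards, gamma=0.99):
    """Compute R_t = sum_{i=t}^{T} gamma^(i-t) * r_i for every time step t,
    working backwards through the list of episode rewards."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns
```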
In some deep reinforcement learning methods based on Equation~\eqref{eq:Q_2}, the Q function is approximated by a deep neural network, as in Deep Q Network (DQN)~\cite{Hosu2016Playing}: the environmental state is taken as input, and the Q value corresponding to each action is produced as output. However, the deficiency is obvious: the output dimension of the network equals the size of the action space, so if the action space is uncountable, it cannot work under this framework.
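A minimal sketch of such a network (assuming PyTorch; the layer sizes are illustrative) makes the limitation explicit: the output layer must have one entry per discrete action.

```python
import torch.nn as nn

class QNetwork(nn.Module):
    """DQN-style value network: the state is the input and the output layer
    has one Q value per discrete action, which is why an uncountable
    (continuous) action space does not fit this architecture."""

    def __init__(self, state_dim, num_actions, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_actions),  # one output per action
        )

    def forward(self, state):
        return self.net(state)  # shape: (batch, num_actions)
```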
Abstract: Reinforcement learning aims at learning action policies from interaction experience. Existing research combines the ideas of deep reinforcement learning and curriculum learning to improve the agent's learning efficiency by dynamically arranging the courses (i.e., experience transitions) on the fly. Most curriculum-based RL methods rely on domain-dependent heuristics in curriculum design and frequently require a prohibitive amount of engineering effort in complex domains. In this work, we develop an algorithm called UCER (Upper Confidence Bound applied to Experience Replay) that is general enough to be directly applied to a variety of curriculum-based RL problems. By setting a curriculum complexity function adapted to the actual difficulty of the course, UCER enables an RL agent to dynamically adjust course difficulty while ensuring course diversity at the same time. We evaluated UCER with Deep Deterministic Policy Gradient on OpenAI Gym's MuJoCo environments, and the results suggest that UCER improves learning efficiency in comparison to baseline methods, including a recent deep curriculum reinforcement learning (DCRL) approach.