
NeurIPS 2021 | Continual World: A Robotic Benchmark For Continual Reinforcement Learning #143

Closed ChufanSuki closed 3 months ago

ChufanSuki commented 4 months ago

https://arxiv.org/abs/2105.10919

https://github.com/awarelab/continual_world

ChufanSuki commented 4 months ago

Continual learning background

Related areas: multi-task learning, curriculum learning, meta-learning

CL training: a sequence of tasks -> a non-stationary data distribution

CL evaluation protocols: catastrophic forgetting, forward transfer, and backward transfer

CL systems operate under constrained resources: computation, memory, neural network size, and the volume of data samples

Continual World benchmark

Metrics: Average performance, Forgetting, Forward transfer, Backward transfer

CW20 sequence: CW10 repeated twice
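As a rough sketch of how these metrics could be computed from logged evaluations (hedged: the success-rate matrix `perf`, the `boundaries` list, and the simplified metric definitions below are assumptions for illustration; the paper's forward-transfer metric additionally requires single-task reference training curves, omitted here):

```python
import numpy as np

def continual_metrics(perf: np.ndarray, boundaries: list[int]) -> dict:
    """perf[i, t]: success rate of task i evaluated at step t of the whole
    sequence; boundaries[i]: step at which training on task i ends.
    Both are hypothetical inputs for illustration."""
    n_tasks = perf.shape[0]
    final = perf[:, -1]                    # per-task performance at sequence end
    # Average performance: mean success rate over all tasks at the end.
    avg_performance = final.mean()
    # Forgetting: drop from performance right after a task was trained
    # to performance at the end of the whole sequence.
    after_training = np.array([perf[i, boundaries[i] - 1] for i in range(n_tasks)])
    forgetting = (after_training - final).mean()
    return {"average_performance": avg_performance, "forgetting": forgetting}
```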

ChufanSuki commented 4 months ago

Continual RL Methods

Regularization-based methods

Structure-based methods

Rehearsal-based methods


ChufanSuki commented 4 months ago

Paper: Overcoming catastrophic forgetting in neural networks

Proposed methods: L2, EWC


\begin{aligned}
p(\theta \mid D) & =\frac{p(\theta, D)}{p(D)}=\frac{p\left(\theta, D_A, D_B\right)}{p\left(D_A, D_B\right)} \\
& =\frac{p\left(\theta, D_B \mid D_A\right) p\left(D_A\right)}{p\left(D_B \mid D_A\right) p\left(D_A\right)} \\
& =\frac{p\left(\theta, D_B \mid D_A\right)}{p\left(D_B \mid D_A\right)} \\
& =\frac{p\left(D_B \mid \theta, D_A\right) p\left(\theta \mid D_A\right)}{p\left(D_B \mid D_A\right)} \\
& =\frac{p\left(D_B \mid \theta\right) p\left(\theta \mid D_A\right)}{p\left(D_B\right)}
\end{aligned}

(The last step uses the assumption that $D_A$ and $D_B$ are independent: $p(D_B \mid \theta, D_A) = p(D_B \mid \theta)$ and $p(D_B \mid D_A) = p(D_B)$.)

\log p(\theta \mid D)=\log p\left(D_B \mid \theta\right)+\log p\left(\theta \mid D_A\right)-\log p\left(D_B\right)

Defining the task-B loss as $L_B(\theta) = -\log p(D_B \mid \theta)$ and dropping the constant $\log p(D_B)$, the optimization goal becomes

\begin{aligned}
\theta & =\arg \max _\theta \log p(\theta \mid D) \\
& =\arg \max _\theta\left(-L_B(\theta)+\log p\left(\theta \mid D_A\right)\right) \\
& =\arg \min _\theta L_B(\theta)-\log p\left(\theta \mid D_A\right)
\end{aligned}

This can be viewed as minimizing the loss on task B while maximizing the posterior of task A, thus avoiding catastrophic forgetting.

The posterior $p(\theta \vert D_A)$ is intractable; it can be approximated with variational inference, MCMC, or the Laplace approximation. This paper uses the Laplace approximation.

The Laplace approximation fits a Gaussian distribution to $p(\theta \vert D_A)$:

p(\theta \vert D_A) \approx \mathcal{N}(\mu, \Sigma)

Let $f(\theta)=\log p(\theta \vert D_A)$. Because $\theta=\theta_A^*$ is an optimal solution (a mode of $f$), we have

\mu=\theta_A^*, \quad \Sigma=\left(-\mathcal{H}_f(\theta_A^*)\right)^{-1}
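This follows from a second-order Taylor expansion of $f$ around $\theta_A^*$ (the first-order term vanishes at the mode):

f(\theta) \approx f\left(\theta_A^*\right)+\frac{1}{2}\left(\theta-\theta_A^*\right)^{\top} \mathcal{H}_f\left(\theta_A^*\right)\left(\theta-\theta_A^*\right)

Exponentiating gives an unnormalized Gaussian in $\theta$ with the mean and covariance above.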

Rewriting the optimization goal:

\begin{aligned}
\theta & =\arg \min _\theta L_B(\theta)-\log p\left(\theta \mid D_A\right) \\
& =\arg \min _\theta L_B(\theta)-\frac{1}{2}\left(\theta-\theta_A^*\right)^{\top} \mathcal{H}_f\left(\theta_A^*\right)\left(\theta-\theta_A^*\right)
\end{aligned}

To reduce the computation cost, we instead compute the Fisher information matrix (by definition, the Fisher information can be computed from first-order derivatives, i.e., the score, rather than the Hessian).

I(\theta) \equiv \mathrm{E}\left[Z(X) Z(X)^{\prime} \mid \theta\right]

where $Z(X)$ is the score function:

Z(X) \equiv \nabla_\theta \log f(x \mid \theta)=\frac{\nabla_\theta f(x \mid \theta)}{f(x \mid \theta)}

The negative expected Hessian of the log-likelihood is equal to the Fisher information matrix $F$ (a standard identity).

To further reduce the computation cost, we assume parameters are independent and keep only the diagonal of $F$:

F_{i i}=-\mathbb{E}_{p\left(\theta \mid D_A\right)}\left[\left.\frac{\partial^2 \log p\left(\theta \mid D_A\right)}{\partial \theta_i^2}\right|_{\theta=\theta_A^*}\right]=\mathbb{E}_{p\left(\theta \mid D_A\right)}\left[\left(\left.\frac{\partial \log p\left(\theta \mid D_A\right)}{\partial \theta_i}\right|_{\theta=\theta_A^*}\right)^2\right]

We use a Monte Carlo method to approximate the expectation:

F_{i i} \approx \mathbb{E}_{x \sim D_A,\, y \sim p_\theta(y \mid x)}\left[\left(\left.\frac{\partial \log p(y \mid x)}{\partial \theta_i}\right|_{\theta=\theta_A^*}\right)^2\right]

In this paper, the expectation is estimated by averaging over every sample in $D_A$:

F_{i i}=\frac{1}{\left|D_A\right|} \sum_{x \in D_A}\left(\left.\frac{\partial \log p_\theta\left(Y=y_x^* \mid x\right)}{\partial \theta_i}\right|_{\theta=\theta_A^*}\right)^2

where $y_x^*$ is the output of $p_{\theta_A^*}(y \vert x)$.
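A minimal PyTorch-style sketch of this estimator (assumptions: `model` is a classifier returning logits, `inputs` iterates over single examples from $D_A$, and labels are drawn from the model's own predictive distribution as in the equation above; none of these names come from the paper):

```python
import torch

def fisher_diagonal(model, inputs):
    """Diagonal Fisher at the current parameters theta_A^*:
    F_ii = (1/|D_A|) * sum_x (d log p(y_x | x) / d theta_i)^2,
    with y_x sampled from the model's predictive distribution."""
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    for x in inputs:                                   # one example at a time
        log_probs = torch.log_softmax(model(x.unsqueeze(0)), dim=-1)
        y = torch.distributions.Categorical(logits=log_probs).sample()
        model.zero_grad()
        log_probs[0, y.item()].backward()              # per-example score
        for n, p in model.named_parameters():
            if p.grad is not None:
                fisher[n] += p.grad.detach() ** 2
    return {n: f / len(inputs) for n, f in fisher.items()}
```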

Finally, the loss function is

L(\theta)=L_B(\theta)+\frac{\lambda}{2}\sum_i F_{ii}\left(\theta_i-\theta_{A,i}^*\right)^2

For multiple tasks ($1,\dots,T$), EWC penalizes deviations from the previous tasks' ($1,\dots,T-1$) optimal parameters ($\theta_1^*, \theta_2^*, \dots, \theta_{T-1}^*$):

L_T^{\text{regularization}}=\frac{1}{2}\sum_i \sum_{t < T} \lambda_t F_{t,i} \left(\theta_i-\theta_{t,i}^*\right)^2
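And a hedged sketch of this multi-task penalty added to the current task's loss (the `fishers`, `opt_params`, and `lambdas` stores are hypothetical names for the per-task Fisher diagonals, frozen optimal parameters, and regularization weights):

```python
def ewc_penalty(model, fishers, opt_params, lambdas):
    """(1/2) * sum_i sum_{t<T} lambda_t * F_{t,i} * (theta_i - theta*_{t,i})^2,
    summed over all previously learned tasks t."""
    penalty = 0.0
    for F_t, theta_t, lam in zip(fishers, opt_params, lambdas):
        for n, p in model.named_parameters():
            # theta_t[n] is a detached copy of the parameters stored
            # right after finishing task t.
            penalty = penalty + (lam * F_t[n] * (p - theta_t[n]) ** 2).sum()
    return 0.5 * penalty

# Training loss on the current task T:
# loss = task_loss + ewc_penalty(model, fishers, opt_params, lambdas)
```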