Closed chufanchen closed 7 months ago
Setting: Pre-train $\rightarrow$ downstream adaptation in continual learning
Existing methods: L2P, DualPrompt, ESN
Limitations:
To accommodate various PET methods, a key challenge is that different PET modules have different adaptation speeds (for novel tasks), as well as different forgetting speeds (for historical tasks). $\rightarrow$ learning with calibrated speed
model $f(\cdot;\theta, \phi)$
Pre-trained feature extractors: $\theta_{pre}$
Classifier $\phi_{old}$ for all learned tasks $\mathcal{T}_{1:i}$
Classifier $\phi_{new}$ for the current learning task $\mathcal{T}_{i}$
Naive Baseline: Seq-FT
At the beginning of training, we learn only $\phi_{new}$ with $\theta_{pet}$ frozen;
then, after $\phi_{new}$ has been sufficiently learned and the loss has significantly decreased, we jointly learn both $\phi_{new}$ and $\theta_{pet}$.
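The warm-up schedule above can be sketched as a simple gating rule; the threshold name and values below are hypothetical, not from the paper:

```python
def trainable_params(step_loss, warmup_threshold, phi_new, theta_pet):
    """Return which parameter groups to update at the current step.

    Stage 1 (warm-up): only the new classifier phi_new is trained,
    the PET module theta_pet stays frozen.
    Stage 2 (joint): once the loss has significantly decreased,
    phi_new and theta_pet are trained together.
    """
    if step_loss > warmup_threshold:
        return [phi_new]          # warm-up: classifier adapts first
    return [phi_new, theta_pet]   # joint stage: PET module unfrozen
```

In a real trainer this would select optimizer parameter groups (or toggle `requires_grad`) each epoch.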
Inspired by the Complementary Learning System of the human brain, we define two experts: the hippocampus-like online PET module $\theta_{pet}^{on}$ (i.e., the $\theta_{pet}$ above) and the neocortex-like offline PET module $\theta_{pet}^{off}$.
The $\theta_{pet}^{off}$ slowly accumulates the learned knowledge as the model learns a new task via an accumulation function; empirically, the simple Exponential Moving Average (EMA) algorithm works well for LAE:
$$\theta_{pet}^{off} \leftarrow \alpha \cdot \theta_{pet}^{off} + (1-\alpha) \cdot \theta_{pet}^{on},$$
where $\alpha \in (0,1)$ is a large (i.e., close to 1) decay factor.
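The EMA accumulation is a one-line element-wise update; a minimal sketch over flat parameter lists (the function name is ours, not the repo's):

```python
def ema_update(theta_off, theta_on, alpha=0.99):
    """theta_off <- alpha * theta_off + (1 - alpha) * theta_on, element-wise.

    alpha close to 1 means the offline expert changes slowly,
    retaining knowledge of historical tasks.
    """
    return [alpha * off + (1.0 - alpha) * on
            for off, on in zip(theta_off, theta_on)]
```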
A classifier can be viewed as an energy model when we define the unnormalized negative log probability as the energy function. The energy produced by $\theta_{pet}^{on}$ for samples of newer tasks should be smaller than that produced by $\theta_{pet}^{off}$, and vice versa for samples of older tasks.
$$f_{ens}\left(\mathbf{o}^{on}, \mathbf{o}^{off}\right) := \max\left(\sigma\left(\mathbf{o}^{on}\right), \sigma\left(\mathbf{o}^{off}\right)\right),$$
where $\sigma$ is the softmax function, and $\mathbf{o}^{on}$ and $\mathbf{o}^{off}$ are the outputs of the online and offline expert models (i.e., $f(\cdot ; \theta_{pet}^{on}, \phi)$ and $f(\cdot ; \theta_{pet}^{off}, \phi)$) for an inference sample, respectively.
https://arxiv.org/abs/2303.10070
https://github.com/gqk/lae