Closed chufanchen closed 7 months ago
Setting: Pre-train $\rightarrow$ downstream adaptation in continual learning
Existing methods: L2P, DualPrompt, ESN
Limitations:
To accommodate various PET methods, a key challenge is that different PET modules have different adaptation speeds (for novel tasks), as well as different forgetting speeds (for historical tasks). $\rightarrow$ learning with calibrated speed
model $f(\cdot;\theta, \phi)$
Pre-trained feature extractors: $\theta_{pre}$
Classifier $\phi_{old}$ for all learned tasks $\mathcal{T}_{1:i}$
Classifier $\phi_{new}$ for the current learning task $\mathcal{T}_{i}$
Naive Baseline: Seq-FT
At the beginning of training, we learn only $\phi_{new}$ with $\theta_{pet}$ frozen;
then, after $\phi_{new}$ has been sufficiently learned and the loss has significantly decreased, we jointly learn both $\phi_{new}$ and $\theta_{pet}$.
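The warm-up schedule above can be sketched as a simple gating rule; the threshold name and values below are hypothetical, not from the paper:

```python
def trainable_params(step_loss, warmup_threshold, phi_new, theta_pet):
    """Return which parameter groups to update at the current step.

    Stage 1 (warm-up): only the new classifier phi_new is trained,
    the PET module theta_pet stays frozen.
    Stage 2 (joint): once the loss has significantly decreased,
    phi_new and theta_pet are trained together.
    """
    if step_loss > warmup_threshold:
        return [phi_new]          # warm-up: classifier adapts first
    return [phi_new, theta_pet]   # joint stage: PET module unfrozen
```

In a real trainer this would select optimizer parameter groups (or toggle `requires_grad`) each epoch.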
Inspired by the Complementary Learning System of the human brain, we define two experts: the hippocampus-like online PET module $\theta_{pet}^{on}$ (i.e., the $\theta_{pet}$ above) and the neocortex-like offline PET module $\theta_{pet}^{off}$.
The $\theta_{pet}^{off}$ slowly accumulates the learned knowledge as the model learns a new task via an accumulation function; empirically, the simple Exponential Moving Average (EMA) algorithm works well for LAE:
$$\theta_{pet}^{off} \leftarrow \alpha \cdot \theta_{pet}^{off} + (1-\alpha) \cdot \theta_{pet}^{on},$$
where $\alpha \in (0,1)$ is a large (i.e., close to 1) decay factor.
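The EMA accumulation is a one-line element-wise update; a minimal sketch over flat parameter lists (the function name is ours, not the repo's):

```python
def ema_update(theta_off, theta_on, alpha=0.99):
    """theta_off <- alpha * theta_off + (1 - alpha) * theta_on, element-wise.

    alpha close to 1 means the offline expert changes slowly,
    retaining knowledge of historical tasks.
    """
    return [alpha * off + (1.0 - alpha) * on
            for off, on in zip(theta_off, theta_on)]
```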
A classifier can be viewed as an energy model when we define the unnormalized negative log probability as the energy function. The energy produced by $\theta_{pet}^{on}$ for samples of newer tasks should be smaller than that produced by $\theta_{pet}^{off}$, and vice versa for samples of older tasks.
$$f_{ens}\left(\mathbf{o}^{on}, \mathbf{o}^{off}\right) := \max\left(\sigma\left(\mathbf{o}^{on}\right), \sigma\left(\mathbf{o}^{off}\right)\right),$$
where $\sigma$ is the softmax function, and $\mathbf{o}^{on}$ and $\mathbf{o}^{off}$ are the outputs of the online and offline expert models (i.e., $f(\cdot ; \theta_{pet}^{on}, \phi)$ and $f(\cdot ; \theta_{pet}^{off}, \phi)$) for an inference sample, respectively.
https://arxiv.org/abs/2303.10070
https://github.com/gqk/lae