Reading: Hidden State Variability of Pretrained Language Models Can Guide Computation Reduction for Transfer Learning

0. Paper

paper: [link]
Findings of EMNLP 2023

1. What is it?

They propose a metric to predict the optimal (well-specialized) layer for fine-tuning.

2. What is amazing compared to previous works?

Their metric does not require any training or hyperparameter tuning. Tuning based on the optimal layer performs downstream tasks effectively at a lower computation cost than tuning all of the layers.

3. What is the key to the technique?

3.1 Metric

[Figure 1]

They define the metric to evaluate the optimal (well-specialized) layer as follows:

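(1) Compute a sequence (sentence) embedding for each input by averaging the hidden states of all tokens except the CLS token.
(2) Group the embeddings by target label (blue, red, and yellow in Figure 1).
(3) Calculate the within-group variability and the between-group variability of the embeddings.
(4) Based on these two scores, define the task-specialty of the layer: a layer is well-specialized when its within-group variability is low relative to its between-group variability.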
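
The steps above can be sketched in NumPy. This is a minimal illustration, not the authors' implementation: the function names `sentence_embeddings` and `task_specialty` are hypothetical, the CLS token is assumed to sit at position 0, the global mean is assumed to be the mean of the class means, and the score is assumed to take the form trace(Sigma_w · pinv(Sigma_b)) / (number of classes), where a lower value means a more specialized layer. Check the exact definition against the paper before relying on it.

```python
import numpy as np


def sentence_embeddings(hidden_states, attention_mask):
    """Mean-pool token hidden states of one layer.

    hidden_states: (batch, seq_len, dim); attention_mask: (batch, seq_len).
    Assumption: position 0 holds the CLS token, so it is skipped.
    Padding positions (mask == 0) are ignored as well.
    """
    hidden_states = np.asarray(hidden_states, dtype=float)
    mask = np.asarray(attention_mask, dtype=float)[:, 1:, None]
    summed = (hidden_states[:, 1:, :] * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1.0, None)  # avoid division by zero
    return summed / counts


def task_specialty(embeddings, labels):
    """Assumed score: trace(Sigma_w @ pinv(Sigma_b)) / num_classes.

    Lower values mean tighter, better separated label groups.
    """
    embeddings = np.asarray(embeddings, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    dim = embeddings.shape[1]

    # Mean embedding of each label group, and their average
    # (assumption: the global mean is the mean of the class means).
    class_means = np.stack([embeddings[labels == c].mean(axis=0) for c in classes])
    global_mean = class_means.mean(axis=0)

    # Within-group variability: average of the per-class covariance matrices.
    sigma_w = np.zeros((dim, dim))
    for c, mu in zip(classes, class_means):
        centered = embeddings[labels == c] - mu
        sigma_w += centered.T @ centered / len(centered)
    sigma_w /= len(classes)

    # Between-group variability: covariance of the class means.
    diffs = class_means - global_mean
    sigma_b = diffs.T @ diffs / len(classes)

    # Sigma_b has rank at most num_classes - 1, hence the pseudo-inverse.
    score = np.trace(sigma_w @ np.linalg.pinv(sigma_b, rcond=1e-10))
    return float(score / len(classes))
```

To compare layers, one would compute `task_specialty` on the pooled hidden states of every layer for the same labeled task corpus and look for the layer with the lowest score.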

3.2 Tuning strategy

[Figure 2]

Given the optimal layer, they try the four tuning strategies shown in Figure 2.

4. How did they evaluate it?

[Figure 3]

Figure 3 shows that their metric is highly correlated with task performance.

[Figures 4 and 5]

Figures 4 and 5 show that their tuning strategies (tuning only the optimal layers) achieve performance comparable to tuning all of the layers.
