0. Paper
1. What is it?
In this paper, the authors further evaluate the method they proposed previously (https://aclanthology.org/2022.naacl-main.225) across multiple fine-tuning frameworks.
2. What is amazing compared to previous works?
Previous works analyzed fine-tuned representations through supervised classification tasks.
Recently, these authors proposed a method for analyzing the relationship between fine-tuned representations and human-defined information (part-of-speech, morphology, or chunking).
In that work, they proposed an alignment score, which evaluates how many words $w$ from concept $C_1$ (e.g., a cluster from the fine-tuned LM) are contained in concept $C_2$ (e.g., words tagged NN: noun, singular or mass).
In the experiment, they evaluate how many human-defined concepts the fine-tuned model can align with ($\theta = 0.9$).
Figure 2 from the paper shows the rate of aligned concepts:
$$\frac{\text{Number of concepts where Eq. 1} = 1}{\text{Number of concepts}}$$
![Screenshot 2023-05-01 13 35 31](https://user-images.githubusercontent.com/45454055/235407172-30139bc1-1cb6-442c-b4e3-7a26e25a9f4b.png)
This paper is an upgraded version of the above paper.
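As I understand the alignment score, it thresholds the word overlap between two concepts. A minimal sketch of that reading (my own reconstruction, not the paper's code; `alignment` and `aligned_rate` are hypothetical names, and the paper's exact definition of Eq. 1 may differ):

```python
def alignment(c1: set, c2: set, theta: float = 0.9) -> int:
    """Eq. 1 as I read it: 1 if at least a fraction theta of the words
    in concept c1 (e.g., an LM cluster) also appear in concept c2
    (e.g., NN-tagged words), else 0."""
    if not c1:
        return 0
    overlap = len(c1 & c2) / len(c1)
    return 1 if overlap >= theta else 0


def aligned_rate(clusters: list, human_concepts: list, theta: float = 0.9) -> float:
    """Rate of aligned concepts: fraction of LM clusters for which
    Eq. 1 equals 1 against at least one human-defined concept."""
    aligned = sum(
        any(alignment(c1, c2, theta) for c2 in human_concepts)
        for c1 in clusters
    )
    return aligned / len(clusters)
```

For example, `aligned_rate([{"dog", "cat"}, {"x", "y"}], [{"dog", "cat", "run"}])` gives 0.5 here: the first cluster is fully contained in the human-defined concept, the second has no overlap.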
3. Where is the key to technologies and techniques?
They adapt the concept matching score (https://aclanthology.org/2022.naacl-main.225) to compare the fine-tuned model's concepts, layer by layer, with the base model's concepts, human-defined concepts, and task-specific concepts.
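If I sketch how such a layer-wise comparison might work (a minimal reconstruction under my own assumptions: `matches` and `layerwise_match` are hypothetical names, and concepts are assumed to already be extracted as word sets per layer):

```python
def matches(base_concept: set, tuned_concept: set, theta: float = 0.9) -> bool:
    """True if at least a fraction theta of the base concept's words
    also appear in the fine-tuned concept (thresholded overlap)."""
    if not base_concept:
        return False
    return len(base_concept & tuned_concept) / len(base_concept) >= theta


def layerwise_match(base_layers: list, tuned_layers: list, theta: float = 0.9) -> list:
    """For each layer, the fraction of base-model concepts that align
    with at least one fine-tuned concept at the same layer."""
    return [
        sum(any(matches(b, t, theta) for t in tuned) for b in base) / len(base)
        for base, tuned in zip(base_layers, tuned_layers)
    ]
```

Under this reading, a high score in lower layers and a low score in upper layers would correspond to the pattern the paper reports: lower layers retain base-model concepts while upper layers specialize to the task.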
4. How did they evaluate it?
4.1 Embedding Space (clusters)
Figure 3 shows that:
- the lower layers of the model retain the general language concepts learned in the base model
- the upper layers learn task-specific concepts.
4.2 Human-defined (pos, morph, chunk)
Figure 4 shows that models forget the POS information in the upper layers (POS is less important for sentence classification tasks).
4.3 Task-specific (positive / negative)
Figure 5 shows that models learn positive / negative tags in their upper layers.
5. Is there a discussion?
From Figures 3, 4, and 5, only the ALBERT model behaves differently from the other models. They concluded that the cause is ALBERT's cross-layer parameter sharing.
6. Which paper should I read next?