0. Paper
paper: arxiv
1. What is it?
They analyze contextualized word representations from BERT.
2. What is amazing compared to previous works?
Previous work found that BERT embeddings capture grammatical information (the dependency tree) and that this structure relates to Euclidean distance. -> This work analyzes it in more detail.
Previous works analyzed BERT embeddings through pipeline probing tasks (POS tagging, coreference resolution, dependency labeling). -> This work analyzes BERT's internal representations directly.
3. What are the key technologies and techniques?
To analyze grammatical information, they use a model-wide attention vector.
![Screenshot 2023-02-07 10 53 17](https://user-images.githubusercontent.com/45454055/217127787-03ec9547-eba8-4ced-a915-15de8dae05a3.png)
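A minimal sketch (not the paper's exact implementation) of how such a model-wide attention vector could be built with Hugging Face transformers; the model choice, helper name, and token positions are my own assumptions:

```python
# Sketch: concatenate the attention weights between two token positions across
# every layer and head, in both directions, to get one "model-wide" vector per
# word pair. Model choice and helper name are illustrative assumptions.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True)
model.eval()

def model_wide_attention(sentence: str, i: int, j: int) -> torch.Tensor:
    """Attention weights between positions i and j from all layers and heads."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        attentions = model(**inputs).attentions  # tuple of (1, heads, seq, seq)
    per_layer = [torch.cat([att[0, :, i, j], att[0, :, j, i]]) for att in attentions]
    return torch.cat(per_layer)  # 2 * 12 layers * 12 heads = 288 values

# Positions count wordpieces, with [CLS] at index 0; this assumes each word
# below stays a single wordpiece.
vec = model_wide_attention("The keys to the cabinet are on the table .", 2, 6)
print(vec.shape)  # torch.Size([288])
```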
4. How did they evaluate it?
4.1 Grammatical information
Previous work found that the L2 distance between BERT's contextualized embeddings captures the dependency tree.
![Screenshot 2023-02-06 23 29 30](https://user-images.githubusercontent.com/45454055/216997962-4c66f4f2-0bdd-41f9-addc-8fd86688e2de.png)
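As a rough illustration of that prior finding (presumably the structural-probe line of work), the sketch below computes pairwise distances between one layer's token embeddings; the actual probe measures squared distance after a learned linear transformation, which is omitted here:

```python
# Rough illustration only: pairwise L2 distances between one layer's token
# embeddings. The structural-probe result actually measures (squared) distance
# after a learned linear transformation, which is omitted in this sketch.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

inputs = tokenizer("The chef who ran to the store was out of food .", return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).hidden_states[7][0]  # (seq, 768); layer 7 is an arbitrary mid layer

dists = torch.cdist(hidden, hidden)  # (seq, seq) pairwise L2 distances
print(dists.shape)
# A probe would compare these distances (per word pair) against the number of
# edges between the same words in the gold dependency tree.
```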
Figure 3 shows that the average L2 distance of the model-wide attention vectors between two words with a given dependency label detects:
- distant dependencies such as parataxis (the relation between the main verb of a clause and other sentential elements)
- close dependencies such as auxpass (passive-voice information); a toy sketch of the per-label averaging follows this list
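The averaging itself is simple bookkeeping; here is a toy sketch with made-up distance values, just to make the Figure 3 computation concrete:

```python
# Toy sketch of the per-label bookkeeping behind a Figure-3-style analysis:
# average a distance measure over word pairs sharing a dependency label.
# The numbers are made up purely to show the computation, not from the paper.
from collections import defaultdict

pairs = [
    ("parataxis", 9.1), ("parataxis", 8.4),  # loose, long-range relation
    ("auxpass", 1.2), ("auxpass", 1.5),      # tight, local relation
]

totals, counts = defaultdict(float), defaultdict(int)
for label, dist in pairs:
    totals[label] += dist
    counts[label] += 1

for label in totals:
    print(label, totals[label] / counts[label])
# Expected pattern from the figure: parataxis-like relations average larger
# distances than local relations such as auxpass.
```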
4.2 Internal representations
Figure 4 shows that BERT embeddings capture semantic information.
![Screenshot 2023-02-06 23 29 52](https://user-images.githubusercontent.com/45454055/216998036-7a099e53-c2ae-4b7c-8b14-40df1999f905.png)
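A small sketch of one way to poke at this kind of semantic (word-sense) signal: embed the same polysemous word in several contexts and project its contextualized vectors to 2D; the sentences, layer choice, and PCA projection are my own illustrative choices, not necessarily the paper's setup:

```python
# Sketch: collect the contextualized vector of one polysemous word ("bank")
# in different sentences and project to 2D to look for sense clusters.
# Sentences, final-layer choice, and PCA projection are illustrative assumptions.
import torch
from sklearn.decomposition import PCA
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

sentences = [
    "She sat on the bank of the river .",
    "The fisherman slept on the grassy bank .",
    "He deposited the check at the bank .",
    "The bank approved the loan yesterday .",
]

bank_id = tokenizer.convert_tokens_to_ids("bank")
vectors = []
for sent in sentences:
    inputs = tokenizer(sent, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (seq, 768)
    pos = (inputs["input_ids"][0] == bank_id).nonzero(as_tuple=True)[0][0]
    vectors.append(hidden[pos])

coords = PCA(n_components=2).fit_transform(torch.stack(vectors).numpy())
print(coords)  # river-sense and money-sense contexts should separate roughly
```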
5. Is there a discussion?
6. Which paper should be read next?