
Reading: Problems with Cosine as a Measure of Embedding Similarity for High Frequency Words #236


0. Paper

```bibtex
@inproceedings{zhou-etal-2022-problems,
    title = "Problems with Cosine as a Measure of Embedding Similarity for High Frequency Words",
    author = "Zhou, Kaitlyn and Ethayarajh, Kawin and Card, Dallas and Jurafsky, Dan",
    booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)",
    month = may,
    year = "2022",
    address = "Dublin, Ireland",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.acl-short.45",
    doi = "10.18653/v1/2022.acl-short.45",
    pages = "401--423",
    abstract = "Cosine similarity of contextual embeddings is used in many NLP tasks (e.g., QA, IR, MT) and metrics (e.g., BERTScore). Here, we uncover systematic ways in which word similarities estimated by cosine over BERT embeddings are understated and trace this effect to training data frequency. We find that relative to human judgements, cosine similarity underestimates the similarity of frequent words with other instances of the same word or other words across contexts, even after controlling for polysemy and other factors. We conjecture that this underestimation of similarity for high frequency words is due to differences in the representational geometry of high and low frequency words and provide a formal argument for the two-dimensional case.",
}
```

1. What is it?

They analyze how word frequency in the training data distorts cosine similarity over contextual embeddings as a measure of semantic similarity.

2. What is amazing compared to previous works?

They show why pre-trained language models underestimate the semantic similarity of high-frequency words, tracing the effect to training-data frequency and the geometry of the embedding space.

3. Where is the key to technologies and techniques?

4. How did they evaluate it?

4.1 Effect of frequency on semantic similarity (cosine)

[Figure 1: word frequency vs. cosine similarity (WiC)]

- data: Word-in-Context (WiC), word pairs annotated as having the same or a different meaning
- model: pre-trained BERT-base

From Fig. 1, there is a negative correlation between word frequency and cosine similarity. A minimal sketch of this kind of measurement follows.
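As a rough sketch of the Fig. 1 setup (not the authors' released code), one could measure a word's self-similarity across two contexts with HuggingFace `transformers`. The last-layer mean pooling over the word's subword tokens is an assumption here, not necessarily the paper's exact protocol:

```python
# Minimal sketch: cosine similarity of one word's BERT embeddings in two contexts.
# bert-base-uncased matches the paper's model; the pooling and layer choice are
# illustrative assumptions, not the authors' exact protocol.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def word_embedding(sentence: str, word: str) -> torch.Tensor:
    """Mean last-layer vector over the subword tokens of `word` in `sentence`."""
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]  # (seq_len, 768)
    word_ids = tokenizer(word, add_special_tokens=False)["input_ids"]
    ids = enc["input_ids"][0].tolist()
    for i in range(len(ids) - len(word_ids) + 1):  # first occurrence of the span
        if ids[i : i + len(word_ids)] == word_ids:
            return hidden[i : i + len(word_ids)].mean(dim=0)
    raise ValueError(f"{word!r} not found in {sentence!r}")

v1 = word_embedding("She sat on the bank of the river.", "bank")
v2 = word_embedding("He deposited the money at the bank.", "bank")
print(torch.cosine_similarity(v1, v2, dim=0).item())
```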

[Figure 2: model cosine vs. human similarity judgments (SCWS)]

- data: Stanford Contextual Word Similarities (SCWS), pairs of words in context annotated with human similarity ratings on a scale of 1 to 10
- model: pre-trained BERT-base

From Fig. 2, the model predicts lower similarity than human judgments for high-frequency words (a sketch of this comparison is below).
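To make the Fig. 2 comparison concrete, here is a toy sketch of the kind of check involved: rescale human ratings to [0, 1], take the gap to the model's cosine, and compare the gap across frequency buckets. The values, the 10,000-occurrence threshold, and the linear rescaling are all illustrative assumptions; the paper itself controls for polysemy and other factors.

```python
# Sketch: is the gap between human ratings and model cosine larger for frequent
# words? `rows` holds hypothetical (cosine, human_rating, frequency) triples,
# e.g. built from SCWS with the helper above; values below are toy placeholders.
rows = [
    (0.62, 7.5, 120_000),
    (0.71, 6.0, 800),
    (0.55, 8.0, 450_000),
    (0.68, 5.5, 1_200),
]

def gap(cosine: float, rating: float) -> float:
    # Rescale the 1-10 rating to [0, 1]; positive gap => model underestimates.
    return (rating - 1) / 9 - cosine

low = [gap(c, h) for c, h, f in rows if f < 10_000]
high = [gap(c, h) for c, h, f in rows if f >= 10_000]
print("mean gap, low-freq: ", sum(low) / len(low))
print("mean gap, high-freq:", sum(high) / len(high))
```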

4.2 Mechanism of underestimation

They analyze the geometry of vector spaces.

[Figure 3: word frequency vs. bounding hypersphere size]

- data: Wikipedia, with 10 sentences sampled for each word

From Fig. 3, there is a positive correlation between word frequency and the size of the bounding hypersphere of a word's contextual embeddings (see the sketch below).
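Computing an exact minimum enclosing ball in 768 dimensions is costly, so a simple proxy is the maximum distance from the centroid. This sketch assumes that proxy and may differ from the paper's exact measurement:

```python
# Sketch: approximate bounding-hypersphere radius of one word's contextual
# embeddings, using max distance from the centroid as a cheap proxy.
import numpy as np

def bounding_radius(embeddings: np.ndarray) -> float:
    """embeddings: (n_contexts, dim) array of vectors for one word."""
    center = embeddings.mean(axis=0)
    return float(np.linalg.norm(embeddings - center, axis=1).max())

# Usage: stack the vectors from `word_embedding` over the 10 sampled sentences,
# then compare radii of high- vs. low-frequency words.
vecs = np.random.randn(10, 768)  # placeholder for real BERT embeddings
print(bounding_radius(vecs))
```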

[Figure 4: two-dimensional intuition for the underestimation]

Their theoretical intuition is shown in Figure 4: high-frequency words have larger bounding balls, so two embeddings of the same word can point in more different directions, which pushes cosine similarity down (the underestimation).
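A small simulation makes this intuition concrete: sample points uniformly from 2-D disks of growing radius around a fixed center and watch the mean pairwise cosine fall. The center, radii, and uniform sampling are illustrative choices, not the paper's formal construction.

```python
# Sketch of the 2-D intuition: points drawn from a larger ball around the same
# center have a lower expected pairwise cosine similarity.
import numpy as np

rng = np.random.default_rng(0)
center = np.array([5.0, 5.0])

def mean_pairwise_cosine(radius: float, n: int = 2000) -> float:
    # Sample n points uniformly in a disk of the given radius around `center`.
    angles = rng.uniform(0, 2 * np.pi, n)
    radii = radius * np.sqrt(rng.uniform(0, 1, n))
    pts = center + np.stack([radii * np.cos(angles), radii * np.sin(angles)], axis=1)
    pts /= np.linalg.norm(pts, axis=1, keepdims=True)  # unit vectors
    sims = pts @ pts.T                                 # pairwise cosines
    return float(sims[np.triu_indices(n, k=1)].mean())

for r in [0.5, 1.0, 2.0, 4.0]:
    print(f"radius {r}: mean cosine {mean_pairwise_cosine(r):.4f}")
```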

5. Is there a discussion?

6. Which paper should I read next?