allenai / longformer

Longformer: The Long-Document Transformer
https://arxiv.org/abs/2004.05150
Apache License 2.0

Cosine similarity scores between random words are well above 0.9 #259

Open · diogo-p-nunes opened this issue 5 months ago

diogo-p-nunes commented 5 months ago

When I calculate the cosine similarity between the embeddings of random English words (mean pooling, as implemented by sentence-transformers), the scores come out well above 0.9, for reasons I can't quite understand. Can you help me understand why this might be happening?

Here is the code to reproduce:

import numpy as np
import random
from sentence_transformers import SentenceTransformer, util
import seaborn as sns
import matplotlib.pyplot as plt

def randomWords(amount):
    # word list: wget https://raw.githubusercontent.com/dwyl/english-words/master/words_alpha.txt
    with open('words_alpha.txt') as f:
        words = f.read().splitlines()
    # sample with replacement from the full word list
    return [random.choice(words) for _ in range(amount)]

model = SentenceTransformer('allenai/longformer-base-4096')
rand_words = randomWords(300)
rand_embeddings = model.encode(rand_words)

# 300 x 300 matrix of pairwise cosine similarities; the diagonal
# holds self-similarities, which are exactly 1.0
rand_rand_similarities = np.array(util.cos_sim(rand_embeddings, rand_embeddings))

# plot distribution of similarity scores
fig = plt.figure(figsize=(10, 5))
sns.histplot(rand_rand_similarities.flatten(), label='rand-rand')
plt.legend()
plt.show()
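
For reference, since this checkpoint ships no pooling config of its own, my understanding is that SentenceTransformer falls back to averaging the token embeddings under the attention mask. Here is a minimal sketch of what I believe encode() computes, using the plain transformers API (the mean_pool helper is my own, not part of either library):

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('allenai/longformer-base-4096')
hf_model = AutoModel.from_pretrained('allenai/longformer-base-4096')

def mean_pool(texts):
    # pad the batch so it is rectangular
    enc = tokenizer(texts, padding=True, return_tensors='pt')
    with torch.no_grad():
        token_embs = hf_model(**enc).last_hidden_state  # (batch, seq, hidden)
    # zero out padding positions, then average over the real tokens
    mask = enc['attention_mask'].unsqueeze(-1).float()
    return (token_embs * mask).sum(dim=1) / mask.sum(dim=1)

embs = mean_pool(['apple', 'carburetor'])
print(torch.nn.functional.cosine_similarity(embs[0], embs[1], dim=0).item())
# I expect a value well above 0.9, consistent with the histogram below

If I'm misreading what encode() does internally, please correct me.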
diogo-p-nunes commented 5 months ago

Here is the distribution of similarities produced by the code above:

[screenshot: histogram of the rand-rand cosine similarity scores]
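
For comparison, a checkpoint trained specifically for sentence similarity could serve as a baseline here (sentence-transformers/all-MiniLM-L6-v2 is an arbitrary choice on my part); appended to the script above:

baseline = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
baseline_embeddings = baseline.encode(rand_words)
baseline_similarities = np.array(util.cos_sim(baseline_embeddings, baseline_embeddings))

# overlay both score distributions for comparison
sns.histplot(rand_rand_similarities.flatten(), label='longformer-base-4096')
sns.histplot(baseline_similarities.flatten(), label='all-MiniLM-L6-v2')
plt.legend()
plt.show()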