Calculating the cosine similarity between the embeddings of random English words (mean pooling, as implemented by sentence-transformers) gives scores well above 0.9, for reasons I can't quite understand. Can you help me understand why this might be happening?
Here is the code to reproduce:
import torch
import numpy as np
import random
from sentence_transformers import SentenceTransformer, util
import seaborn as sns
import matplotlib.pyplot as plt
def randomWords(amount):
    # wget https://raw.githubusercontent.com/dwyl/english-words/master/words_alpha.txt
    with open('words_alpha.txt') as f:
        words = f.read().splitlines()
    return [random.choice(words) for _ in range(amount)]
model = SentenceTransformer('allenai/longformer-base-4096')
rand_words = randomWords(300)
rand_embeddings = model.encode(rand_words)
rand_rand_similarities = np.array(util.cos_sim(rand_embeddings, rand_embeddings))
# plot distribution of similarity scores
fig = plt.figure(figsize=(10,5))
sns.histplot(rand_rand_similarities.flatten(), label='rand-rand')
plt.legend()
plt.show()
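In case it matters, here is my understanding of the mean-pooling step, done by hand with plain transformers. This is only a sketch under my assumptions: I believe SentenceTransformer falls back to mean pooling over token embeddings when the checkpoint has no sentence-transformers pooling config, and the helper name mean_pooled_embedding is mine, not part of any library.

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('allenai/longformer-base-4096')
hf_model = AutoModel.from_pretrained('allenai/longformer-base-4096')

def mean_pooled_embedding(text):
    inputs = tokenizer(text, return_tensors='pt')
    with torch.no_grad():
        # token-level hidden states: shape (1, seq_len, hidden_dim)
        token_embeddings = hf_model(**inputs).last_hidden_state
    # mask out padding positions before averaging: shape (1, seq_len, 1)
    mask = inputs['attention_mask'].unsqueeze(-1)
    return (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1)

# compare two arbitrary words the same way util.cos_sim does above
a = mean_pooled_embedding('apple')
b = mean_pooled_embedding('carburetor')
print(torch.nn.functional.cosine_similarity(a, b))

If my assumption about the pooling is wrong, that might already be part of the answer.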