davidberenstein1957 / fast-sentence-transformers

Simply, faster, sentence-transformers

output when using normalize_embeddings=True is not correct. Dimension of output is reduced. #3

arrmansa closed this issue 2 years ago

arrmansa commented 2 years ago

Minimal reproducible example

The following code

from fast_sentence_transformers import FastSentenceTransformer as SentenceTransformer

# use any sentence-transformer
encoder = SentenceTransformer("all-MiniLM-L6-v2", device="cpu", quantize=False)

print(encoder.encode("Hello hello, hey, hello hello").shape)

print(encoder.encode("Hello hello, hey, hello hello", normalize_embeddings=True).shape)

has the output

(384,)
()

when it should be

(384,)
(384,)
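
The empty shape comes from the norm call itself: for a single sentence the model encodes a batch of one, and the buggy line reduces that (1, 384) batch to a (1,) array of norms, which becomes a scalar once the batch dimension is squeezed. A minimal sketch of the shape collapse (assuming encode squeezes the batch dimension for single-sentence input):

import numpy as np

emb = np.random.rand(1, 384)              # a batch containing one sentence
bad = np.linalg.norm(emb, ord=2, axis=1)  # keepdims=False -> shape (1,)
print(bad.shape)                          # (1,)
print(bad.squeeze().shape)                # () -- the scalar shape reported above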

Fix

The code on line 292 of FastSentenceTransformer.py is

embeddings = np.linalg.norm(embeddings, ord=2, axis=1, keepdims=False)

I think it should be either

embeddings = embeddings / np.linalg.norm(embeddings, ord=2, axis=1, keepdims=True)

or

norms = np.linalg.norm(embeddings, ord=2, axis=1, keepdims=True)
embeddings = embeddings / np.where(norms < 1e-12, 1e-12, norms)

so that the output is the same as that of the original SentenceTransformer.py.
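
Once patched, the output can be checked directly against the original library. A quick sanity check, assuming both packages are installed and the fix is applied (exact values may differ slightly between the ONNX and PyTorch backends):

import numpy as np
from sentence_transformers import SentenceTransformer as OriginalST
from fast_sentence_transformers import FastSentenceTransformer

original = OriginalST("all-MiniLM-L6-v2", device="cpu")
fast = FastSentenceTransformer("all-MiniLM-L6-v2", device="cpu", quantize=False)

text = "Hello hello, hey, hello hello"
a = original.encode(text, normalize_embeddings=True)
b = fast.encode(text, normalize_embeddings=True)
print(a.shape, b.shape)     # both should be (384,)
print(np.abs(a - b).max())  # should be small; ONNX vs PyTorch may differ slightly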

Supporting code and output

code:

import numpy as np
import torch

np.random.seed(4)
# define a test vector (swap in 1e-14 * np.ones((3, 10)) to test the eps case below)
np_test_vector = np.random.uniform(0, 1, (3, 10))
print("shape of np test vector:", np_test_vector.shape)
torch_test_vector = torch.tensor(np_test_vector)
print("shape of torch test vector:", torch_test_vector.shape)

# From (line 184) https://github.com/UKPLab/sentence-transformers/blob/master/sentence_transformers/SentenceTransformer.py
original_norm = torch.nn.functional.normalize(torch_test_vector, p=2, dim=1)
print("shape of torch normalized:", original_norm.shape)
print("sum of squared is:", torch.sum(original_norm[0]**2))

# From (line 292) https://github.com/Pandora-Intelligence/fast-sentence-transformers/blob/main/fast_sentence_transformers/FastSentenceTransformer.py
# Incorrect shape: this computes the per-row norms, not the normalized embeddings
new_norm = np.linalg.norm(np_test_vector, ord=2, axis=1, keepdims=False)
print("shape of current numpy code:", new_norm.shape)
print("we need to divide the original by these values:", new_norm)

# Fixed code
fixed_norm = np_test_vector / np.linalg.norm(np_test_vector, ord=2, axis=1, keepdims=True)
print("shape from corrected code:", fixed_norm.shape)

# To verify: compare against the torch reference
print("sum absolute difference to verify:", torch.sum(torch.abs(original_norm - torch.tensor(fixed_norm))))

output:

shape of np test vector: (3, 10)
shape of torch test vector: torch.Size([3, 10])
shape of torch normalized: torch.Size([3, 10])
sum of squared is: tensor(1., dtype=torch.float64)
shape of current numpy code: (3,)
we need to divide the original by these values: [2.10508935 1.95157797 1.89443896]
shape from corrected code: (3, 10)
sum absolute difference to verify: tensor(3.6776e-16, dtype=torch.float64)

Other notes

If the eps=1e-12 from the original documentation is required, the following can be used:

norm_vector = np.linalg.norm(np_test_vector, ord=2, axis=1, keepdims=True)
fixed_norm = np_test_vector / np.where(norm_vector < 1e-12, 1e-12, norm_vector)

We can confirm that this works by using np_test_vector = 1e-14 * np.ones((3, 10)) in the supporting code above.
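
With rows that small, plain division still normalizes them to unit length, while torch.nn.functional.normalize clamps the denominator at eps=1e-12, so only the np.where-guarded version reproduces the torch behaviour. A small sketch of that check:

import numpy as np
import torch

tiny = 1e-14 * np.ones((3, 10))  # each row has norm ~3.16e-14, below eps

# plain division: normalizes even near-zero rows to unit length
plain = tiny / np.linalg.norm(tiny, ord=2, axis=1, keepdims=True)

# eps-guarded division, matching torch's clamp at eps=1e-12
norms = np.linalg.norm(tiny, ord=2, axis=1, keepdims=True)
guarded = tiny / np.where(norms < 1e-12, 1e-12, norms)

reference = torch.nn.functional.normalize(torch.tensor(tiny), p=2, dim=1).numpy()
print(np.abs(plain - reference).max())    # large: plain division disagrees
print(np.abs(guarded - reference).max())  # ~0: guarded version matches torch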

davidberenstein1957 commented 2 years ago

@arrmansa thank you for your feedback. Could you open a PR for this?

arrmansa commented 2 years ago

Created https://github.com/Pandora-Intelligence/fast-sentence-transformers/pull/4.