UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

Is the sentence embedding normalized? #233

shaofengzeng opened this issue 4 years ago

shaofengzeng commented 4 years ago

i.e., is ||v|| = 1, where v is the 512-dimensional embedding vector of a sentence?

nreimers commented 4 years ago

No, the embeddings are not normalized.

lefnire commented 4 years ago

I usually see values in roughly the range [-7, 7]. Just use sklearn.preprocessing.minmax_scale(vecs) if you need them normalized.

PhilipMay commented 4 years ago

I prefer sklearn.preprocessing.normalize to normalize the vectors to unit length without changing their direction.
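
A minimal sketch of the two suggestions above, using made-up toy vectors in place of real sentence embeddings:

```python
import numpy as np
from sklearn.preprocessing import minmax_scale, normalize

# Toy batch of "embeddings": 4 vectors, 8 dimensions (stand-ins for real sentence embeddings).
rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 8))

# Option 1: per-feature min-max scaling to [0, 1].
# Each column is shifted/rescaled independently, so vector directions change.
scaled = minmax_scale(emb)

# Option 2: L2-normalize each row to unit length.
# Directions are preserved, only magnitudes change, so ||v|| = 1 for every row.
unit = normalize(emb)  # norm='l2' by default

print(np.linalg.norm(unit, axis=1))            # ~[1. 1. 1. 1.]
print(scaled.min(axis=0), scaled.max(axis=0))  # 0 and 1 per column
```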

lefnire commented 4 years ago

Ok, thanks! (kinda why I dropped that, in case someone had something better)

lefnire commented 3 years ago

Question: why row-normalize in utils.py? My intuition is to column-normalize (assuming you have a decent corpus, so there is some distribution across it). Aren't the embedding columns features in their own right? I'm asking because I see the row-norm method all over the internet too, but I can't find the intuition behind it.

nreimers commented 3 years ago

Hi @lefnire, sentence embeddings are supposed to be independent, i.e., they should only depend on the input text and not on the other texts you also pass for encoding.

Some approaches row-normalize the embeddings, better known as normalizing each vector to unit length (i.e. ||v|| = 1). In that case, cosine similarity is equal to the dot product.

This is important, for example, in approximate nearest neighbor search: some implementations, such as Faiss, expect the vectors to have unit length.

Best Nils Reimers
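
A quick sketch of that equivalence, again with toy vectors standing in for real embeddings (real ones would come from model.encode(...)):

```python
import numpy as np
from sklearn.preprocessing import normalize
from sklearn.metrics.pairwise import cosine_similarity

# Toy stand-ins for sentence embeddings.
rng = np.random.default_rng(42)
emb = rng.normal(size=(3, 16))

# Row-normalize to unit length (||v|| = 1 for each embedding).
unit = normalize(emb)

# After unit-length normalization, cosine similarity equals the plain dot product,
# which is what inner-product indexes (e.g. faiss.IndexFlatIP) rely on.
cos = cosine_similarity(emb)
dot = unit @ unit.T
print(np.allclose(cos, dot))  # True
```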

lefnire commented 3 years ago

Thanks @nreimers !

microcoder-py commented 1 year ago

@nreimers but is there a defined bound for the max/min values of the sentence embeddings? I.e., would scaling to [-1, 1] preserve the distance metrics when cosine similarity is applied?
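
One way to sanity-check this is to compare cosine similarities before and after min-max scaling on toy vectors (random stand-ins below, not real embeddings); since each dimension is shifted and rescaled independently, the angles between vectors change in general:

```python
import numpy as np
from sklearn.preprocessing import minmax_scale
from sklearn.metrics.pairwise import cosine_similarity

# Toy stand-ins for sentence embeddings.
rng = np.random.default_rng(7)
emb = rng.normal(size=(5, 16))

# Cosine similarities before and after per-feature scaling to [-1, 1].
before = cosine_similarity(emb)
after = cosine_similarity(minmax_scale(emb, feature_range=(-1, 1)))

# The per-column shift/rescale changes vector directions, so the similarities differ.
print(np.abs(before - after).max())
```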