shaofengzeng opened 4 years ago
No, the scores are not normalized.
I usually see values roughly in the range [-7, 7]. Use sklearn.preprocessing.minmax_scale(vecs) if you need them normalized.
I prefer sklearn.preprocessing.normalize, which normalizes the vectors to unit length without changing their direction.
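To make the contrast between the two suggestions concrete, here is a small sketch (assuming scikit-learn and NumPy; the toy `vecs` array is made up):

```python
import numpy as np
from sklearn.preprocessing import minmax_scale, normalize

# Toy "embeddings": 3 sentences, 4 dimensions (made-up values).
vecs = np.array([[ 3.0, -7.0,  2.0,  5.0],
                 [-1.0,  4.0, -6.0,  0.5],
                 [ 7.0,  1.0, -2.0, -3.0]])

# minmax_scale rescales each COLUMN (feature) into [0, 1];
# this changes the direction of each row vector.
scaled = minmax_scale(vecs)

# normalize (default norm='l2', axis=1) rescales each ROW to unit
# length; the direction of every vector is preserved.
unit = normalize(vecs)

print(np.linalg.norm(unit, axis=1))  # each row now has norm 1
```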
ok, thanks! (kinda why I dropped that, in case someone had something better)
Question: why row-normalize in utils.py? My intuition is to column-normalize (assuming you have a decent corpus, so each dimension has some distribution). Aren't the embedding columns features in their own right? Asking because I see the row-norm method all over the internet too, but I can't find any intuition for it.
Hi @lefnire
Sentence embeddings are supposed to be independent, i.e., they should depend only on the input text and not on the other texts you also pass for encoding. Column-normalizing would make each embedding depend on the rest of the batch/corpus, which breaks that independence.
Some approaches row-normalize the embeddings, which normalizes each vector to unit length (i.e., ||v|| = 1). In that case, cosine similarity equals the dot product.
This is important, for example, in approximate nearest neighbor search: some implementations, such as Faiss, expect the vectors to have unit length.
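A minimal NumPy check of the claim above: once both vectors are L2-normalized, cosine similarity reduces to a plain dot product (the random 512-dim vectors here are just stand-ins for embeddings):

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = rng.normal(size=512), rng.normal(size=512)

# Cosine similarity on the raw vectors.
cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# L2-normalize each vector to unit length, then take the dot product.
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)
dot = a_unit @ b_unit

assert np.isclose(cosine, dot)  # identical up to floating-point error
```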
Best
Nils Reimers
Thanks @nreimers !
@nreimers but is there a defined bound on the max/min values of the sentence embeddings? I.e., would scaling to [-1, 1] preserve the distance metrics when cosine similarity is applied?
I.e., norm(v) = 1, where v is the (512-dimensional) vector of a sentence
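One way to probe this question numerically (a sketch, using random vectors as stand-ins for embeddings): per-feature min-max scaling to [-1, 1] shifts and rescales each dimension independently, so it generally changes cosine similarities, whereas L2 row-normalization only rescales each vector's length and leaves cosine similarity untouched:

```python
import numpy as np
from sklearn.preprocessing import minmax_scale, normalize

rng = np.random.default_rng(1)
vecs = rng.normal(size=(5, 512))  # toy stand-in for sentence embeddings

def cos(u, v):
    """Cosine similarity between two 1-D vectors."""
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

orig = cos(vecs[0], vecs[1])

# Per-column min-max scaling into [-1, 1]: shifts each feature, so
# vector directions (and hence cosine similarities) change.
after_minmax = cos(*minmax_scale(vecs, feature_range=(-1, 1))[:2])

# Per-row L2 normalization: only the lengths change, cosine is preserved.
after_l2 = cos(*normalize(vecs)[:2])

print(np.isclose(orig, after_l2))      # cosine survives row-normalization
print(np.isclose(orig, after_minmax))  # generally it does not survive min-max
```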