UKPLab / sentence-transformers

Multilingual Sentence & Image Embeddings with BERT
https://www.SBERT.net
Apache License 2.0

Sentence Embeddings, is there a way to reduce negative vector points natively? #999

Open cyriltw opened 3 years ago

cyriltw commented 3 years ago

Context: I'm trying to use sentence embeddings with LDA (the scikit-learn implementation), and the algorithm requires all values in the vectors to be non-negative. I generated sentence embeddings with Sentence Transformers, but the embeddings contain negative values, which causes LDA to break.

To give an idea, this is a sample of an embedding output:

-9.65552405e-04, 1.93892512e-02, 1.43478755e-02, 3.43223996e-02, 2.27693506e-02, 1.48301320e-02, 9.60931331e-02, -4.39086705e-02, 6.71751872e-02, 2.48054438e-03, 1.73567049e-02, 4.11572605e-02, -1.37200188e-02, 4.57149670e-02, -8.68208334e-02, 2.00817361e-02, -4.77351435e-03, 4.09999536e-03, 3.71453092e-02, 2.78526284e-02, 4.01594676e-02, 2.54604034e-03, -6.95970235e-03, -2.80546825e-02, 2.57431027e-02, -4.76267515e-03, 5.37615418e-02, -2.84635816e-02, -2.98351538e-03, 5.11686727e-02, 1.12075582e-01, -3.10318563e-02, 9.31646302e-03, 4.70730141e-02, 2.04491634e-02, 9.88858659e-03,

I am embedding using these few lines:

from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer('paraphrase-mpnet-base-v2')
embedding = embedder.encode(row, normalize_embeddings=True)

Is there a way to obtain embeddings in a positive vector space only? (i.e., natively through the library, without applying any transformations to the embeddings)

nreimers commented 3 years ago

Hi, out of the box this is not possible.

Also, the individual dimensions of dense embeddings are not really interpretable, so I'm not sure that LDA on top of them will work that well.

cyriltw commented 3 years ago

Hi @nreimers

Thanks for your reply. That's true, the individual embedding dimensions are not interpretable. Is it normal for a sentence vector to contain negative values?

nreimers commented 3 years ago

Yes, it is normal. The model uses the whole range of the vector space, not only the small positive sub-region of it.
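You can see this directly by inspecting an embedding (a minimal sketch, using the same paraphrase-mpnet-base-v2 model as above; the example sentence is arbitrary):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('paraphrase-mpnet-base-v2')
# Even with L2 normalization, individual dimensions can be negative.
emb = model.encode("This is an example sentence.", normalize_embeddings=True)

print(emb.min(), emb.max())                           # spans negative and positive values
print((emb < 0).sum(), "of", emb.shape[0], "dimensions are negative")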

kumarsyamala commented 2 years ago

Hi, I am also facing the same problem. I am trying to use Sentence Transformers to create embeddings to fit_transform with LDA, but I get a ValueError: Negative values in data passed to LatentDirichletAllocation.fit.

Is there any other way to use the embeddings from the sentence transformer, or to handle the negative values?

Can we normalize the vectors using MinMaxScaler? Will that affect the results of LDA, or is it fine to rescale the vectors to the range of 0 to 1?
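Something like this is what I have in mind (a rough sketch; the sentences and n_components are just placeholders, and I'm not sure whether LDA topics on rescaled dense embeddings are meaningful, since LDA assumes count-like data):

from sentence_transformers import SentenceTransformer
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import LatentDirichletAllocation

sentences = ["first document ...", "second document ...", "third document ..."]

model = SentenceTransformer('paraphrase-mpnet-base-v2')
embeddings = model.encode(sentences)                  # shape: (n_sentences, 768)

# Rescale every dimension to [0, 1] so LDA no longer rejects negative values.
scaled = MinMaxScaler().fit_transform(embeddings)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
topics = lda.fit_transform(scaled)                    # no ValueError is raised
print(topics.shape)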


thanks