UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

How can I use the arithmetic properties of fine-tuned SBERT embeddings? #1016

Open svjack opened 3 years ago

svjack commented 3 years ago

I fine-tuned some embeddings and performed subtraction over a subset of sentence embeddings. The sentences are similar in the edit-distance sense, and I hoped this would achieve a kind of disambiguation, the way word2vec arithmetic can. I find SBERT's embeddings do this best. How can I use this property properly?
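For concreteness, here is a minimal sketch of the arithmetic I have in mind, using the sentence-transformers API; the model name is only a placeholder for the fine-tuned bi-encoder and the sentences are invented:

```python
from sentence_transformers import SentenceTransformer, util

# Placeholder model name; in practice this would be the fine-tuned bi-encoder.
model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "How do I install pandas on Windows?",   # near-duplicates differing in one entity
    "How do I install numpy on Windows?",
    "How do I read a CSV file with pandas?",
]
emb = model.encode(sentences, convert_to_tensor=True)

# Difference of two near-duplicates: ideally only the distinguishing part
# (the package name) survives in the resulting vector.
diff = emb[0] - emb[1]

# Compare the difference vector against a third sentence with cosine similarity.
print(util.cos_sim(diff, emb[2]))
```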

svjack commented 3 years ago

It seems this can eliminate the effect of certain entities and make the embeddings behave more generally, because I can apply the arithmetic over a subset (iterating over many choices with combinations) and use the highest cosine score as a filter criterion. That would make it a method for generating new embeddings when the number of sentence samples is small, or when the embeddings' semantic similarities are correlated. Can you point me to any material or projects that explore this in depth?
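A rough sketch of that combination-and-filter step; the threshold and the reference embedding are illustrative choices, not an established recipe:

```python
from itertools import combinations

from sentence_transformers import util


def candidate_diffs(embeddings, reference, threshold=0.5):
    """Return (i, j, score, diff) for all pair differences in `embeddings`
    whose cosine score against `reference` passes the illustrative threshold."""
    kept = []
    for i, j in combinations(range(len(embeddings)), 2):
        diff = embeddings[i] - embeddings[j]
        score = util.cos_sim(diff, reference).item()
        if score >= threshold:
            kept.append((i, j, score, diff))
    return kept
```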

nreimers commented 3 years ago

Hi @svjack I'm not aware that someone has tested this. But would be happy if you could share your experiences here, if it works.

svjack commented 3 years ago

> Hi @svjack I'm not aware that someone has tested this. But would be happy if you could share your experiences here, if it works.

In real information-retrieval use, we can use a search engine to extract text spans and then train a bi-encoder on them. But these spans often lack generalization, because they tend to be connected through a few shared tokens (which can be extracted as a kind of entity). If we could replace those tokens using a dictionary of entities of the same kind, the dataset would generalize better.

If this could instead be achieved by arithmetic among embeddings that contain the same entity, it would save some work. It seems to require a pretrained bi-encoder.

The situation would be: if I train on sub-categories A, B, and C, and I compute embedding differences within A, and A has some semantic relatedness to B, then the difference embedding may end up closer to B, keeping the semantics from A but dropping the entity influence from A.

For example, A is about software questions and B is about data (manipulated with that software). Then some difference embeddings from A are related to B (retrieving with cosine similarity, the top-k results are all about A and B), and many of them drop the effect of the software's name.

Using these differences from A to retrieve from B generalizes much better (this requires a bi-encoder pretrained on A, B, and C).
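A hedged sketch of that retrieval setup; the example sentences for A (software questions) and B (data questions) and the model name are all invented for illustration:

```python
from sentence_transformers import SentenceTransformer, util

# Placeholder for the bi-encoder trained on sub-categories A, B and C.
model = SentenceTransformer("all-MiniLM-L6-v2")

# Invented examples: A = software questions, B = data questions.
a_sentences = [
    "How do I merge two tables in pandas?",
    "How do I merge two tables in SQL?",
]
b_sentences = [
    "What is a relational join?",
    "How are customer records combined across tables?",
]

a_emb = model.encode(a_sentences, convert_to_tensor=True)
b_emb = model.encode(b_sentences, convert_to_tensor=True)

# Difference between two A-sentences that share a topic but mention different
# tools; per the idea above, this may reduce the tool-specific effect.
diff = a_emb[0] - a_emb[1]

# Retrieve the closest B-sentences to the difference vector.
hits = util.semantic_search(diff, b_emb, top_k=2)
print(hits)
```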

svjack commented 3 years ago

> Hi @svjack I'm not aware that someone has tested this. But would be happy if you could share your experiences here, if it works.

I also want to consider using the SBERT bi-encoder as a keyword-replacement measure. Suppose I have a dataset with titles and contents. The title is usually the most important feature of a single sample, so if I extract some keywords from the content and substitute them for keywords in the title (entity-aware replacement would be more precise), I can produce many fake titles from content keywords. I can then compute the SBERT cosine similarity between the title and each fake title and use it as a contribution weight from keyword to topic; this might work for an unsupervised classification task.

Since the content may be long text, does a similar construction exist that uses these cosine-weight connections to close the gap between BERT and long-text representations? Or some nlpaug-style method that connects title and content; can SBERT do some work in this domain, supervised or unsupervised? I know the keyword extraction is a discrete step that gradients cannot pass through (though it could be replaced by policy-gradient methods, or sampled with a Bayesian approach).
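A sketch of the fake-title weighting; the frequency-based keyword extractor is deliberately naive and only stands in for a real one, and the title, content, and replaced word are made up:

```python
from collections import Counter

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder model

title = "How to speed up pandas groupby"
content = ("groupby can be slow on large frames; using categorical dtypes, "
           "sorting the frame first, or switching to numpy reductions usually helps")

# Naive keyword extraction: most frequent tokens longer than 4 characters.
tokens = [t.strip(";,.") for t in content.lower().split()]
keywords = [w for w, _ in Counter(t for t in tokens if len(t) > 4).most_common(5)]

# Substitute one title keyword ("pandas", chosen by hand here) with each
# content keyword to build fake titles.
fake_titles = [title.replace("pandas", kw) for kw in keywords]

title_emb = model.encode(title, convert_to_tensor=True)
fake_emb = model.encode(fake_titles, convert_to_tensor=True)

# Cosine scores serve as contribution weights of each keyword to the topic.
for kw, score in zip(keywords, util.cos_sim(title_emb, fake_emb)[0]):
    print(kw, float(score))
```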

svjack commented 3 years ago

I reviewed the code examples in BEIR; it seems you may prefer to use https://github.com/castorini/docTTTTTquery to produce a "title"-like representation from the content.
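For reference, a minimal sketch of generating title-like queries from content with a doc2query T5 model via Hugging Face transformers; the checkpoint name follows the docTTTTTquery README and should be treated as an assumption:

```python
from transformers import T5Tokenizer, T5ForConditionalGeneration

model_name = "castorini/doc2query-t5-base-msmarco"  # assumed checkpoint name
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

content = "groupby can be slow on large frames; categorical dtypes and sorting help"
inputs = tokenizer(content, return_tensors="pt")

# Sample a few short queries that could serve as synthetic titles.
outputs = model.generate(
    **inputs, max_length=32, do_sample=True, top_k=10, num_return_sequences=3
)
for o in outputs:
    print(tokenizer.decode(o, skip_special_tokens=True))
```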