UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

Find n most similar contextual word embeddings to one or multiple sentence embeddings? #237

Open 9j7axvsLuF opened 4 years ago

9j7axvsLuF commented 4 years ago

I've been using sentence-transformers for a little while and I love it - thanks for your great work! Out of the box I've been getting the best results for sentence similarity tasks with the pre-trained 'roberta-large-nli-stsb-mean-tokens' model.

However, I was wondering whether it's possible to find the n most similar (contextual) word embeddings to either one or several sentence embeddings in the space? The use case would be, for example, to get specific keywords for a group of sentences, for topic modelling. For several sentence embeddings I could take the mean vector, but how do I then find the words closest to this mean sentence vector in the vector space without using a predetermined list of words?

Many thanks in advance!

nreimers commented 4 years ago

Hi, you would need to output all token embeddings (you can pass a parameter to the encode function to get the token embeddings instead of the sentence embedding). If you have many sentences / many tokens, you would need to index them with, for example, faiss.

Then, given a vector representing the mean, you can use faiss and retrieve all token embeddings that are close.
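A rough sketch of that pipeline, assuming encode returns one token-embedding tensor per sentence when output_value='token_embeddings' (the model choice, sentences, and k are illustrative; the token embeddings also include rows for special tokens and possibly padding):

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('roberta-large-nli-stsb-mean-tokens')  # illustrative model choice

sentences = ["Topic modelling groups similar documents.",
             "Embeddings map text to vectors."]

# One tensor per sentence, one row per word piece (including special tokens).
token_embeddings = model.encode(sentences, output_value='token_embeddings', convert_to_numpy=False)

# Flatten all word-piece vectors into a single faiss index.
all_tokens = np.vstack([emb.detach().cpu().numpy() for emb in token_embeddings]).astype('float32')
faiss.normalize_L2(all_tokens)                  # normalize so inner product = cosine similarity
index = faiss.IndexFlatIP(all_tokens.shape[1])
index.add(all_tokens)

# Mean of the sentence embeddings as the query vector.
query = model.encode(sentences).mean(axis=0, keepdims=True).astype('float32')
faiss.normalize_L2(query)

scores, ids = index.search(query, 5)            # ids point into the flattened word-piece list
print(scores, ids)
```

The returned ids index into the flattened word-piece list, so you would also need to keep track of which sentence and which piece each row came from to map the hits back to text.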

A challenge will be that BERT & Co. do not stick to words; instead, they use word pieces. A word can be broken down into many pieces. For example, the word 'embeddings' might be broken down into 'emb', 'ed', 'ding', 's', and you would get 4 embeddings, one for each word piece.

The embedding for 'ding' might be close to your vector, but the others might be far away.

9j7axvsLuF commented 4 years ago

I see, thanks a lot!

Might it be better, then, to have a huge list of words of interest that I would encode with sentence-transformers just like I encode sentences, and use faiss to find the word vectors (from this list) closest to the mean sentence vectors? This would bypass the word-piece issue, but I suppose the problem would then be that the resulting word vectors would not be properly contextualized.
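Something like this rough sketch (the candidate words, sentences, and the use of plain cosine similarity instead of faiss are all illustrative):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('roberta-large-nli-stsb-mean-tokens')  # illustrative model choice

# Hypothetical candidate keyword list; in practice this could be thousands of words.
candidate_words = ["politics", "football", "cooking", "holiday", "music"]

sentences = ["We talked about the match last night.", "The striker scored twice."]

word_embeddings = model.encode(candidate_words, convert_to_tensor=True)
sentence_embeddings = model.encode(sentences, convert_to_tensor=True)

# Mean vector of the sentence group.
mean_vector = sentence_embeddings.mean(dim=0, keepdim=True)

# Cosine similarity between the mean vector and every candidate word.
scores = util.pytorch_cos_sim(mean_vector, word_embeddings)[0]

top = scores.argsort(descending=True)[:3]
for idx in top:
    print(candidate_words[int(idx)], float(scores[idx]))
```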

While I'm here, I also wanted to ask your opinion about which of your pretrained models is best for general sentence similarity tasks (on a dataset of domain-general conversations). Do you agree that 'roberta-large-nli-stsb-mean-tokens' is currently the best?

nreimers commented 4 years ago

Yes, there you would lose the contextualization.

It is hard to say which model is currently the best. There cannot be one perfect model: it always depends heavily on your task and on your notion of similarity, and that notion is different for each task.

9j7axvsLuF commented 4 years ago

Thanks Nils, really appreciate your guidance!

If I may, I have two follow-up questions:

  1. Let's say I use the output_value='token_embeddings' argument when calling model.encode; how do I know which word piece each vector corresponds to? For example, say I want to take the mean of all word-piece vectors belonging to a word to get a vector for that word; I need to know which word-piece vectors to pool for that word.

  2. I was wondering whether you intend to release a version of roberta-large fine-tuned on NLI and STSb with WKPooling? I'm curious to check whether WKPooling would result in improvements on my downstream tasks.

nreimers commented 4 years ago

Hi @9j7axvsLuF

1) I sadly haven't found a good way to do this yet. BERT first applies tokenization and then word-piece splitting on the individual tokens. This could be modeled: you would tokenize the text yourself and perform the same word-piece splitting, so you know how many pieces each word produced. However, other models use SentencePiece, which performs tokenization and word splitting in one step; there I have no good solution so far for recovering the mapping.
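A rough sketch of that word-piece counting approach for a WordPiece tokenizer such as BERT's (the tokenizer name is illustrative; the alignment with encode's token embeddings is an assumption and ignores special tokens such as [CLS]/[SEP], which you would need to account for):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')  # illustrative WordPiece model

text = "Sentence embeddings are useful"

# Tokenize word by word so we know how many pieces each word contributes.
word_to_pieces = [(word, tokenizer.tokenize(word)) for word in text.split()]

# Build index ranges into the word-piece sequence (offsets for [CLS] etc. not handled here).
offset = 0
for word, pieces in word_to_pieces:
    print(word, "->", pieces, "covers piece indices", list(range(offset, offset + len(pieces))))
    offset += len(pieces)

# To get one vector per word, you would mean-pool the token embeddings at those indices.
```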

2) With this code snippet, you can create a WK model from any Transformer pre-trained model, i.e., also for roberta-large etc.:

```python
import sys
from sentence_transformers import models, SentenceTransformer

print(sys.argv[1], "=>", sys.argv[2])

# 1) Point the transformer model to the BERT / RoBERTa etc. model you would like to use.
#    Ensure that output_hidden_states is true.
word_embedding_model = models.Transformer(sys.argv[1], model_args={'output_hidden_states': True})

# 2) Add WKPooling
pooling_model = models.WKPooling(word_embedding_model.get_word_embedding_dimension())

# 3) Create a sentence transformer model to glue both models together
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

# Save
model.save(sys.argv[2])
```



Usage: python convert.py path/to/roberta-large-model output/path/for/wk/model
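Once converted, the saved model loads like any other SentenceTransformer (the path below is the placeholder output path from the usage line above):

```python
from sentence_transformers import SentenceTransformer

wk_model = SentenceTransformer('output/path/for/wk/model')  # placeholder path
embeddings = wk_model.encode(["An example sentence.", "Another one."])
print(embeddings.shape)
```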

You might first have to download the fine-tuned roberta-large model and unzip it to get the underlying fine-tuned roberta model that is used.

Best
Nils Reimers
9j7axvsLuF commented 4 years ago

Thanks a lot!

shameelct commented 4 years ago

Hi @9j7axvsLuF @nreimers, are there any good solutions to question 1 asked by @9j7axvsLuF?

alex2awesome commented 4 years ago

Hi @nreimers, if I'm not mistaken, there's a bug in WKPooling at line 41:

https://github.com/UKPLab/sentence-transformers/blob/master/sentence_transformers/models/WKPooling.py#L41

unmask_num[sent_index] returns a float, which cannot be used to slice a tensor. I get the following error when I run your code snippet above.

A fix is simply: int(unmask_num[sent_index])
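A minimal, self-contained illustration of the failure and the cast (made-up shapes and names; this is not the actual WKPooling code):

```python
import torch

token_embeddings = torch.randn(12, 32, 768)   # e.g. (layers, max_tokens, dim), made-up shape
unmask_num = torch.tensor([7.0, 5.0])         # float token counts, as in the reported bug
sent_index = 0

# token_embeddings[:, :unmask_num[sent_index], :]        # fails: a float tensor is not a valid slice bound
sliced = token_embeddings[:, :int(unmask_num[sent_index]), :]  # int() makes it a valid bound
print(sliced.shape)                            # torch.Size([12, 7, 768])
```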

[Screenshot of the resulting error traceback, 2020-07-29]