The shape of word_embeddings is different from that of the keywords. How can I get the embeddings of the keywords? I'm really confused.
That's correct and intended behavior! The reason they differ is that .extract_embeddings
extracts the embeddings of all words in the documents. These are then fed to .extract_keywords
to extract the subset of words that will serve as keywords.
As such, if you want the embeddings of the keywords, you would have to generate them yourself.
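If it helps, here is a minimal sketch of that flow, including generating the keyword embeddings yourself at the end. The documents and the model name are just placeholders, and `.extract_embeddings` assumes a KeyBERT version that provides it:

```python
from keybert import KeyBERT
from sentence_transformers import SentenceTransformer

docs = [
    "Supervised learning is the machine learning task of learning a function.",
    "Unsupervised learning finds structure in unlabeled data.",
]

# Extract embeddings for the documents and for *all* candidate words.
kw_model = KeyBERT(model="all-MiniLM-L6-v2")
doc_embeddings, word_embeddings = kw_model.extract_embeddings(docs)

# Reuse the precomputed embeddings to select the keywords;
# keywords is one list of (keyword, score) tuples per document.
keywords = kw_model.extract_keywords(
    docs, doc_embeddings=doc_embeddings, word_embeddings=word_embeddings
)

# word_embeddings covers the full candidate vocabulary, not just the
# selected keywords, so embed the keywords separately to get their vectors.
st_model = SentenceTransformer("all-MiniLM-L6-v2")
keyword_embeddings = [
    st_model.encode([kw for kw, _ in doc_kws]) for doc_kws in keywords
]
```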
How do I match the keywords to the vectors in word_embeddings? word_embeddings doesn't contain a vector for every word; I'm guessing stop words are removed. As a result, I can't locate the corresponding vectors in word_embeddings based on the order of the keywords in the sentence. That is the point.
I had to use Sentence-BERT to embed the keywords because I saw it used at the bottom of your code. Does this approach make sense? After all, to my knowledge, Sentence-BERT embeds sentences, not words.
It does. Let me start by saying that sentence-transformers is not a single model but a framework that can use different models. In practice, although these models generate embeddings for sentences/paragraphs, that does not mean they cannot or should not embed words. These types of models often generate contextual word/token embeddings and then apply a simple procedure like averaging the token embeddings. As such, they can definitely generate word embeddings, and they do so quite well.
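For instance, here is a sketch of encoding single words or short phrases with sentence-transformers directly; the model name is just an example:

```python
from sentence_transformers import SentenceTransformer

# Any sentence-transformers model can encode arbitrary strings,
# including single words or short phrases.
model = SentenceTransformer("all-MiniLM-L6-v2")
word_vectors = model.encode(["algorithm", "training data", "supervised learning"])
print(word_vectors.shape)  # (3, 384) for this particular model
```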
Thank you MaartenGr, you solved my problem!