MaartenGr / KeyBERT

Minimal keyword extraction with BERT
https://MaartenGr.github.io/KeyBERT/
MIT License

How to get the keywords' embedding? #236

Closed: GengYuIsland closed this issue 2 weeks ago

GengYuIsland commented 2 weeks ago

Given this code:

from keybert import KeyBERT

kw_model = KeyBERT()

# Prepare embeddings
docs = 'Risk communication on the problems regarding endocrine disruptors and release of information on pollutant emission from industrial plants'
# docs = 'ARC-CONSISTENCY AND ARC-CONSISTENCY AGAIN'
doc_embeddings, word_embeddings = kw_model.extract_embeddings(docs)

# Extract keywords without needing to re-calculate embeddings
keywords = kw_model.extract_keywords(docs, doc_embeddings=doc_embeddings, word_embeddings=word_embeddings, stop_words='english')

The shape of word_embeddings is different from the number of keywords. How can I get the embeddings of the keywords? I'm really confused.

MaartenGr commented 2 weeks ago

The shape of word_embeddings is different from the number of keywords.

That's correct and intended behavior! They differ because .extract_embeddings generates embeddings for all candidate words in the documents. Those embeddings are then fed to .extract_keywords, which selects the subset of words that will serve as keywords.

As such, if you want the embeddings of the keywords, you would have to generate them yourself.
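For example, here is a minimal sketch of that, assuming the default sentence-transformers backend (all-MiniLM-L6-v2); the same approach works with any other model you pass to KeyBERT:

from keybert import KeyBERT
from sentence_transformers import SentenceTransformer

# Use the same model for the keywords as for the documents
model = SentenceTransformer("all-MiniLM-L6-v2")
kw_model = KeyBERT(model=model)

docs = 'Risk communication on the problems regarding endocrine disruptors and release of information on pollutant emission from industrial plants'
keywords = kw_model.extract_keywords(docs, stop_words='english')

# .extract_keywords returns (keyword, score) tuples; embed just the keyword strings
keyword_embeddings = model.encode([keyword for keyword, _ in keywords])
print(keyword_embeddings.shape)  # (n_keywords, embedding_dim)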

GengYuIsland commented 2 weeks ago

How do I match the keywords to the vectors in word_embeddings? word_embeddings doesn't contain a vector for every word; I'm guessing the stop words are removed. As a result, I can't locate the corresponding vectors in word_embeddings based on the order in which the keywords appear in the sentence. That is the crux of my problem.
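One possible way to recover the alignment, continuing from the snippet above and assuming word_embeddings follows the vocabulary order of the CountVectorizer that KeyBERT fits internally (with the same parameters in both calls):

from sklearn.feature_extraction.text import CountVectorizer

# Rebuild the candidate vocabulary with the parameters .extract_embeddings
# used (its defaults here); the rows of word_embeddings follow this order.
vectorizer = CountVectorizer(ngram_range=(1, 1), stop_words='english').fit([docs])
vocab = vectorizer.get_feature_names_out()

# Map each candidate word to its row, then look up the extracted keywords
word_to_vector = dict(zip(vocab, word_embeddings))
keyword_vectors = [word_to_vector[keyword] for keyword, _ in keywords]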

GengYuIsland commented 2 weeks ago

I ended up using Sentence-BERT to embed the keywords because I saw it used at the bottom of your code. Does this approach make sense? After all, to my knowledge, Sentence-BERT embeds sentences, not words.

MaartenGr commented 2 weeks ago

I ended up using Sentence-BERT to embed the keywords because I saw it used at the bottom of your code. Does this approach make sense? After all, to my knowledge, Sentence-BERT embeds sentences, not words.

It does. Let me start by saying that sentence-transformers is not a single model but a framework that can use many different models. In practice, although these models generate embeddings for sentences/paragraphs, that does not mean they cannot or should not embed words. These models produce contextual word/token embeddings and typically apply a simple pooling procedure, such as averaging the token embeddings, to get a single vector. As such, they can definitely generate word embeddings, and they do so quite well.
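As an illustration, a short sketch (assuming the all-MiniLM-L6-v2 model, which mean-pools its token embeddings) showing that single words are valid inputs:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

# A single word goes through the same pipeline as a sentence: its contextual
# token embeddings are mean-pooled into one fixed-size vector.
word_vecs = model.encode(["emission", "pollutant", "banana"])
doc_vec = model.encode("Risk communication on pollutant emission from industrial plants")

# Words related to the document should score higher than unrelated ones
print(util.cos_sim(word_vecs, doc_vec))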

GengYuIsland commented 2 weeks ago

Thank you, MaartenGr, you solved my problem!