Matching with Synonyms using KeyLLM OR KeyBERT

MaartenGr / KeyBERT

Minimal keyword extraction with BERT

https://MaartenGr.github.io/KeyBERT/

MIT License

Matching with Synonyms using KeyLLM OR KeyBERT #245

Open ChettakattuA opened 1 month ago

ChettakattuA commented 1 month ago

I have been playing with KeyBERT and KeyLLM for a while now, and here is something I would like to achieve.

If I have a text "CO2 emissions are high these days" and a list of candidate words that contains "Carbon dioxide" but not "CO2", will KeyBERT or KeyLLM find "Carbon dioxide" as a match?

Text = "CO2 emissions are high these days"; the candidate keyword list contains ["Carbon dioxide"] and not "CO2".

Expected output = ["Carbon dioxide"]

MaartenGr commented 1 month ago

If I have a text "CO2 emissions are high these days" and a list of candidate words that contains "Carbon dioxide" but not "CO2", will KeyBERT or KeyLLM find "Carbon dioxide" as a match?

I think it should be possible if you use it as a candidate word. Have you tried it out?

ChettakattuA commented 1 month ago

In the result, the acronym and the synonym are not identified by KeyBERT:

acronym used = CO2 -> carbon dioxide
synonym used = emission -> release
Plural = emission -> emissions

The code used:

from keybert import KeyBERT

kw_model = KeyBERT()
text = "CO2 emissions are high these days"
# Candidates cover an acronym expansion, a synonym, and singular/plural forms
candidates = ["carbon dioxide", "emissions", "release", "emission", "co2"]
keywords = kw_model.extract_keywords(text, candidates=candidates)
print(keywords)

Is there some way to resolve this?

MaartenGr commented 1 month ago

Ah right, that's because candidates must appear in the original document in order to be found. Instead, you might want to use the seed_keywords parameter, which allows you to steer the model towards certain words. Note that you might have to use the global perspective here.

ChettakattuA commented 1 month ago

But do you know why it requires the word itself to appear in the text? From the documentation, I understood that it uses embeddings and cosine similarity. Isn't that enough to match similar words or synonyms between the text and the candidates?

MaartenGr commented 4 weeks ago

@ChettakattuA That depends on what you want. Generally, keywords are derived directly from the article that was written, for SEO reasons. In KeyBERT, candidates are passed to the CountVectorizer as a vocabulary, which means they must appear in the original documents (as the vectorizer is fitted on the original documents):

https://github.com/MaartenGr/KeyBERT/blob/f0f96a6d524ad1403bd847b05c8345cf099ed060/keybert/_model.py#L163-L182