Does keybert pay attention to the syntax

MaartenGr / KeyBERT

Minimal keyword extraction with BERT

https://MaartenGr.github.io/KeyBERT/

MIT License

Does keybert pay attention to the syntax #160

Open 652994331 opened 1 year ago

652994331 commented 1 year ago

KeyBERT is a masterpiece; I really appreciate this great work. Recently I have been looking into using KeyBERT in different language settings, for example Chinese. I modified the word segmentation step and used a multilingual pretrained language model. It worked and gave pretty good results. However, I then found a problem: sometimes KeyBERT returns an important phrase like "car drive" (e.g. I am using ngrams = 2), but the correct word order is "drive car". It seems KeyBERT finds the right information but has some problems with syntax? Or maybe I am using KeyBERT in the wrong way.

MaartenGr commented 1 year ago

Thank you for your kind words! Could you share your code for doing this? Perhaps I can find something happening there.

What KeyBERT is doing is generating n-grams based on their appearance in the texts. For example, if the text is "This is a car I drive" and you remove stopwords, then "car drive" will be created. This vectorization step does not take into account what syntax we as humans would prefer, it merely recreates them from the input and as such is actually closer to what the original document intended.
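To make the "car drive" example concrete, here is a minimal pure-Python sketch of this kind of n-gram generation. It is not KeyBERT's or scikit-learn's actual code: the function name `candidate_ngrams` and the tiny stopword set are assumptions made for illustration only.

```python
import re

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"this", "is", "a", "i"}


def candidate_ngrams(text, n, stopwords=STOPWORDS):
    """Return word n-grams in order of appearance, after dropping stopwords."""
    # Lowercase, split into word tokens, then remove stopwords.
    tokens = [t for t in re.findall(r"\w+", text.lower()) if t not in stopwords]
    # Slide a window of size n over the remaining tokens.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


print(candidate_ngrams("This is a car I drive", n=2))  # ['car drive']
```

The candidate follows the order in which the words appear in the text, so "drive car" can only show up if the document itself puts "drive" before "car".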

It might also be that there is a bug going on somewhere, but if there isn't, the results you get could be explained by the above.

phuclh commented 1 year ago

@MaartenGr Here is an example where I get weird keywords.

from keybert import KeyBERT

doc = """
          ORGANIC DOG FOOD (Free Shipping).
         Organic Dog Food.
         Best Organic Dog Food.
         The 7 Best Organic Dog Foods.
         7 Best Organic Dog Foods [2023 Reviews]
      """

kw_model = KeyBERT()
keywords = kw_model.extract_keywords(doc, seed_keywords=['best organic dry dog food', 'organic dry dog food'], keyphrase_ngram_range=(4, 5))
print(keywords)

The result

[('organic dog food best organic', 0.7935), ('organic dog foods best organic', 0.7927), ('dog food best organic', 0.7727), ('organic dog foods best', 0.7722), ('best organic dog food', 0.771)]

Some weird keywords are "organic dog food best organic", "organic dog foods best", "organic dog foods best organic", and "dog food best organic". The keyword I would most expect is "best organic dog food", which has the lowest score.

I didn't use stop words in this case.

MaartenGr commented 1 year ago

This is most likely a result of the tokenization process of the CountVectorizer. You could use its tokenizer and analyzer parameters to implement a custom solution for your issue.
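One possible shape for such a custom solution is sketched below. It assumes the odd keywords come from n-grams that run across line boundaries (for example, the end of "Organic Dog Food." joined with the start of "Best Organic Dog Food."). The function `per_line_ngrams` is a hypothetical helper, not part of KeyBERT or scikit-learn. Its token pattern mirrors CountVectorizer's default, which drops single-character tokens such as "7".

```python
import re


def per_line_ngrams(doc, ngram_range=(4, 5)):
    """Build word n-grams line by line so that none spans two lines."""
    lo, hi = ngram_range
    ngrams = []
    for line in doc.splitlines():
        # Lowercase and keep tokens of two or more word characters.
        tokens = re.findall(r"\b\w\w+\b", line.lower())
        for n in range(lo, hi + 1):
            for i in range(len(tokens) - n + 1):
                ngrams.append(" ".join(tokens[i:i + n]))
    return ngrams


doc = """
    Organic Dog Food.
    Best Organic Dog Food.
"""
print(per_line_ngrams(doc))  # ['best organic dog food']
```

A callable like this could be passed as `CountVectorizer(analyzer=...)`, and that vectorizer handed to `extract_keywords` through its `vectorizer` argument. When `analyzer` is a callable, CountVectorizer does not apply its own `ngram_range`, so the range would have to be fixed inside the function (for instance with `functools.partial`). This wiring has not been tested against the example above.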