Open 652994331 opened 1 year ago
Thank you for your kind words! Could you share your code for doing this? Perhaps I can find something happening there.
What KeyBERT is doing is generating n-grams based on their appearance in the texts. For example, if the text is "This is a car I drive" and you remove stopwords, then "car drive" will be created. This vectorization step does not take into account what syntax we as humans would prefer; it merely recreates the n-grams from the input, in the order the words appear, and as such stays closer to what the original document contained.
It might also be that there is a bug going on somewhere, but if there isn't, the results you get could be explained by the above.
@MaartenGr Here is an example where I get weird keywords.
from keybert import KeyBERT
doc = """
ORGANIC DOG FOOD (Free Shipping).
Organic Dog Food.
Best Organic Dog Food.
The 7 Best Organic Dog Foods.
7 Best Organic Dog Foods [2023 Reviews]
"""
kw_model = KeyBERT()
keywords = kw_model.extract_keywords(doc, seed_keywords=['best organic dry dog food', 'organic dry dog food'], keyphrase_ngram_range=(4, 5))
print(keywords)
The result
[('organic dog food best organic', 0.7935), ('organic dog foods best organic', 0.7927), ('dog food best organic', 0.7727), ('organic dog foods best', 0.7722), ('best organic dog food', 0.771)]
Some weird keywords are "organic dog food best organic", "organic dog foods best", "organic dog foods best organic", and "dog food best organic". The most expected keyword is "best organic dog food", which has the lowest score.
I didn't use stop words in this case.
This is most likely a result of the tokenization process of the CountVectorizer. You could use the parameters tokenizer and analyzer to implement custom solutions to solve your issue.
KeyBERT is a masterpiece; I really appreciate this great work. Recently I have been looking into using KeyBERT in different language settings. For example, I am using KeyBERT for Chinese: I modified the word segmentation step and used a multilingual pretrained language model. It worked, with pretty good results. However, I then found a problem. Sometimes KeyBERT gives you an important phrase like "car drive" (e.g. when I use n-grams of size 2), but the right syntax is "drive car". It seems KeyBERT finds the right information but has some problems with syntax? Or maybe I am using KeyBERT in the wrong way.