MaartenGr / KeyBERT

Minimal keyword extraction with BERT
https://MaartenGr.github.io/KeyBERT/
MIT License

No scores when candidates parameter is added #149

Open AroundtheGlobe opened 1 year ago

AroundtheGlobe commented 1 year ago

No keywords or scores are returned when you pass the candidates parameter to KeyBERT's extract_keywords().

from keybert import KeyBERT

doc = """
         Kos. Griekenland staat bekend om de prachtige eilanden waar je terecht kan voor zonovergoten vakanties.
      """
kw_model = KeyBERT()
keywords = kw_model.extract_keywords(doc, candidates=['Griekenland', 'Kos'])

This shows the warning message:

\venv\lib\site-packages\sklearn\feature_extraction\text.py:1369: UserWarning: Upper case characters found in vocabulary while 'lowercase' is True. These entries will not be matched with any documents
  warnings.warn(

and the keywords variable is returned empty.
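The warning can be reproduced with scikit-learn alone, without loading a KeyBERT model. The following is a minimal sketch, assuming a shortened version of the document above; the helper name `count_matches` is hypothetical and only serves the illustration. It counts how often the candidates are found with and without lower-casing, and suppresses the UserWarning so the output stays readable.

```python
import warnings

from sklearn.feature_extraction.text import CountVectorizer

# Shortened version of the document from the report (assumption for brevity).
doc = "Kos. Griekenland staat bekend om de prachtige eilanden."
candidates = ["Griekenland", "Kos"]


def count_matches(lowercase):
    """Return how many candidate occurrences CountVectorizer finds in doc."""
    vectorizer = CountVectorizer(vocabulary=candidates, lowercase=lowercase)
    with warnings.catch_warnings():
        # Silence the "Upper case characters found in vocabulary" warning.
        warnings.simplefilter("ignore", UserWarning)
        matrix = vectorizer.fit_transform([doc])
    return int(matrix.sum())


# Default behaviour: the document is lower-cased, the vocabulary is not.
default_matches = count_matches(lowercase=True)
# With lowercase=False the capitalized candidates can match again.
cased_matches = count_matches(lowercase=False)
print(default_matches, cased_matches)
```

With the default `lowercase=True`, the capitalized vocabulary entries never match the lower-cased document tokens, which is consistent with the empty keyword list reported here.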

Without the candidates parameter, it does return a result with scores:

keywords = kw_model.extract_keywords(doc)

Result:

[('griekenland', 0.5619), ('zonovergoten', 0.5024), ('bekend', 0.4398), ('prachtige', 0.4118), ('terecht', 0.4039)]

When I change the candidate words to lower case, or when I add lowercase=False to the CountVectorizer, the words are returned with scores as expected:

keywords = kw_model.extract_keywords(doc, candidates=['griekenland', 'kos'])
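If the candidate list comes from elsewhere and may contain capitalized words, the lower-casing can be done in a small helper before calling extract_keywords. This is only a sketch: `normalize_candidates` is a hypothetical name, not part of KeyBERT. It also drops duplicates, because CountVectorizer raises a ValueError when a vocabulary list contains the same term twice, which can happen once 'Kos' and 'kos' both become 'kos'.

```python
def normalize_candidates(candidates):
    """Lower-case candidates and drop duplicates, keeping first-seen order."""
    seen = set()
    normalized = []
    for candidate in candidates:
        lowered = candidate.lower()
        if lowered not in seen:
            seen.add(lowered)
            normalized.append(lowered)
    return normalized


# 'Kos' and 'kos' collapse into a single lower-case entry.
candidates = normalize_candidates(["Griekenland", "Kos", "kos"])
print(candidates)
```

The resulting list can then be passed as `candidates=candidates` to `kw_model.extract_keywords(doc, ...)`, as in the call above.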

In version 0.6.0 of KeyBERT, it wasn't an issue if the candidate words were capitalized.

This is the CountVectorizer call with lowercase=False added:

count = CountVectorizer(
    ngram_range=keyphrase_ngram_range,
    stop_words=stop_words,
    min_df=min_df,
    vocabulary=candidates,
    lowercase=False,  # added: keep the candidates' original casing
).fit(docs)

Strangely enough, it does work in one of the virtual environments I've been using for a while, but I can't get it to work in newly installed environments, even when I install the same versions of the packages. I suspected the bug was in one of the installed packages, but that does not seem to be the case.

MaartenGr commented 1 year ago

This is indeed a result of the CountVectorizer lower-casing the input. In a previous version of KeyBERT, the candidates were handled much less efficiently; passing them directly to the CountVectorizer makes candidate generation and selection much faster. One way to circumvent this issue would be to lower-case the candidate words automatically in KeyBERT, but I generally prefer to avoid additional processing, in case users want only upper-cased words to be matched.

AroundtheGlobe commented 1 year ago

Thank you for clarifying. I've updated my code as you described, and it now works flawlessly again.