Open AroundtheGlobe opened 1 year ago
This is indeed a result of the CountVectorizer lowercasing its input. In a previous version of KeyBERT the candidates were handled much less efficiently. By passing them directly to the CountVectorizer, candidate generation and selection is much faster. One way to circumvent this issue would be to automatically lowercase the candidate words in KeyBERT, but I typically prefer to avoid additional processing in case users want only upper-cased words to be matched.
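For anyone running into the same thing, the user-side workaround can be sketched roughly as follows. The helper name `lowercase_candidates` is made up for this illustration and is not part of KeyBERT; it only normalizes the list before it is passed as `candidates`.

```python
def lowercase_candidates(candidates):
    """Lowercase candidate words and drop duplicates, keeping first-seen order.

    Hypothetical helper for illustration; not a KeyBERT function.
    """
    seen = set()
    normalized = []
    for word in candidates:
        lowered = word.lower()
        if lowered not in seen:
            seen.add(lowered)
            normalized.append(lowered)
    return normalized


candidates = lowercase_candidates(["Griekenland", "Kos", "kos"])
print(candidates)

# The normalized list can then be passed on, for example:
# keywords = kw_model.extract_keywords(doc, candidates=candidates)
```

This keeps the default lowercasing of the CountVectorizer in place, so it only fits cases where matching does not need to be case-sensitive.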
Thank you for clarifying. I've updated my code as you described, and it now works flawlessly again.
No scores are returned when you provide the candidates parameter for KeyBERT(). It shows a warning message, and the keywords variable is returned empty.
Without the candidates parameter it does return a result with scores:
keywords = kw_model.extract_keywords(doc)
Result: [('griekenland', 0.5619), ('zonovergoten', 0.5024), ('bekend', 0.4398), ('prachtige', 0.4118), ('terecht', 0.4039)]
When I change the candidate words to lowercase, or when I add lowercase=False to the CountVectorizer, it returns the words with scores as expected:
keywords = kw_model.extract_keywords(doc, candidates=['griekenland', 'kos'])
In version 0.6.0 of KeyBERT() it wasn't an issue if the candidate words were capitalized.
Strangely enough, it does seem to work in one of the virtual environments I've been using for a while, but I can't get it to work in newly installed environments, even when I replicate the same package versions. I expected the bug to be in one of the installed packages, but this does not seem to be the case.