Bougeant opened this issue 5 months ago

Maintainer's reply:
Thanks for sharing this extensive description of the issue!
It's not a bug but a feature! Just kidding... but I do wonder (and agree with you) whether this is a bug or just something annoying in specific use cases.
> I'm not sure if this is a bug, if users are supposed to convert docs to lower case in the first place, or whether users must use the `CountVectorizer(lowercase=False)` option.
In all honesty, I highly doubt that many users use `lowercase=False`. Personally, I think the `CountVectorizer` is the unsung hero of BERTopic and deserves a bit more attention. It also means that these kinds of potentially unexpected behaviors need to be explained in more depth.
> This is probably not ideal though because we can get the same words with different casing (Civil vs civil)
Yeah, this is rather tricky. You can prevent this problem by using MMR on top of that, but it would still hurt representations that do not use additional representation models.
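For instance, a sketch of stacking MMR on top of `PartOfSpeech` (assuming the list-based chaining of representation models; the `diversity` value is an arbitrary example):

```python
from bertopic import BERTopic
from bertopic.representation import PartOfSpeech, MaximalMarginalRelevance

# MMR re-ranks candidate keywords by embedding similarity and diversity,
# which tends to filter out near-duplicate casings of the same word
representation_model = [
    PartOfSpeech("en_core_web_sm"),
    MaximalMarginalRelevance(diversity=0.3),
]
topic_model = BERTopic(representation_model=representation_model)
```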
All in all, I think I would not opt for a code-based fix but rather a documentation-based fix. It is very specific behavior that mostly affects the `PartOfSpeech` model, but also others to a certain extent.
What do you think about adding a note to the `PartOfSpeech` documentation page? Perhaps we could also extend the documentation of the `CountVectorizer` (for example here). Would that have been sufficient for you when you initially faced this problem?
---

Original issue description:

As of v0.16.2, there is a potential bug with the `PartOfSpeech` class (very cool feature, by the way!).
Indeed, by default, the `CountVectorizer` converts the tokens to lower case. In the `PartOfSpeech` `extract_topics` method, the `candidate_keywords` are searched in the `words_lookup` (`word_indices = [words_lookup.get(keyword) for keyword in candidate_keywords if words_lookup.get(keyword)]`). However, the `words_lookup` is composed of lower-case words by default (since it is generated using the `CountVectorizer`), while the `candidate_keywords` keep the original casing from the documents.
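To make the mismatch concrete, here is a tiny illustration of that lookup (the dictionary contents are made up for the example):

```python
# words_lookup is built from the CountVectorizer vocabulary, which is
# lower-cased by default; candidate keywords keep the documents' casing
words_lookup = {"kentucky": 0, "indiana": 1, "civil": 2}
candidate_keywords = ["Kentucky", "Indiana", "civil"]

word_indices = [
    words_lookup.get(keyword)
    for keyword in candidate_keywords
    if words_lookup.get(keyword)
]
print(word_indices)  # [2] -- the proper nouns are silently dropped
```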
I'm not sure if this is a bug, if users are supposed to convert docs to lower case in the first place, or whether users must use the `CountVectorizer(lowercase=False)` option.

Here's an example to show what's going on:
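A minimal sketch of such a setup (the corpus and the `en_core_web_sm` spaCy model here are illustrative assumptions, not the exact code from the original report):

```python
from sklearn.datasets import fetch_20newsgroups
from bertopic import BERTopic
from bertopic.representation import PartOfSpeech

# Any corpus containing proper nouns will do; 20 newsgroups is just convenient
docs = fetch_20newsgroups(subset="all", remove=("headers", "footers", "quotes")).data[:2000]

# PartOfSpeech extracts candidate keywords with spaCy, keeping the documents'
# original casing, while the default CountVectorizer lower-cases its vocabulary
representation_model = PartOfSpeech("en_core_web_sm")
topic_model = BERTopic(representation_model=representation_model)

topics, probs = topic_model.fit_transform(docs)
print(topic_model.get_topic_info().head())
```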
In the resulting topics, there are no proper nouns. The third topic even ended up with only 4 keywords, because no additional words matching the part-of-speech pattern could be found to represent it, despite the proper nouns `Kentucky` and `Indiana` being available. This is because, by default, `words_lookup` contains the lower-case `kentucky` and `indiana`, so `Kentucky` and `Indiana` in the `candidate_keywords` cannot be found.

Note that if we use a custom vectorizer that does not convert the tokens to lower case, the problem goes away:
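A sketch of that workaround (again assuming the illustrative setup above; the key change is `lowercase=False`):

```python
from sklearn.feature_extraction.text import CountVectorizer
from bertopic import BERTopic
from bertopic.representation import PartOfSpeech

# Keep the original casing so that candidate keywords such as "Kentucky"
# can be found in the vectorizer's vocabulary
vectorizer_model = CountVectorizer(lowercase=False)

representation_model = PartOfSpeech("en_core_web_sm")
topic_model = BERTopic(
    vectorizer_model=vectorizer_model,
    representation_model=representation_model,
)
topics, probs = topic_model.fit_transform(docs)  # `docs` as in the previous sketch
```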
We then get topics that include both nouns and proper nouns. This is probably not ideal though, because we can get the same word with different casing (Civil vs. civil).