MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

Potential bug with the PartOfSpeech class due to lower case matching #2047

Open Bougeant opened 5 months ago

Bougeant commented 5 months ago

As of v0.16.2, there is a potential bug with the PartOfSpeech class (very cool feature, by the way!).

Indeed, by default the CountVectorizer converts tokens to lower case. In the PartOfSpeech extract_topics method, the candidate_keywords are looked up in words_lookup (word_indices = [words_lookup.get(keyword) for keyword in candidate_keywords if words_lookup.get(keyword)]).

However, words_lookup contains lower-case words by default (since it is built from the CountVectorizer vocabulary), while the candidate_keywords keep the original casing from the documents.
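
To illustrate the mismatch, here is a minimal sketch of that lookup step (the variable names follow the snippet above; the actual internals of extract_topics may differ slightly):

# Sketch of the case mismatch: the vocabulary is lowercased, the candidates are not
words = ["kentucky", "indiana", "cabin"]  # as produced by the default CountVectorizer (lowercase=True)
words_lookup = {word: index for index, word in enumerate(words)}

candidate_keywords = ["Kentucky", "Indiana", "cabin"]  # original casing from the documents

word_indices = [words_lookup.get(keyword) for keyword in candidate_keywords if words_lookup.get(keyword)]
print(word_indices)  # [2] -> only "cabin" is found; "Kentucky" and "Indiana" are silently dropped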

I'm not sure whether this is a bug, whether users are supposed to convert the documents to lower case in the first place, or whether they must use the CountVectorizer(lowercase=False) option.

Here's an example to show what's going on:

import pandas as pd
from bertopic import BERTopic
from bertopic.representation import PartOfSpeech
from sklearn.feature_extraction.text import CountVectorizer

df = pd.DataFrame(
    {
        "text": [
            "Abraham Lincoln (February 12, 1809 – April 15, 1865) was an American lawyer, politician, and statesman who served as the 16th president of the United States from 1861 until his assassination in 1865.",
            "Lincoln led the United States through the American Civil War, defending the nation as a constitutional union, defeating the insurgent Confederacy, playing a major role in the abolition of slavery, expanding the power of the federal government, and modernizing the U.S. economy.",
            "Lincoln was born into poverty in a log cabin in Kentucky and was raised on the frontier, mainly in Indiana.",
            "He was self-educated and became a lawyer, Whig Party leader, Illinois state legislator, and U.S. representative from Illinois.",
        ] * 20
    }
)

patterns = [
    [{"POS": "NOUN"}],
    [{"POS": "PROPN"}],
]

# Uses the default CountVectorizer, which lowercases tokens
topic_model = BERTopic(
    representation_model=PartOfSpeech(pos_patterns=patterns, top_n_words=5),
    min_topic_size=10,
)

topic_model.fit(df["text"])

topic_model.get_topic_info()

This results in the following topics:

['politician', 'president', 'assassination', 'statesman', 'lawyer']
['nation', 'abolition', 'union', 'economy', 'insurgent']
['log', 'cabin', 'frontier', 'poverty', '']
['leader', 'legislator', 'state', 'representative', 'lawyer']

As can be seen, there are no proper nouns in any of these topics. The 3rd topic even ends up with only 4 keywords, because no additional words matching the patterns could be found to represent it, despite the proper nouns Kentucky and Indiana being available. This happens because, by default, words_lookup contains the lower-case kentucky and indiana, so the Kentucky and Indiana in candidate_keywords cannot be found.

Note that if we use a custom vectorizer which does not convert the tokens to lower case, then the problem goes away:

# Custom vectorizer that keeps the original casing of the tokens
topic_model = BERTopic(
    vectorizer_model=CountVectorizer(lowercase=False),
    representation_model=PartOfSpeech(pos_patterns=patterns, top_n_words=5),
    min_topic_size=10,
)

topic_model.fit(df["text"])

topic_model.get_topic_info()

We then get the following topics, which include both nouns and proper nouns:

['April', 'February', 'politician', 'president', 'assassination']
['nation', 'abolition', 'union', 'Civil', 'economy']
['Kentucky', 'log', 'cabin', 'frontier', 'Indiana']
['Illinois', 'Party', 'leader', 'legislator', 'Whig']

This is probably not ideal though, because we can end up with the same word under different casings (Civil vs. civil).
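
If keeping lowercase=False, one possible post-processing step (just a sketch, not something BERTopic does itself) is to collapse topic words that differ only by casing:

# Sketch: drop topic words that differ only by casing, keeping the first occurrence
topic_words = ["Civil", "civil", "nation", "abolition", "union"]
seen = set()
deduplicated = []
for word in topic_words:
    if word.lower() not in seen:
        seen.add(word.lower())
        deduplicated.append(word)
print(deduplicated)  # ['Civil', 'nation', 'abolition', 'union']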

MaartenGr commented 5 months ago

Thanks for sharing this extensive description of the issue!

It's not a bug but a feature! Just kidding... but I do wonder (and agree with you) whether this is a bug or just something annoying in specific use cases.

> I'm not sure whether this is a bug, whether users are supposed to convert the documents to lower case in the first place, or whether they must use the CountVectorizer(lowercase=False) option.

In all honesty, I highly doubt that many users use lowercase=False. Personally, I think the CountVectorizer is the unsung hero of BERTopic and deserves a bit more attention. It also means that these kinds of potentially unexpected behaviors need to be explained a bit more in depth.

> This is probably not ideal though, because we can end up with the same word under different casings (Civil vs. civil).

Yeah, this is rather tricky. You can prevent the duplicate casing by using MMR on top of the PartOfSpeech representation, but that would not help representations that do not use additional representation models.
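
For reference, chaining MMR after PartOfSpeech could look roughly like this (a sketch, not taken from this thread, reusing the patterns defined above; the diversity value is arbitrary):

from bertopic import BERTopic
from bertopic.representation import PartOfSpeech, MaximalMarginalRelevance
from sklearn.feature_extraction.text import CountVectorizer

# Chain MMR after PartOfSpeech so near-duplicate keywords (e.g. "Civil" vs "civil")
# are filtered out of the final representation
representation_model = [
    PartOfSpeech(pos_patterns=patterns, top_n_words=5),
    MaximalMarginalRelevance(diversity=0.3),
]

topic_model = BERTopic(
    vectorizer_model=CountVectorizer(lowercase=False),
    representation_model=representation_model,
    min_topic_size=10,
)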

All in all, I think I would not opt for a code-based fix but rather a documentation-based fix. It is very specific behavior that mostly affects the PartOfSpeech model, but also others to a certain extent.

What do you think about adding a note to the PartOfSpeech documentation page? Perhaps also extend the documentation of the CountVectorizer (for example here). Would that have been sufficient for you when you initially faced this problem?