MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License
6.03k stars, 756 forks

Adding representation model does not change topic_model.get_document_info() results #2069

Closed by SafetyMary 1 month ago

SafetyMary commented 3 months ago

Have you searched existing issues? 🔎

Describe the bug

Adding a representation model does not affect the 'Representation' column returned by topic_model.get_document_info(). To double-check, I deliberately created multiple representations using the same model.

Reproduce the issue:

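from transformers import pipeline
from sklearn.feature_extraction.text import CountVectorizer
from bertopic import BERTopic
from bertopic.representation import TextGeneration

# Representation model
generator = pipeline('text2text-generation', model='../../pretrain_models/flan-t5-base')  # I used offline model here
# The same generator backs the main representation and both aspects
representation_model = {
   "Main": TextGeneration(generator),
   "Aspect1": TextGeneration(generator),
   "Aspect2": TextGeneration(generator),
}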

# Run model
topic_model = BERTopic(nr_topics=10, embedding_model='../../pretrain_models/all-mpnet-base-v2', representation_model=representation_model)  # I used offline model here
vectorizer_model = CountVectorizer(ngram_range=(1, 1), stop_words="english")
topics, probs = topic_model.fit_transform(df['text'].to_list())
topic_model.update_topics(df['text'], vectorizer_model=vectorizer_model)

# show results
topic_model.get_document_info(df['text'])

Expected results: Elements in the 'Representation', 'Aspect1' and 'Aspect2' columns should be identical.

Actual results: Elements in the 'Aspect1' and 'Aspect2' columns are identical, but the 'Representation' column is different and does not seem to have passed through the T5 model.

Reproduction

No response

BERTopic Version

0.16.2

MaartenGr commented 2 months ago

Given the way you used .update_topics, this is expected behavior. Running .update_topics without passing the representation models overwrites them: you left that parameter at its default (None), so the default c-TF-IDF representation is used.

You should do the following instead:

# Representation model
generator = pipeline('text2text-generation', model='../../pretrain_models/flan-t5-base')  # I used offline model here
# The same generator backs the main representation and both aspects
representation_model = {
   "Main": TextGeneration(generator),
   "Aspect1": TextGeneration(generator),
   "Aspect2": TextGeneration(generator),
}

# Run model
topic_model = BERTopic(nr_topics=10, embedding_model='../../pretrain_models/all-mpnet-base-v2', representation_model=representation_model)  # I used offline model here
vectorizer_model = CountVectorizer(ngram_range=(1, 1), stop_words="english")
topics, probs = topic_model.fit_transform(df['text'].to_list())

# Use `representation_model`
topic_model.update_topics(df['text'], vectorizer_model=vectorizer_model, representation_model=representation_model)

# show results
topic_model.get_document_info(df['text'])
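
The overwrite behavior described above can be mimicked with a small stand-alone toy. This is only a sketch of the mechanism as explained in this thread, not BERTopic's source code: `ToyTopicModel`, `_represent`, and `generated_label` are hypothetical names, and the returned strings are placeholders rather than real model output.

```python
# Toy illustration only: NOT BERTopic's implementation.
# All names here (ToyTopicModel, _represent, generated_label) are hypothetical.

class ToyTopicModel:
    def __init__(self, representation_model=None):
        self.representation_model = representation_model
        self.representation = None

    def _represent(self, representation_model):
        # None means "fall back to the default keyword representation"
        if representation_model is None:
            return "default c-TF-IDF keywords"
        return representation_model()

    def fit(self):
        # Fitting uses the model passed to the constructor
        self.representation = self._represent(self.representation_model)
        return self

    def update_topics(self, representation_model=None):
        # The argument is used as given; the constructor's model is not reused
        self.representation_model = representation_model
        self.representation = self._represent(representation_model)
        return self


def generated_label():
    # Stand-in for a text-generation representation model
    return "label from a text-generation model"


model = ToyTopicModel(representation_model=generated_label).fit()
after_fit = model.representation

# Omitting the argument silently switches back to the default
after_update_without = model.update_topics().representation

# Passing it again keeps the generated representation
after_update_with = model.update_topics(
    representation_model=generated_label
).representation

print(after_fit)
print(after_update_without)
print(after_update_with)
```

In this toy, the second call plays the role of the original report (the argument is omitted) and the third call plays the role of the suggested fix (the argument is passed again).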

SafetyMary commented 1 month ago

Sorry for the delayed reply. I have tried your solution and it worked. Thanks a lot.