MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License
6.08k stars 757 forks

Topic representation output incomprehensible #1036

Closed mplockhart closed 1 year ago

mplockhart commented 1 year ago

I believe this will have a huge impact on the way BERTopic works, but as it stands I am unable to obtain useful information from the representation_model.

I am installing fresh in a Google Colab notebook and using the `/content/kurzgesagt_sentences.csv` data set mentioned on the representation pages.

I have been pre-processing the sentences by computing embeddings and passing them to the topic model (`embeddings=embeddings`).

The code I have been using is:

from bertopic.representation import TextGeneration

# Create your representation model
representation_model = TextGeneration('gpt2')

# Use the representation model in BERTopic on top of the default pipeline
topic_model = BERTopic(representation_model=representation_model, verbose=True)
topics, _ = topic_model.fit_transform(docs, embeddings=embeddings)

Below is the outcome of the above code. Is there something I have missed in running this/is it possible to provide a more complete example? Maybe in a notebook?

[screenshot of the topic representation output]

Best Mike

MaartenGr commented 1 year ago

It is difficult to say what exactly is happening here. However, I think there are two things of influence. First, you mention that you did some preprocessing of the documents before passing them to BERTopic. It might be that there is something there but it is difficult to say without seeing the full code. Could you share your entire code for getting this output?

Second, "gpt2" might not have been the best example here, as there are a couple of models that perform significantly better. For example, the google/flan-t5 models are much better at generating text, and I have had a much better experience with their output. The following often works well for me:

from transformers import pipeline
from bertopic.representation import TextGeneration

# `google/flan-t5-xl` also works great but requires more compute
generator = pipeline('text2text-generation', model='google/flan-t5-large')  
representation_model = TextGeneration(generator)
mplockhart commented 1 year ago

With regard to the pre-processing, I only meant the embedding step:

# random.seed(42)
random.Random(42).shuffle(docs)

from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')

embeddings = model.encode(docs, show_progress_bar=True)
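For what it's worth, the seeded shuffle is deterministic, so the embeddings computed afterwards stay aligned with the shuffled `docs`. A quick standalone sanity check (the sentences below are made up for illustration):

```python
import random

# random.Random(42).shuffle produces the same order on every run,
# so embeddings encoded from the shuffled `docs` stay aligned with them.
docs_a = ["sentence one", "sentence two", "sentence three", "sentence four"]
docs_b = list(docs_a)

random.Random(42).shuffle(docs_a)
random.Random(42).shuffle(docs_b)

assert docs_a == docs_b  # identical order with the same seed
```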

But I appreciate that "pre-processing" could have been read differently.

I have tried the modelling with the Google model above, but found high levels of repetition in the results. Maybe my solution is simply to try alternative models for my desired topic headers.
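One workaround I am considering is post-processing the labels instead of swapping models. This is only a sketch, not BERTopic API; `labels` and `top_keyword` are made-up stand-ins for the generated labels and the default keyword representation:

```python
# Hypothetical post-processing: when the generator returns the same label
# for several topics, append each topic's top keyword to disambiguate.
labels = {0: "beef", 1: "beef", 2: "climate"}
top_keyword = {0: "meat", 1: "emissions", 2: "warming"}

seen = set()
resolved = {}
for topic, label in labels.items():
    if label in seen:
        resolved[topic] = f"{label} ({top_keyword[topic]})"
    else:
        resolved[topic] = label
        seen.add(label)

# resolved → {0: 'beef', 1: 'beef (emissions)', 2: 'climate'}
```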

Thank you for your assistance.

rubypnchl commented 1 year ago


I am getting better topic labels using the Google model, but how can we validate the retrieved labels? For example:

the default representation is: meat | organic | food | beef | emissions | eat | of | eating | is

the transformer representation is: beef

While using a representation model (the Google model or gpt2), we lose the default representation, i.e. the topic keywords. In that case, how can we validate that our topic labels actually represent the topics? And how can we retrieve the topic keywords as well? I need these keywords for further processing.

MaartenGr commented 1 year ago

@rubypnchl You could take the topic labels generated by 'google/flan-t5-large' and set them as custom labels using the guide here. After having done that, you can revert the topic representation back to keywords using the guide here for updating the topic representation after training the model.

It will be something like this:

from bertopic import BERTopic
from transformers import pipeline
from bertopic.representation import TextGeneration, KeyBERTInspired

# `google/flan-t5-xl` also works great but requires more compute
generator = pipeline('text2text-generation', model='google/flan-t5-large')  
representation_model = TextGeneration(generator)

# Fit model
topic_model = BERTopic(representation_model=representation_model).fit(docs)

# Create and set custom labels
labels = [topic_model.get_topic(topic)[0][0] for topic in sorted(set(topic_model.topics_))]
topic_model.set_topic_labels(labels)

# Then, revert update representation back to keywords and use KeyBERT instead
topic_model.update_topics(docs, representation_model=KeyBERTInspired())

Another way is to adjust the code here such that the generated label with 'google/flan-t5-large' is simply added to the keyword list.
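A sketch of that merge, using made-up data; `generated_labels` and `keyword_topics` are hypothetical stand-ins for the TextGeneration output and the default c-TF-IDF keywords:

```python
# Hypothetical merge: prepend the generated label to each topic's keyword
# list so both representations survive. Data is made up for illustration.
generated_labels = {0: "beef"}
keyword_topics = {0: [("meat", 0.42), ("organic", 0.31), ("food", 0.25)]}

merged = {
    topic: [(generated_labels[topic], 1.0)] + words
    for topic, words in keyword_topics.items()
}

# merged[0] now starts with ("beef", 1.0), followed by the original keywords
```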