Closed mplockhart closed 1 year ago
It is difficult to say what exactly is happening here. However, I think there are two things at play. First, you mention that you did some preprocessing of the documents before passing them to BERTopic. There might be something there, but it is difficult to say without seeing the full code. Could you share your entire code for getting this output?
Second, "gpt2" might not have been the best example here, as there are a couple of models that perform significantly better. For example, the google/flan-t5
models are much better at generating text, and I have had a much better experience with their output. The following often works well for me:
from transformers import pipeline
from bertopic.representation import TextGeneration
# `google/flan-t5-xl` also works great but requires more compute
generator = pipeline('text2text-generation', model='google/flan-t5-large')
representation_model = TextGeneration(generator)
With regard to the pre-processing, I only meant it in terms of embedding:
import random
from sentence_transformers import SentenceTransformer

# Shuffle the documents with a fixed seed for reproducibility
random.Random(42).shuffle(docs)

model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(docs, show_progress_bar=True)
But I appreciate that could have had an alternative meaning.
I have tried the modelling with the Google model above, but found high levels of repetition in the results. Maybe my solution is just to find alternative models to use for my desired topic headers.
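One possible workaround for the repeated labels (just a sketch, not something built into BERTopic): whenever a generated label has already been used for an earlier topic, fall back to that topic's top keyword instead. The helper below is hypothetical and assumes the generated labels and each topic's top keyword have already been collected into two parallel lists:

```python
def dedupe_labels(generated_labels, fallback_keywords):
    """Replace repeated generated labels with the topic's own top keyword.

    `generated_labels` and `fallback_keywords` are assumed to be parallel
    lists, one entry per topic.
    """
    seen = set()
    deduped = []
    for label, keyword in zip(generated_labels, fallback_keywords):
        if label in seen:
            deduped.append(keyword)  # repeated label -> use the keyword instead
        else:
            deduped.append(label)
            seen.add(label)
    return deduped
```

The deduplicated list could then be passed to topic_model.set_topic_labels().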
Thank you for your assistance.
I am getting better topic labels using the Google model, but how can we validate the retrieved labels? For example:
the default representation is: meat | organic | food | beef | emissions | eat | of | eating | is
the transformer representation is: beef
But while using a representation model (e.g., google/flan-t5 or gpt2), we lose the default representation, i.e., the topic keywords. In that case, how can we validate that our topic labels actually represent the topics? And how can we retrieve the topic keywords as well? I need these keywords for further processing.
@rubypnchl You could take the topic labels generated by 'google/flan-t5-large'
and set them as custom labels using the guide here. After having done that, you can update the topic representation and replace the generated labels with keywords by following the guide here for updating the topic representation after training the model.
It will be something like this:
from transformers import pipeline
from bertopic import BERTopic
from bertopic.representation import TextGeneration, KeyBERTInspired
# `google/flan-t5-xl` also works great but requires more compute
generator = pipeline('text2text-generation', model='google/flan-t5-large')
representation_model = TextGeneration(generator)
# Fit model
topic_model = BERTopic(representation_model=representation_model).fit(docs)
# Create and set custom labels
labels = [topic_model.get_topic(topic)[0][0] for topic in sorted(set(topic_model.topics_))]
topic_model.set_topic_labels(labels)
# Then, revert the representation back to keywords using KeyBERTInspired
topic_model.update_topics(docs, representation_model=KeyBERTInspired())
Another way is to adjust the code here such that the generated label with 'google/flan-t5-large'
is simply added to the keyword list.
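That adjustment could be sketched roughly as follows (a hypothetical helper, not part of the BERTopic API): take each topic's generated label and prepend it to that topic's keyword list, so both the label and the keywords stay available for downstream processing.

```python
def merge_label_into_keywords(keywords_per_topic, label_per_topic):
    """Prepend the generated label to each topic's keyword list.

    `keywords_per_topic` maps topic id -> list of (word, score) tuples,
    in the shape returned by `topic_model.get_topic(topic_id)`;
    `label_per_topic` maps topic id -> label generated by flan-t5.
    Both inputs are assumed to have been collected beforehand.
    """
    merged = {}
    for topic_id, keywords in keywords_per_topic.items():
        words = [word for word, _score in keywords]
        label = label_per_topic.get(topic_id)
        if label and label not in words:
            words = [label] + words  # put the generated label first
        merged[topic_id] = words
    return merged
```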
I believe this will have a huge impact on the way BERTopic works, but as it stands I am unable to obtain useful information from the representation_model. I am installing fresh on a Google Colab notebook and using the /content/kurzgesagt_sentences.csv data set mentioned on the representation pages. I have been pre-processing the sentences using the embeddings method and passing the embeddings to the topic model (embeddings=embeddings). The code I have been using is:
from bertopic import BERTopic
from bertopic.representation import TextGeneration
# Create your representation model
representation_model = TextGeneration('gpt2')
# Use the representation model in BERTopic on top of the default pipeline
topic_model = BERTopic(representation_model=representation_model, verbose=True)
topics, _ = topic_model.fit_transform(docs, embeddings=embeddings)
Below is the outcome of the above code. Is there something I have missed in running this, or is it possible to provide a more complete example? Maybe in a notebook?
Best,
Mike