Closed: zynos closed this issue 3 years ago
Hmmm, could you share the code you were using? Also, how many clusters did you have?
Sure:
from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer

# abstracts: list of document strings; timestamps: one timestamp per document
topic_model = BERTopic(vectorizer_model=CountVectorizer(stop_words="english"))
topics, probs = topic_model.fit_transform(abstracts)
print(topic_model.get_topic_info())
fig = topic_model.visualize_topics()
fig.show()
topics_over_time = topic_model.topics_over_time(abstracts, topics, timestamps, nr_bins=20)
fig2 = topic_model.visualize_topics_over_time(topics_over_time, top_n_topics=6)
fig2.show()
And 7 clusters
Topics
   Topic  Count
0      0   1772
1      1    424
2      2    170
3      3     46
4      4     43
5      5     34
6     -1     21
7      6     18
Does the cluster exist in the topics_over_time variable? It might be that I tried to ignore the most frequent topic in the visualization (which typically should be -1 if enough topics were created) but instead ignored 0.
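To make the suspected cause concrete, here is a minimal sketch of how such an assumption could drop topic 0. It is not BERTopic's actual implementation: both function names and the selection logic are hypothetical, and the counts are simply the ones posted above.

```python
# Topic sizes copied from the get_topic_info() output above.
topic_counts = {0: 1772, 1: 424, 2: 170, 3: 46, 4: 43, 5: 34, -1: 21, 6: 18}


def top_topics_assuming_outliers_largest(counts, top_n):
    """Hypothetical selection: skip the most frequent topic, assuming it is -1."""
    by_size = sorted(counts, key=counts.get, reverse=True)
    return by_size[1:top_n + 1]


def top_topics_excluding_outliers(counts, top_n):
    """Hypothetical alternative: remove -1 explicitly, then rank by size."""
    by_size = sorted((t for t in counts if t != -1), key=counts.get, reverse=True)
    return by_size[:top_n]


naive = top_topics_assuming_outliers_largest(topic_counts, top_n=6)
explicit = top_topics_excluding_outliers(topic_counts, top_n=6)
print("assuming -1 is largest:", naive)
print("excluding -1 explicitly:", explicit)
```

With these counts the first variant loses topic 0 and keeps -1, which matches the reported symptom; the second variant keeps topic 0.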
Seeing the topics though, I would highly advise you to lower min_topic_size or to increase the amount of data that you have. There is too little data to properly model topics_over_time or topics_per_class.
For example, with 20 bins and a topic that only has 18 documents, it is impossible to properly generate topic representations within a single bin as there is a good chance it will contain only 1 document.
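The arithmetic behind that can be sketched as follows. This assumes, purely for illustration, that each topic's documents are spread evenly over the bins; real timestamps will be less uniform, so some bins will hold even fewer documents.

```python
# Topic sizes copied from the get_topic_info() output above.
topic_counts = {0: 1772, 1: 424, 2: 170, 3: 46, 4: 43, 5: 34, -1: 21, 6: 18}
nr_bins = 20

# Average number of documents each topic contributes to a single time bin.
avg_per_bin = {topic: count / nr_bins for topic, count in topic_counts.items()}

for topic, avg in avg_per_bin.items():
    print(f"topic {topic:>2}: {topic_counts[topic]:>4} docs, {avg:.1f} per bin on average")
```

Topic 6 averages fewer than one document per bin, and topics 3 to 5 only about two.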
Also, I think you can prevent this by removing top_n_topics=6.
You are right, removing top_n_topics=6 solves the problem for now. 👍 But what if I have 30 clusters and I want to display only the first 10? Will cluster "0" then be removed again?
That is highly unlikely. More clusters means more outliers, which in turn results in a larger -1 class, and that class is the one that would then be removed.
Having said that, I will see if I can fix this in the next release, although it is likely that this assumption (which has held true thus far for a larger number of clusters) can be found in multiple places.
Ok, that makes sense. So probably this will only affect smaller datasets, but I hope you also care about them 😃 Thank you so far!
Of course! Do note though that this isn't necessarily an issue with smaller datasets but with a small number of generated topics. Typically, a low number of topics indicates that some improvement is needed in the way BERTopic is applied, for example by setting a lower min_topic_size or even by converting your documents into sentences to increase the number of documents that you have available.
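A rough sketch of the sentence-splitting idea, assuming a naive regex splitter is good enough for a first experiment (a proper sentence tokenizer would handle abbreviations and similar cases better). The helper names and the example data are hypothetical; each sentence inherits the timestamp of its source document so the result can still be used for topics over time.

```python
import re


def split_into_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def explode_documents(documents, timestamps):
    """Turn each document into sentences, repeating its timestamp per sentence."""
    sentences, sentence_timestamps = [], []
    for doc, ts in zip(documents, timestamps):
        for sentence in split_into_sentences(doc):
            sentences.append(sentence)
            sentence_timestamps.append(ts)
    return sentences, sentence_timestamps


# Made-up stand-ins for the real abstracts and timestamps.
example_abstracts = [
    "Topic models group documents. They need enough data!",
    "Short abstract.",
]
example_timestamps = [2019, 2020]

sentences, sentence_timestamps = explode_documents(example_abstracts, example_timestamps)
print(sentences)
print(sentence_timestamps)
```

The resulting sentences and sentence_timestamps lists would then take the place of the original documents and timestamps when fitting the model.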
I guess I have some experiments ahead of me, thanks for the suggestions!
When checking the Topics over time graph, the first cluster is missing. The displayed clusters are [-1, 1, ..., N]. Cluster "0" is missing, but it exists when printing topic_model.get_topic_info().