MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

v0.9 Topics over time - first cluster is missing #191

Closed: zynos closed this issue 3 years ago

zynos commented 3 years ago

When I check the topics-over-time graph, the first cluster is missing. The displayed clusters are [-1, 1, ..., N]. Cluster "0" is absent from the graph, but it does appear when I print topic_model.get_topic_info().

MaartenGr commented 3 years ago

Hmmm, could you share the code you were using? Also, how many clusters did you have?

zynos commented 3 years ago

Sure:

from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer

# abstracts: list of document strings; timestamps: one timestamp per document
topic_model = BERTopic(vectorizer_model=CountVectorizer(stop_words="english"))
topics, probs = topic_model.fit_transform(abstracts)
print(topic_model.get_topic_info())

fig = topic_model.visualize_topics()
fig.show()

topics_over_time = topic_model.topics_over_time(abstracts, topics, timestamps, nr_bins=20)
fig2 = topic_model.visualize_topics_over_time(topics_over_time, top_n_topics=6)

And 7 clusters (plus the -1 class):

Topics
   Topic  Count
0      0   1772
1      1    424
2      2    170
3      3     46
4      4     43
5      5     34
6     -1     21
7      6     18

MaartenGr commented 3 years ago

Does the cluster exist in the topics_over_time variable? It might be that the visualization tries to ignore the most frequent topic (which should typically be -1 if enough topics were created) and, in your case, ignored 0 instead.

Seeing the topics, though, I would highly advise you to lower min_topic_size or to increase the amount of data that you have. There is too little data to properly model topics_over_time or topics_per_class.

For example, with 20 bins and a topic that only has 18 documents, it is impossible to properly generate topic representations within a single bin as there is a good chance it will contain only 1 document.
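
The arithmetic behind that point can be sketched in a few lines. This is only a rough illustration: it takes the topic sizes from the table posted above and assumes documents are spread evenly over the 20 bins, which real timestamps rarely are. The helper name mean_docs_per_bin is made up for this example and is not part of BERTopic.

```python
# Topic sizes as reported in the get_topic_info() output earlier in this thread.
topic_counts = {0: 1772, 1: 424, 2: 170, 3: 46, 4: 43, 5: 34, -1: 21, 6: 18}
NR_BINS = 20  # same value as nr_bins in the topics_over_time call above


def mean_docs_per_bin(n_docs: int, nr_bins: int) -> float:
    """Average documents per time bin, ASSUMING an even spread over time."""
    return n_docs / nr_bins


for topic, count in topic_counts.items():
    avg = mean_docs_per_bin(count, NR_BINS)
    print(f"topic {topic:>2}: {count:>4} docs, about {avg:.1f} per bin")
```

Under that assumption, topic 6 (18 documents) averages fewer than one document per bin, so many bins would have nothing to build a topic representation from.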

MaartenGr commented 3 years ago

Also, I think you can prevent this by removing top_n_topics=6.

zynos commented 3 years ago

You are right, removing top_n_topics=6 solves the problem for now. 👍 But what if I have 30 clusters and want to display only the first 10? Will cluster "0" then be removed again?

MaartenGr commented 3 years ago

That is highly unlikely. More clusters means more outliers, which in turn results in a larger -1 class, and that class is the one that would then be removed.

Having said that, I will see if I can fix this in the next release, although it is likely that this assumption (which has held true thus far for larger numbers of clusters) can be found in multiple places.
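
To make the assumption concrete, here is a minimal sketch of the suspected behaviour next to one possible alternative. It is not BERTopic's actual code: both function names are hypothetical, and the topic sizes are the ones from the table posted earlier in this thread.

```python
# Topic sizes from the get_topic_info() output above (topic id -> count).
topic_counts = {0: 1772, 1: 424, 2: 170, 3: 46, 4: 43, 5: 34, -1: 21, 6: 18}


def top_topics_skip_most_frequent(counts, top_n):
    """Suspected behaviour (hypothetical): drop the largest topic,
    on the assumption that the largest topic is the outlier class -1."""
    ranked = sorted(counts, key=counts.get, reverse=True)
    return ranked[1:top_n + 1]


def top_topics_skip_outliers(counts, top_n):
    """Possible alternative (hypothetical): drop -1 explicitly,
    wherever it happens to rank."""
    ranked = sorted(counts, key=counts.get, reverse=True)
    return [topic for topic in ranked if topic != -1][:top_n]


print(top_topics_skip_most_frequent(topic_counts, 6))
print(top_topics_skip_outliers(topic_counts, 6))
```

With these counts and top_n=6, the first function selects topics 1 to 5 plus -1 and leaves out topic 0, which matches what was reported at the top of this issue. The second selects topics 0 to 5. When -1 really is the largest class, the two functions return the same topics.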

zynos commented 3 years ago

Ok, that makes sense. So probably this will only affect smaller datasets, but I hope you also care about them 😃 Thank you so far!

MaartenGr commented 3 years ago

Of course! Do note, though, that this isn't necessarily an issue with smaller datasets, but with a small number of generated topics. Typically, a low number of topics indicates that some improvement is needed in the way BERTopic is applied, for example by setting a lower min_topic_size or even by converting your documents into sentences to increase the number of documents that you have available.
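
The sentence-splitting suggestion could look roughly like the sketch below. It uses a naive regular-expression splitter from the standard library rather than a proper sentence tokenizer, and the helper name split_into_sentences is made up for this example. Each sentence inherits the timestamp of its source document so the result can still be passed to topics_over_time.

```python
import re


def split_into_sentences(docs, timestamps):
    """Split each document into sentences and repeat its timestamp per sentence.

    Naive rule (an assumption of this sketch): a sentence ends at '.', '!' or '?'
    followed by whitespace. Abbreviations such as 'e.g.' will be split wrongly.
    """
    sentences, sentence_timestamps = [], []
    for doc, timestamp in zip(docs, timestamps):
        for sentence in re.split(r"(?<=[.!?])\s+", doc.strip()):
            if sentence:  # skip empty strings from empty documents
                sentences.append(sentence)
                sentence_timestamps.append(timestamp)
    return sentences, sentence_timestamps


docs = ["Topic models are useful. They need enough data!", "One sentence only"]
years = [2019, 2020]
sentences, sentence_years = split_into_sentences(docs, years)
print(len(docs), "documents ->", len(sentences), "sentences")
```

The two returned lists stay aligned, so they can stand in for the original documents and timestamps when fitting the model and calling topics_over_time.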

zynos commented 3 years ago

I guess I have some experiments ahead of me, thanks for the suggestions!