Closed · caojianneng closed this issue 1 year ago
May I know how this happens? What is the logic behind it?
This is related to the underlying clustering algorithm, HDBSCAN, which generates outliers. You can find more about that algorithm here and how it relates to BERTopic here. In short, several parameters control the way HDBSCAN works and can increase or decrease the number of outliers. The thing with outliers is that there is a gray area, a kind of confidence interval: a data point might be an outlier from one perspective but not from another, and the same principle extends to these kinds of algorithms.

Having said that, there are quite a few things you can do to reduce the number of outliers. You can find an overview of techniques for reducing outliers in BERTopic here; the first and third tips are especially worth looking at. Just playing around with the HDBSCAN parameters and inspecting the results yourself helps greatly in understanding the algorithm. If you do not want any outliers at all, or you are more familiar with an algorithm like k-Means, you can use that instead; a minimal sketch of both routes follows below.
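To make those last two options concrete, here is a minimal sketch of both routes. BERTopic accepts any clustering model with a `fit`/`predict` interface via its `hdbscan_model` argument; the parameter values below are illustrative assumptions, not recommendations:

```python
from bertopic import BERTopic
from hdbscan import HDBSCAN
from sklearn.cluster import KMeans

# Route 1: tune HDBSCAN yourself. min_cluster_size and min_samples are
# the two parameters that most strongly affect how many documents end
# up in the -1 (outlier) topic; the values here are only illustrative.
hdbscan_model = HDBSCAN(min_cluster_size=50, min_samples=10,
                        metric="euclidean", prediction_data=True)
topic_model = BERTopic(hdbscan_model=hdbscan_model)

# Route 2: swap in k-Means, which assigns every document to a cluster,
# so no outliers are produced at all.
topic_model = BERTopic(hdbscan_model=KMeans(n_clusters=20))
```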
Hi @MaartenGr - Thanks a lot for your reply. I will study the links you shared.
Hi @MaartenGr - Thanks a lot for sharing. My experimental experience is consistent with the descriptions on your webpages: 1) HDBSCAN produces clusters/topics that are easier to explain; 2) k-means partitions the data without outliers, but the resulting clusters/topics are harder to explain.
@caojianneng HDBSCAN is a great way to reliably discover topics. However, as you've found, it can produce quite a few uncategorized documents. I've spent a fair amount of time tuning HDBSCAN with UMAP embeddings and packaged up a tool to do this for BERTopic - TopicTuner. I'm eager to have people try it out - I've tried to make it as accessible and easy to use as possible. I've included a notebook that will quickly step you through its features and am happy to answer any questions you might have. In my experience, this approach has let me significantly shrink the -1 (outlier) topic and improve clustering overall.
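For a rough idea of what this kind of tuning involves (this is not TopicTuner's API, just a hand-rolled illustration), one can sweep HDBSCAN's parameters over pre-computed embeddings and count how many documents land in -1. Here `embeddings` is assumed to be the UMAP-reduced document embeddings that BERTopic would normally cluster, and the grid values are arbitrary:

```python
import numpy as np
from hdbscan import HDBSCAN

# Assumed input: `embeddings`, the (UMAP-reduced) document embeddings
# that BERTopic would normally hand to its clustering step.
for min_cluster_size in (10, 25, 50, 100):
    for min_samples in (1, 5, 10):
        labels = HDBSCAN(min_cluster_size=min_cluster_size,
                         min_samples=min_samples).fit_predict(embeddings)
        n_outliers = int((labels == -1).sum())
        n_clusters = int(labels.max()) + 1
        print(f"min_cluster_size={min_cluster_size:>3}, "
              f"min_samples={min_samples:>2}: "
              f"{n_clusters} clusters, {n_outliers} outliers")
```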
@drob-xx - Nice to hear from you.
I have read your write-up on tuning parameters such as "min_cluster_size" to improve performance. Thanks for sharing.
Due to inactivity, I'll be closing this issue for now. Feel free to reach out if you want to continue this discussion or re-open the issue!
Hi everyone,
I applied BERTopic to Indonesian alias_names. These alias_names are very short; in most cases they contain just one or two words, like "buat jajan" (for snacks) and "tabungan" (savings). I generated 20 topics, which look pretty good. The majority of alias_names in Topic 0 are "buat jajan" (65k+ records, above 83% of the topic), so I can safely label it accordingly.
However, when I check the outlier topic (topic = -1), I find 37k+ records with "buat jajan" (30%+ of the records in topic -1), which should have been assigned to Topic 0 but were not.
May I know how this happens? What is the logic behind it?
I configured the language as Indonesian and used fit_transform to train the model:

```python
model = BERTopic(language="Indonesian", nr_topics=20)
...
topics, probs = model.fit_transform(namelist)
```
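For reference, one way to inspect why such duplicates end up in -1, and to re-assign them, is to let HDBSCAN compute the full document-topic probability matrix and map each outlier to its most probable topic. This is a hedged sketch of that idea, not a verified fix for this dataset; note that `calculate_probabilities=True` slows training down considerably on 100k+ documents:

```python
import numpy as np
from bertopic import BERTopic

# calculate_probabilities=True asks HDBSCAN for a full document-topic
# probability matrix (noticeably slower on large corpora).
model = BERTopic(language="Indonesian", nr_topics=20,
                 calculate_probabilities=True)
topics, probs = model.fit_transform(namelist)

# Re-assign each outlier (-1) to its most probable topic; documents
# already assigned to a topic are left untouched.
new_topics = [int(np.argmax(p)) if t == -1 else t
              for t, p in zip(topics, probs)]
```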