drob-xx / TopicTuner

HDBSCAN Tuning for BERTopic Models
GNU General Public License v3.0
bertopic clustering hdbscan nlp topic-modeling tuning

TopicTuner — Tune BERTopic HDBSCAN Models

To install from PyPI:

pip install topicmodeltuner

The Problem

Out of the box, BERTopic relies on HDBSCAN to cluster documents into topics. Two of the most important HDBSCAN parameters, min_cluster_size and min_samples, will almost always have a dramatic effect on cluster formation. Together they determine how many clusters are created and how many documents land in the -1, or uncategorized, cluster. With some datasets a large number of uncategorized documents may be the right clustering, but in practice BERTopic will essentially discard a large percentage of "good" documents, using them for neither cluster formation nor topic formation.

HDBSCAN is quite sensitive to the values of these two parameters relative to the text being clustered. With the BERTopic default of min_topic_size=10 (which is passed to HDBSCAN as min_cluster_size), the defaults will more often than not produce an unmanageable number of topics as well as a sub-optimal number of uncategorized documents. Additionally, documents assigned to the -1 category are not used to determine topic vocabulary.
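
One way to make these two symptoms concrete is to count the topics and the share of documents labelled -1 in a clustering result. The sketch below assumes only that labels arrive as a list of integers with -1 marking uncategorized documents, as HDBSCAN does. The helper name summarize_labels and the sample labels are made up for illustration and are not part of TopicTuner or BERTopic.

```python
from collections import Counter


def summarize_labels(labels):
    """Return (number of clusters, number of uncategorized docs, uncategorized fraction)."""
    counts = Counter(labels)
    n_outliers = counts.get(-1, 0)
    # Every distinct label except -1 is a real cluster.
    n_clusters = len(counts) - (1 if -1 in counts else 0)
    frac = n_outliers / len(labels) if labels else 0.0
    return n_clusters, n_outliers, frac


# Made-up labels for ten documents: three clusters plus four uncategorized documents.
example_labels = [0, 0, 1, -1, 2, -1, 1, -1, 0, -1]
n_clusters, n_outliers, frac = summarize_labels(example_labels)
print(f"{n_clusters} clusters, {n_outliers} uncategorized ({frac:.0%})")
# prints: 3 clusters, 4 uncategorized (40%)
```

Comparing these numbers across different parameter values gives a quick sense of whether a setting yields too many topics or leaves too many documents uncategorized.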

The Solution

TopicTuner provides a TopicModelTuner class, a convenience wrapper for BERTopic models that efficiently manages the process of discovering optimized min_cluster_size and min_samples parameters.

To get you started, this release includes both a demo notebook and API documentation.
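
For orientation, the general idea of searching over the two HDBSCAN parameters can be sketched in a few lines. This is not TopicTuner's API or its search strategy: grid_search and fake_cluster are hypothetical names, and fake_cluster is a dependency-free stand-in that returns made-up labels so the sketch runs on its own.

```python
from itertools import product


def grid_search(cluster_fn, cluster_sizes, sample_sizes):
    """Try every (min_cluster_size, min_samples) pair and record the outcome.

    cluster_fn is any callable that returns one integer label per document,
    with -1 meaning uncategorized.
    """
    results = []
    for mcs, ms in product(cluster_sizes, sample_sizes):
        labels = cluster_fn(mcs, ms)
        results.append({
            "min_cluster_size": mcs,
            "min_samples": ms,
            "n_clusters": len(set(labels) - {-1}),
            "n_uncategorized": sum(1 for label in labels if label == -1),
        })
    # Fewest uncategorized documents first; ties keep grid order (sorted is stable).
    return sorted(results, key=lambda r: r["n_uncategorized"])


def fake_cluster(min_cluster_size, min_samples):
    """Hypothetical stand-in for a real HDBSCAN run.

    Returns made-up labels for 100 documents; the numbers carry no
    statistical meaning and exist only to exercise grid_search.
    """
    n_uncategorized = min(100, 2 * min_samples)
    n_clusters = max(1, 100 // min_cluster_size)
    clustered = [i % n_clusters for i in range(100 - n_uncategorized)]
    return [-1] * n_uncategorized + clustered


best = grid_search(fake_cluster, [10, 20, 50], [5, 10])[0]
print(best)
```

In real use, cluster_fn would fit HDBSCAN on the reduced document embeddings with the given min_cluster_size and min_samples and return its labels. Ranking by uncategorized count alone is a simplification, since the number of clusters usually matters as well.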