MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License
6.03k stars 757 forks

`precomputed` Distance Compatibility for HDBSCAN #1879

Open jjovalle99 opened 6 months ago

jjovalle99 commented 6 months ago

Hi there!

Recently, I've been experimenting with the UMAP + HDBSCAN workflow and noticed an opportunity to enhance its functionality related to distance metrics.

Proposal:

I propose adding support for precomputed distance matrices in HDBSCAN within BERTopic. This would let users apply custom distance metrics, including cosine distance, which HDBSCAN does not directly support as a built-in metric.
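To illustrate the proposal, here is a minimal sketch of clustering on a precomputed cosine distance matrix. The helper names `cosine_distance_matrix` and `cluster_precomputed` are hypothetical and not part of BERTopic or HDBSCAN. The sketch assumes the `hdbscan` package is installed and that the matrix is passed as float64.

```python
import numpy as np


def cosine_distance_matrix(embeddings: np.ndarray) -> np.ndarray:
    """Return pairwise cosine distances (1 - cosine similarity) as float64."""
    X = np.asarray(embeddings, dtype=np.float64)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    if np.any(norms == 0):
        raise ValueError("cosine distance is undefined for zero vectors")
    unit = X / norms
    dist = 1.0 - unit @ unit.T
    np.clip(dist, 0.0, 2.0, out=dist)  # remove tiny negatives caused by rounding
    np.fill_diagonal(dist, 0.0)        # a point is at distance 0 from itself
    return dist


def cluster_precomputed(embeddings: np.ndarray, min_cluster_size: int = 10) -> np.ndarray:
    """Fit HDBSCAN on a precomputed cosine distance matrix and return the labels."""
    import hdbscan  # imported lazily; assumes the `hdbscan` package is available

    clusterer = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size,
                                metric="precomputed")
    clusterer.fit(cosine_distance_matrix(embeddings))
    return clusterer.labels_
```

Note that the distance matrix has one row and one column per document, so memory grows quadratically with the number of documents.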

Why This Matters:

Implementation Insight:

I've already implemented this feature locally (very quickly) and found that it integrates well with the existing pipeline. I'm confident that it could be a valuable addition to BERTopic without compromising performance or usability. The following is a non-exhaustive sketch of the implementation. It would need more work to be fully incorporated; this is just a mock-up:

    def __init__(self,
                 language: str = "english",
                 top_n_words: int = 10,
                 n_gram_range: Tuple[int, int] = (1, 1),
                 min_topic_size: int = 10,
                 nr_topics: Union[int, str] = None,
                 low_memory: bool = False,
                 calculate_probabilities: bool = False,
                 seed_topic_list: List[List[str]] = None,
                 zeroshot_topic_list: List[str] = None,
                 zeroshot_min_similarity: float = .7,
                 embedding_model=None,
                 umap_model: UMAP = None,
                 hdbscan_model: hdbscan.HDBSCAN = None,
                 vectorizer_model: CountVectorizer = None,
                 ctfidf_model: TfidfTransformer = None,
                 representation_model: BaseRepresentation = None,
                 verbose: bool = False,
                 distance_matrix: np.ndarray = None,  # <-- new parameter
                 ):
        self.hdbscan_model = hdbscan_model or hdbscan.HDBSCAN(min_cluster_size=self.min_topic_size,
                                                              metric='euclidean',
                                                              cluster_selection_method='eom',
                                                              prediction_data=True)
        self.distance_matrix = distance_matrix  # <-- new attribute

    def _cluster_embeddings(self,
                            umap_embeddings: np.ndarray,
                            documents: pd.DataFrame,
                            partial_fit: bool = False,
                            y: np.ndarray = None) -> Tuple[pd.DataFrame,
                                                           np.ndarray]:
        ...
        logger.info("Cluster - Start clustering the reduced embeddings")
        if partial_fit:
            self.hdbscan_model = self.hdbscan_model.partial_fit(umap_embeddings)
            labels = self.hdbscan_model.labels_
            documents['Topic'] = labels
            self.topics_ = labels
        elif self.hdbscan_model.get_params()["metric"] == "precomputed":  # <-- new branch
            logger.info("Cluster - Using a precomputed distance matrix (MUST BE OF THE REDUCED EMBEDDINGS)")
            self.hdbscan_model.fit(self.distance_matrix)
            labels = self.hdbscan_model.labels_
            documents['Topic'] = labels
            self._update_topic_size(documents)

I'd love to hear your thoughts on this proposal. Do you see this as a valuable addition to BERTopic? Would there be any concerns or additional considerations we should account for?

I'm excited about the potential to contribute this feature to the community and look forward to your feedback.

Thank you for considering this enhancement!

MaartenGr commented 6 months ago

Thank you for sharing this extensive description of this use case! I agree that it would be nice to have something like this implemented although I am curious as to how many users would end up using this feature.

Having said that, you can already pass the distance matrix to BERTopic and then simply skip over dimensionality reduction (as you already did before) in order to make this work. It would, however, introduce issues with topic embeddings but I'm actually curious about what would happen.
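The workaround described above might look roughly like the following sketch. It has not been verified end to end, and as noted, topic embeddings derived from a distance matrix are unlikely to be meaningful. The function name `fit_on_distance_matrix` is hypothetical. The sketch assumes that `BaseDimensionalityReduction` from `bertopic.dimensionality` acts as a pass-through step and that the distance matrix is supplied through the `embeddings` argument of `fit_transform`.

```python
import numpy as np


def fit_on_distance_matrix(docs, distance_matrix, min_topic_size: int = 10):
    """Fit BERTopic on a precomputed distance matrix, skipping dimensionality reduction."""
    D = np.asarray(distance_matrix, dtype=np.float64)
    if D.ndim != 2 or D.shape[0] != D.shape[1]:
        raise ValueError("distance_matrix must be a square 2-D array")
    if D.shape[0] != len(docs):
        raise ValueError("distance_matrix needs one row per document")

    # Imported lazily; assumes `bertopic` and `hdbscan` are installed.
    from bertopic import BERTopic
    from bertopic.dimensionality import BaseDimensionalityReduction
    from hdbscan import HDBSCAN

    topic_model = BERTopic(
        umap_model=BaseDimensionalityReduction(),  # pass-through instead of UMAP
        hdbscan_model=HDBSCAN(min_cluster_size=min_topic_size,
                              metric="precomputed"),
    )
    # The distance matrix takes the place of the document embeddings.
    topics, _ = topic_model.fit_transform(docs, embeddings=D)
    return topic_model, topics
```

Because the reduction step is skipped, the matrix should describe distances between the vectors that would otherwise have been clustered.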

Lastly, do you think there is a way to implement this without introducing an HDBSCAN-specific parameter to the initialization of BERTopic? The reason why I ask is that my philosophy with BERTopic is to make it as modular as possible, so introducing this parameter might go against that if it is specific to HDBSCAN. Moreover, I want to keep the parameter space as small as possible in the initialization to keep the usage of BERTopic user-friendly. I have already seen some information-overload happening with the current set of parameters.

What do you think?

jjovalle99 commented 6 months ago

Hey @MaartenGr, thank you for answering!

Yes, I think it's possible to implement this. As an initial idea, we could read the metric parameter from HDBSCAN (self.hdbscan_model.get_params()["metric"]) and define the logic from there. We could leverage scikit-learn's pairwise metrics to do this without adding extra parameters, while maintaining modularity.

If I get your approval, I can start working on that.

MaartenGr commented 6 months ago

Ah right, then we would calculate the distance matrix ourselves based on what has been set within HDBSCAN. I think it's important to add checks here so that a missing "metric" parameter does not cause errors, or so that the metric is calculated automatically.
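One possible shape for these checks is sketched below. Both helper names are hypothetical, and the distance function is passed in by the caller so that the sketch stays independent of any particular clustering or metrics library. A model without `get_params` or without a `metric` setting simply receives the embeddings unchanged.

```python
from typing import Any, Callable, Optional


def get_cluster_metric(model: Any) -> Optional[Any]:
    """Return the model's `metric` setting, or None if it does not expose one."""
    get_params = getattr(model, "get_params", None)
    if callable(get_params):
        params = get_params()
        if isinstance(params, dict):
            return params.get("metric")
    # Fall back to a plain attribute for models without get_params().
    return getattr(model, "metric", None)


def prepare_cluster_input(model: Any,
                          embeddings: Any,
                          distance_fn: Callable[[Any], Any]) -> Any:
    """Return what should be passed to `model.fit`.

    If the model is configured with metric="precomputed", the embeddings are
    converted to a distance matrix; otherwise they are returned as-is.
    """
    if get_cluster_metric(model) == "precomputed":
        return distance_fn(embeddings)
    return embeddings
```

With scikit-learn, `distance_fn` could be something like `lambda X: pairwise_distances(X, metric="cosine")` using `sklearn.metrics.pairwise_distances`. Which metric to use for the precomputed case is left open in this thread, so cosine is only an example.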

Your work on this would be greatly appreciated!