MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

How can I use "precomputed" HDBSCAN in BERTopic? #2209

Open shadabmeymandi opened 2 weeks ago

shadabmeymandi commented 2 weeks ago

BERTopic Version = 0.16.4

import umap
from hdbscan import HDBSCAN
from sentence_transformers import SentenceTransformer
from bertopic import BERTopic

umap_model = umap.UMAP(n_neighbors=15, n_components=24, min_dist=0.0, metric='cosine', random_state=100)
embedding_model = SentenceTransformer('sentence-transformers/paraphrase-multilingual-mpnet-base-v2')
hdbscan_model = HDBSCAN(metric='euclidean', min_cluster_size=10, cluster_selection_method='eom', prediction_data=True)
model = BERTopic(hdbscan_model=hdbscan_model, embedding_model=embedding_model, umap_model=umap_model, language="english", calculate_probabilities=True)
topics, probabilities = model.fit_transform(sentecnes)

When I run the code above there is no problem and everything works, but I need to use HDBSCAN as below:

hdbscan_model = HDBSCAN(metric='precomputed', min_cluster_size=10, cluster_selection_method='eom', prediction_data=True)
cluster_labels = hdbscan_model.fit_predict(distances)

because I have a pairwise matrix `distances` holding the distances between my embeddings. With metric='precomputed' I can't run my BERTopic model, and the error is:

ValueError: operands could not be broadcast together with shapes (24,) (24,489)
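For reference, a square pairwise matrix like `distances` can be built directly from the sentence embeddings with scikit-learn. A minimal sketch, with random arrays standing in for the real embeddings (489 documents, 768-dimensional vectors as produced by mpnet-base models):

```python
import numpy as np
from sklearn.metrics import pairwise_distances

# Random placeholders for the real sentence embeddings.
embeddings = np.random.default_rng(0).random((489, 768))

# Square, symmetric matrix of pairwise cosine distances;
# cast to float64 as a precaution for metric='precomputed'.
distances = pairwise_distances(embeddings, metric="cosine").astype(np.float64)
print(distances.shape)  # (489, 489)
```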

MaartenGr commented 2 weeks ago

I don't think it's possible to use "precomputed" at the moment, but I'm not entirely sure. You would have to skip the UMAP step with an empty model and perhaps use the distance matrix in place of the embeddings. So perhaps something like this:

from bertopic import BERTopic
from bertopic.dimensionality import BaseDimensionalityReduction

# Fit BERTopic without actually performing any dimensionality reduction
empty_dimensionality_model = BaseDimensionalityReduction()
topic_model = BERTopic(umap_model=empty_dimensionality_model)

# Fit with distance matrix <- Not sure if this works though
topic_model.fit_transform(documents, distance_matrix)

Note: When you create an issue make sure to fill in the bug report as information is still missing, such as the version of BERTopic. Reproduction also shows how you can nicely format your code so that it's easy to read, which is missing here.

shadabmeymandi commented 2 weeks ago

Thank you for your answer, but it doesn't work.

BERTopic Version = 0.16.4. When I use:

from bertopic import BERTopic
from bertopic.dimensionality import BaseDimensionalityReduction

# Fit BERTopic without actually performing any dimensionality reduction
empty_dimensionality_model = BaseDimensionalityReduction()
topic_model = BERTopic(umap_model=empty_dimensionality_model, hdbscan_model=hdbscan_model, embedding_model=embedding_model, language="english", calculate_probabilities=True)

topic_model.fit_transform(documents, distance_matrix) 

the error is: AttributeError: No prediction data was generated

and when I use topic_model.fit_transform(documents) instead of topic_model.fit_transform(documents, distance_matrix),

the error is: ValueError: operands could not be broadcast together with shapes (384,) (384,489)

Also, when I use topic_model.fit_transform(distance_matrix),

the error is: TypeError: Make sure that the iterable only contains strings.

MaartenGr commented 1 week ago

the error is : AttributeError: No prediction data was generated

Could you share the full error log? Without it, I have no clue which lines of code this refers to.

shadabmeymandi commented 1 week ago

Sure, here you are. distance_matrix is a (489, 489) matrix, and "sentecnes" is an array of 489 sentences.

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
[<ipython-input-109-c6760120d626>](https://localhost:8080/#) in <cell line: 2>()
      1 # Fit with distance matrix <- Not sure if this works though
----> 2 topic_model.fit_transform(sentecnes,distance_matrix)

4 frames
[/usr/local/lib/python3.10/dist-packages/hdbscan/hdbscan_.py](https://localhost:8080/#) in prediction_data_(self)
   1432     def prediction_data_(self):
   1433         if self._prediction_data is None:
-> 1434             raise AttributeError("No prediction data was generated")
   1435         else:
   1436             return self._prediction_data

AttributeError: No prediction data was generated

MaartenGr commented 1 week ago

Thank you for sharing. That is not the full error log, though. Note the "4 frames" line; you will need to click on that to expand the full log.

shadabmeymandi commented 1 week ago

The full error log :

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
[<ipython-input-109-c211ff22e61f>](https://localhost:8080/#) in <cell line: 2>()
      1 # # Fit with distance matrix <- Not sure if this works though
----> 2 topic_model.fit_transform(sentecnes,distance_matrix)

4 frames
[/usr/local/lib/python3.10/dist-packages/bertopic/_bertopic.py](https://localhost:8080/#) in fit_transform(self, documents, embeddings, images, y)
    461         if len(documents) > 0:
    462             # Cluster reduced embeddings
--> 463             documents, probabilities = self._cluster_embeddings(umap_embeddings, documents, y=y)
    464             if self._is_zeroshot() and len(assigned_documents) > 0:
    465                 documents, embeddings = self._combine_zeroshot_topics(

[/usr/local/lib/python3.10/dist-packages/bertopic/_bertopic.py](https://localhost:8080/#) in _cluster_embeddings(self, umap_embeddings, documents, partial_fit, y)
   3793 
   3794             if self.calculate_probabilities and is_supported_hdbscan(self.hdbscan_model):
-> 3795                 probabilities = hdbscan_delegator(self.hdbscan_model, "all_points_membership_vectors")
   3796 
   3797         if not partial_fit:

[/usr/local/lib/python3.10/dist-packages/bertopic/cluster/_utils.py](https://localhost:8080/#) in hdbscan_delegator(model, func, embeddings)
     35     if func == "all_points_membership_vectors":
     36         if isinstance(model, hdbscan.HDBSCAN):
---> 37             return hdbscan.all_points_membership_vectors(model)
     38 
     39         str_type_model = str(type(model)).lower()

[/usr/local/lib/python3.10/dist-packages/hdbscan/prediction.py](https://localhost:8080/#) in all_points_membership_vectors(clusterer)
    646     """
    647     clusters = np.array(sorted(list(clusterer.condensed_tree_._select_clusters()))).astype(np.intp)
--> 648     all_points = clusterer.prediction_data_.raw_data
    649 
    650     # When no clusters found, return array of 0's

[/usr/local/lib/python3.10/dist-packages/hdbscan/hdbscan_.py](https://localhost:8080/#) in prediction_data_(self)
   1432     def prediction_data_(self):
   1433         if self._prediction_data is None:
-> 1434             raise AttributeError("No prediction data was generated")
   1435         else:
   1436             return self._prediction_data

AttributeError: No prediction data was generated

MaartenGr commented 1 week ago

Hmmm, I'm not sure why it states that no prediction data was generated. I assume you still had prediction_data=True, right? You could also try it without calculate_probabilities=True and calculate those later.

shadabmeymandi commented 2 days ago

It makes no difference whether I have prediction_data=True or not, but without calculate_probabilities=True the problem was solved and I built my BERTopic model. Unfortunately, this is a model without the UMAP step! I actually need all the steps, so do you have any suggestion for how I can calculate the topic coherence score under these circumstances:

  1. My model comes from:
     hdbscan_model_1 = HDBSCAN(metric='euclidean', min_cluster_size=5, cluster_selection_method='eom', prediction_data=True)
     model = BERTopic(hdbscan_model=hdbscan_model_1, embedding_model=embedding_model, umap_model=umap_model, language="english", calculate_probabilities=True)
     _, _ = model.fit_transform(documents)
  2. My topics come from:
     hdbscan_model_2 = HDBSCAN(metric='precomputed', min_cluster_size=5, cluster_selection_method='eom', prediction_data=True)
     topics = hdbscan_model_2.fit_predict(distance_matrix)

     In other words, how can I combine 1 and 2 to calculate my topic coherence score?

MaartenGr commented 12 hours ago

In other words, how can I combine 1 and 2 to calculate my topic coherence score?

You can't. That's not how this works, since you would effectively have two different sets of clusters being generated.

Instead, what you might be able to do is something like this:

model = HDBSCAN(
    metric='precomputed',
    min_cluster_size=5,
    cluster_selection_method='eom',
    prediction_data=True
)

class ClusterModel:
    def fit(self, embeddings):
        # Create the distance matrix from the embeddings
        distance_matrix = ...

        # Fit HDBSCAN on the precomputed distances
        self.hdbscan_model = model.fit(distance_matrix)
        self.labels_ = self.hdbscan_model.labels_
        return self

    def predict(self, embeddings):
        # Create the distance matrix from the embeddings
        distance_matrix = ...
        return self.hdbscan_model.predict(distance_matrix)

and then pass that as the cluster model instead:

hdbscan_model = ClusterModel()
topic_model = BERTopic(hdbscan_model=hdbscan_model, umap_model=umap_model)

I haven't checked it myself, but now you have a basic template to work from.