I don't think it's possible to use "precomputed" at the moment, but I'm not entirely sure. You would have to skip the UMAP step with an empty model and perhaps use the distance matrix in place of the embeddings. So perhaps something like this:
from bertopic import BERTopic
from bertopic.dimensionality import BaseDimensionalityReduction
# Fit BERTopic without actually performing any dimensionality reduction
empty_dimensionality_model = BaseDimensionalityReduction()
topic_model = BERTopic(umap_model=empty_dimensionality_model)
# Fit with distance matrix <- Not sure if this works though
topic_model.fit_transform(documents, distance_matrix)
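Note that for the matrix to actually be treated as distances, the clustering model would also need metric='precomputed'. A fuller, still untested sketch (the min_cluster_size value is just a placeholder):
from bertopic import BERTopic
from bertopic.dimensionality import BaseDimensionalityReduction
from hdbscan import HDBSCAN
# Skip dimensionality reduction entirely
empty_dimensionality_model = BaseDimensionalityReduction()
# Let HDBSCAN interpret its input as a precomputed distance matrix
hdbscan_model = HDBSCAN(metric='precomputed', min_cluster_size=5)
topic_model = BERTopic(
    umap_model=empty_dimensionality_model,
    hdbscan_model=hdbscan_model,
)
# Pass the (n_docs, n_docs) distance matrix in place of the embeddings
topics, probs = topic_model.fit_transform(documents, distance_matrix)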
Note: when you create an issue, make sure to fill in the bug report template, as information is still missing, such as the version of BERTopic. The reproduction section also shows how to nicely format your code so that it's easy to read, which is missing here.
Thank you for your answer, but it doesn't work.
BERTopic version = 0.16.4. When I use:
from bertopic import BERTopic
from bertopic.dimensionality import BaseDimensionalityReduction
# Fit BERTopic without actually performing any dimensionality reduction
empty_dimensionality_model = BaseDimensionalityReduction()
topic_model = BERTopic(
    umap_model=empty_dimensionality_model,
    hdbscan_model=hdbscan_model,
    embedding_model=embedding_model,
    language="english",
    calculate_probabilities=True,
)
topic_model.fit_transform(documents, distance_matrix)
the error is: AttributeError: No prediction data was generated
and when I use:
topic_model.fit_transform(documents)
instead of
topic_model.fit_transform(documents, distance_matrix)
the error is: ValueError: operands could not be broadcast together with shapes (384,) (384,489)
also when I use:
topic_model.fit_transform(distance_matrix)
the error is: TypeError: Make sure that the iterable only contains strings.
Could you share the full error log? Without it, I have no clue which lines of code this refers to.
Sure, here you are: distance_matrix is a (489, 489) matrix, and sentecnes is an array of 489 sentences.
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-109-c6760120d626> in <cell line: 2>()
1 # Fit with distance matrix <- Not sure if this works though
----> 2 topic_model.fit_transform(sentecnes,distance_matrix)
4 frames
/usr/local/lib/python3.10/dist-packages/hdbscan/hdbscan_.py in prediction_data_(self)
1432 def prediction_data_(self):
1433 if self._prediction_data is None:
-> 1434 raise AttributeError("No prediction data was generated")
1435 else:
1436 return self._prediction_data
AttributeError: No prediction data was generated
Thank you for sharing. That is not the full error log though. Note those 4 frames; you will need to click on that to see the full log.
The full error log:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-109-c211ff22e61f> in <cell line: 2>()
1 # # Fit with distance matrix <- Not sure if this works though
----> 2 topic_model.fit_transform(sentecnes,distance_matrix)
4 frames
/usr/local/lib/python3.10/dist-packages/bertopic/_bertopic.py in fit_transform(self, documents, embeddings, images, y)
461 if len(documents) > 0:
462 # Cluster reduced embeddings
--> 463 documents, probabilities = self._cluster_embeddings(umap_embeddings, documents, y=y)
464 if self._is_zeroshot() and len(assigned_documents) > 0:
465 documents, embeddings = self._combine_zeroshot_topics(
/usr/local/lib/python3.10/dist-packages/bertopic/_bertopic.py in _cluster_embeddings(self, umap_embeddings, documents, partial_fit, y)
3793
3794 if self.calculate_probabilities and is_supported_hdbscan(self.hdbscan_model):
-> 3795 probabilities = hdbscan_delegator(self.hdbscan_model, "all_points_membership_vectors")
3796
3797 if not partial_fit:
/usr/local/lib/python3.10/dist-packages/bertopic/cluster/_utils.py in hdbscan_delegator(model, func, embeddings)
35 if func == "all_points_membership_vectors":
36 if isinstance(model, hdbscan.HDBSCAN):
---> 37 return hdbscan.all_points_membership_vectors(model)
38
39 str_type_model = str(type(model)).lower()
/usr/local/lib/python3.10/dist-packages/hdbscan/prediction.py in all_points_membership_vectors(clusterer)
646 """
647 clusters = np.array(sorted(list(clusterer.condensed_tree_._select_clusters()))).astype(np.intp)
--> 648 all_points = clusterer.prediction_data_.raw_data
649
650 # When no clusters found, return array of 0's
/usr/local/lib/python3.10/dist-packages/hdbscan/hdbscan_.py in prediction_data_(self)
1432 def prediction_data_(self):
1433 if self._prediction_data is None:
-> 1434 raise AttributeError("No prediction data was generated")
1435 else:
1436 return self._prediction_data
AttributeError: No prediction data was generated
Hmmm, not sure why it states that no prediction data was generated. I can assume you still had prediction_data=True, right? You could also try it without calculate_probabilities=True and calculate those later.
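Something like this (untested, reusing your hdbscan_model and distance_matrix from above):
from bertopic import BERTopic
from bertopic.dimensionality import BaseDimensionalityReduction
# Skip dimensionality reduction and cluster the distance matrix directly,
# with probability calculation turned off for now
empty_dimensionality_model = BaseDimensionalityReduction()
topic_model = BERTopic(
    umap_model=empty_dimensionality_model,
    hdbscan_model=hdbscan_model,
    calculate_probabilities=False,
)
topics, _ = topic_model.fit_transform(documents, distance_matrix)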
No difference whether I had prediction_data=True or not, but without calculate_probabilities=True the problem was solved and I built my BERTopic model. Unfortunately, this is a model without the UMAP step!
Actually I need all the steps, so do you have any suggestion for how I can calculate the topic coherence score under these circumstances:
# 1. Full pipeline: UMAP + euclidean HDBSCAN inside BERTopic
hdbscan_model_1 = HDBSCAN(metric='euclidean', min_cluster_size=5, cluster_selection_method='eom', prediction_data=True)
model = BERTopic(hdbscan_model=hdbscan_model_1, embedding_model=embedding_model, umap_model=umap_model, language="english", calculate_probabilities=True)
_, _ = model.fit_transform(documents)
# 2. Standalone HDBSCAN on the precomputed distance matrix
hdbscan_model_2 = HDBSCAN(metric='precomputed', min_cluster_size=5, cluster_selection_method='eom', prediction_data=True)
topics = hdbscan_model_2.fit_predict(distance_matrix)
In other words, how can I combine 1 and 2 to calculate my topic coherence score?
In other words, how can I combine 1 and 2 to calculate my topic coherence score?
You can't. That's not how this works, since you would effectively have two different sets of clusters being generated.
Instead, what you might be able to do is something like this:
import hdbscan
from hdbscan import HDBSCAN
model = HDBSCAN(
    metric='precomputed',
    min_cluster_size=5,
    cluster_selection_method='eom',
    prediction_data=True
)
class ClusterModel:
    def fit(self, embeddings):
        # create distance matrix from the embeddings
        distance_matrix = ...
        # Fit the HDBSCAN model on the precomputed distances
        self.hdbscan_model = model.fit(distance_matrix)
        self.labels_ = self.hdbscan_model.labels_
        return self

    def predict(self, embeddings):
        # create distance matrix between new points and training points
        distance_matrix = ...
        # HDBSCAN has no .predict; approximate_predict returns (labels, strengths)
        labels, _ = hdbscan.approximate_predict(self.hdbscan_model, distance_matrix)
        return labels
and then pass that as the cluster model instead:
hdbscan_model = ClusterModel()
topic_model = BERTopic(hdbscan_model=hdbscan_model, umap_model=umap_model)
I haven't checked it myself, but now you have the basic template to work from.
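To make the template concrete, here is one way the blanks could be filled in. This is still untested, and computing cosine distances with scikit-learn's pairwise_distances is just an assumption; swap in however your distance matrix is actually built:
import numpy as np
from hdbscan import HDBSCAN
from sklearn.metrics import pairwise_distances

class ClusterModel:
    def __init__(self, metric='cosine', min_cluster_size=5):
        self.metric = metric
        # HDBSCAN itself only ever sees precomputed distances
        self.hdbscan_model = HDBSCAN(
            metric='precomputed',
            min_cluster_size=min_cluster_size,
            cluster_selection_method='eom',
        )

    def fit(self, embeddings):
        # Build the pairwise distance matrix from the (reduced) embeddings;
        # HDBSCAN's precomputed mode expects a square float64 matrix
        distance_matrix = pairwise_distances(embeddings, metric=self.metric).astype(np.float64)
        self.hdbscan_model.fit(distance_matrix)
        self.labels_ = self.hdbscan_model.labels_
        return self

hdbscan_model = ClusterModel()
topic_model = BERTopic(hdbscan_model=hdbscan_model, umap_model=umap_model)
topics, _ = topic_model.fit_transform(documents)
As seen above, prediction data isn't generated for precomputed inputs, so leave calculate_probabilities off with this setup; calling transform on new documents would additionally need a predict method, as in the template above.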
BERTopic version = 0.16.4
As you know, when I run the above code there is no problem and everything is OK, but I must use HDBSCAN with metric='precomputed', because I have a pairwise matrix distances[] holding the distances between my embeddings. With this metric I can't run my BERTopic model, and the error is:
ValueError: operands could not be broadcast together with shapes (24,) (24,489)