MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

Topic Visualization and Hierarchy in BERTopic 0.15 with Large Datasets #1319

Closed kyongchyolyang closed 1 year ago

kyongchyolyang commented 1 year ago

Hello!

I am currently in the process of preparing a research paper on Twitter topic modeling using your BERTopic library. For this purpose, I've made modifications to your latest release, the 0.15 version of the BERTopic-Big Data notebook, in order to train it on Twitter data.

However, I've come across some issues depending on the volume of the training data (using the same dataset but changing the volume). The specific issues are as follows:

When using 886,987 Twitter text data points:
   - Number of topics generated: 2,194
   - topic_model.visualize_topics(): The method runs without any error, but it doesn't output any result.
   - topic_model.visualize_hierarchy(): This method encounters an error.

When using a 10% sample (88,698 data points):
   - Number of topics generated: 278
   - topic_model.visualize_topics(): This method works fine.
   - topic_model.visualize_hierarchy(): This method again encounters an error.

However, when the dataset is reduced to 1,000 data points, no errors occur at all.

I'm working with Twitter data and need the topic merging functionality to work seamlessly for my analysis. The full traceback for topic_model.visualize_hierarchy() is included at the end of this post.

I'd appreciate any insights or suggestions you might have to help me resolve these issues. Thank you for your time and effort in developing this useful library.

ValueError                                Traceback (most recent call last)
in <cell line: 1>()
----> 1 topic_model.visualize_hierarchy()

6 frames

/usr/local/lib/python3.10/dist-packages/bertopic/_bertopic.py in visualize_hierarchy(self, orientation, topics, top_n_topics, custom_labels, title, width, height, hierarchical_topics, linkage_function, distance_function, color_threshold)
   2751 """
   2752 check_is_fitted(self)
-> 2753 return plotting.visualize_hierarchy(self,
   2754 orientation=orientation,
   2755 topics=topics,

/usr/local/lib/python3.10/dist-packages/bertopic/plotting/_hierarchy.py in visualize_hierarchy(topic_model, orientation, topics, top_n_topics, custom_labels, title, width, height, hierarchical_topics, linkage_function, distance_function, color_threshold)
    139 distance_function(x), embeddings.shape[0])
    140 # Create dendogram
--> 141 fig = ff.create_dendrogram(embeddings,
    142 orientation=orientation,
    143 distfun=distance_function_viz,

/usr/local/lib/python3.10/dist-packages/plotly/figure_factory/_dendrogram.py in create_dendrogram(X, orientation, labels, colorscale, distfun, linkagefun, hovertext, color_threshold)
     96 distfun = scs.distance.pdist
     97
---> 98 dendrogram = _Dendrogram(
     99 X,
    100 orientation,

/usr/local/lib/python3.10/dist-packages/plotly/figure_factory/_dendrogram.py in __init__(self, X, orientation, labels, colorscale, width, height, xaxis, yaxis, distfun, linkagefun, hovertext, color_threshold)
    150 distfun = scs.distance.pdist
    151
--> 152 (dd_traces, xvals, yvals, ordered_labels, leaves) = self.get_dendrogram_traces(
    153 X, colorscale, distfun, linkagefun, hovertext, color_threshold
    154 )

/usr/local/lib/python3.10/dist-packages/plotly/figure_factory/_dendrogram.py in get_dendrogram_traces(self, X, colorscale, distfun, linkagefun, hovertext, color_threshold)
    338
    339 """
--> 340 d = distfun(X)
    341 Z = linkagefun(d)
    342 P = sch.dendrogram(

/usr/local/lib/python3.10/dist-packages/bertopic/plotting/_hierarchy.py in <lambda>(x)
    136
    137 # wrap distance function to validate input and return a condensed distance matrix
--> 138 distance_function_viz = lambda x: validate_distance_matrix(
    139 distance_function(x), embeddings.shape[0])
    140 # Create dendogram

/usr/local/lib/python3.10/dist-packages/bertopic/_utils.py in validate_distance_matrix(X, n_samples)
    139 # Make sure its entries are non-negative
    140 if np.any(X < 0):
--> 141 raise ValueError("Distance matrix cannot contain negative values.")
    142
    143 return X

ValueError: Distance matrix cannot contain negative values.

MaartenGr commented 1 year ago

> However, when the dataset is reduced to 1,000 data points, no errors occur at all.

This might just be a result of the number of topics you generated. Thousands of topics may be too many data points for plotly to visualize. I believe the .visualize_topics method does not currently use WebGL, which would make it easier to visualize relatively large data.
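
One thing that might be worth trying while the full plot stays blank is to draw only the most frequent topics. The sketch below is an illustration, not a confirmed fix: the helper name `visualize_most_frequent_topics` and the default of 50 are hypothetical, and `topic_model` is assumed to be an already fitted BERTopic model. It relies only on the `top_n_topics` argument of `visualize_topics`.

```python
def visualize_most_frequent_topics(topic_model, n_topics=50):
    """Plot only the `n_topics` most frequent topics instead of all of them.

    `topic_model` is assumed to be a fitted BERTopic instance; the helper
    name and the default of 50 are illustrative choices.
    """
    # `top_n_topics` limits the plot to the most frequent topics, which keeps
    # the number of points plotly has to render small.
    return topic_model.visualize_topics(top_n_topics=n_topics)


# Usage, assuming a fitted model:
# fig = visualize_most_frequent_topics(topic_model, n_topics=100)
# fig.show()
```

Whether this restores the output with 2,194 topics is untested here; it only reduces how much has to be drawn.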

> ValueError: Distance matrix cannot contain negative values.

I believe you can find a fix for that issue here. Apparently, the default distance function can return slightly negative values (which might just be a rounding issue). You can solve that by passing a custom distance function to visualize_hierarchy through its distance_function parameter.
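
A minimal sketch of such a custom distance function, assuming the negative entries really are tiny floating-point artifacts: wrap whatever distance function you use and clamp its output at zero. The wrapper name `non_negative` is hypothetical, and the cosine distance in the usage comments is an assumed stand-in for the default, not a quote from the linked fix.

```python
import numpy as np


def non_negative(distance_function):
    """Wrap a distance function so that its output never goes below zero.

    Values such as -1e-16 can appear when a distance is computed as
    1 - similarity; clamping them to 0 is enough to pass a
    "no negative values" check without changing the result meaningfully.
    """
    def wrapped(x):
        distances = np.asarray(distance_function(x), dtype=float)
        # np.clip returns a new array with every value below 0 set to 0.
        return np.clip(distances, 0.0, None)

    return wrapped


# Usage, assuming scikit-learn is installed and `topic_model` is fitted:
# from sklearn.metrics.pairwise import cosine_similarity
# cosine_distance = lambda x: 1 - cosine_similarity(x)
# fig = topic_model.visualize_hierarchy(
#     distance_function=non_negative(cosine_distance)
# )
```

Clamping only hides rounding noise. If the wrapped function returns clearly negative values, the distance definition itself is the thing to check.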

kyongchyolyang commented 1 year ago

Thank you! It works!!!