To start off, could you share your entire code for training and instantiating the model? That way, it becomes a bit easier to see what exactly is happening here.
However, when I ran .fit_transform on these documents after setting the min_cluster_size of the HDBSCAN algorithm to 500, my runtime crashed in Google Colab. I have no clue what happened: the runtime crashes instantly, and once it has crashed I cannot see what happened to the resources.
If you set verbose=True, it should give you some logging that will help us understand where it might be crashing. Could you check after which step it crashes?
The issue is that most of the topics have fewer than 500 documents. Out of 4553 topics, 4124 have fewer than 500 docs, so only 429 topics are "useful". This is a good thing because I don't want 4553 topics (that's far too many), but I don't know what to do next. Is it a good idea to use .reduce_topics and set it to around 429, or will this not solve the problem? I could also keep only the 429 "big" topics and discard the rest, but then I think I would lose quite a lot of information.
The thing is, with 3 million tweets, it is quite expected to have several thousands of topics. If you want to reduce that number of topics, you will, almost by definition, get less fine-grained topics as unrelated documents get merged.
There are a number of things you can do. First, you can stick with the 429 "big" topics, train a supervised BERTopic model on those topics, and predict the topics of all other tweets. Second, increasing min_topic_size is definitely a good option, assuming we can find out why it is crashing for you; increasing that value should indeed lead to larger topics. Third, you can try a different clustering algorithm, like k-Means, to make sure that every document ends up in a topic and that relatively large topics get created (see the sketch below). Lastly, you can try to reduce the number of topics, but going from ~5000 topics to ~500 topics might be too big a step.
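As a minimal sketch of the k-Means option (assuming `docs` holds your documents and `reduced_embeddings` your pre-reduced embeddings; the 429 clusters simply mirror the number of "big" topics mentioned above):

```python
from sklearn.cluster import KMeans
from bertopic import BERTopic
from bertopic.dimensionality import BaseDimensionalityReduction

# k-Means assigns every document to a cluster, so there is no -1 outlier topic,
# and the number of topics is fixed up front instead of being discovered.
cluster_model = KMeans(n_clusters=429)

# Pre-reduced embeddings are passed directly, so dimensionality reduction inside BERTopic is skipped
topic_model = BERTopic(umap_model=BaseDimensionalityReduction(),
                       hdbscan_model=cluster_model, verbose=True)
topics, _ = topic_model.fit_transform(docs, embeddings=reduced_embeddings)
```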
One last question: out of the 3,014,471 tweets, 1,677,106 were assigned to topic -1. I know HDBSCAN takes noise into account, but this looks like a lot of noise. Tweets are short, so maybe the problem lies there, but is there a way to reduce the noise without removing it entirely? I know I could use other algorithms like k-means, but I think the incorporation of noise is quite useful in my analysis, so I was wondering whether there is a "middle way".
You could use the .reduce_outliers method in BERTopic, which allows for dynamically updating the outlier topics depending on a threshold of your choosing.
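A sketch of what that could look like (the strategy and threshold values are only illustrative, and `docs` stands for the documents you passed to `.fit_transform`):

```python
# Reassign outlier documents (topic -1) to their most similar topic based on c-TF-IDF;
# documents whose best similarity stays below the threshold remain outliers.
new_topics = topic_model.reduce_outliers(docs, topics, strategy="c-tf-idf", threshold=0.1)

# Optionally update the topic representations with the new assignments
topic_model.update_topics(docs, topics=new_topics)
```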
This is the code for the model. I reduced the embeddings separately to save them so I could use them with different HDBSCAN algorithms later on.
import pandas as pd
from umap import UMAP          # or cuml.manifold.UMAP when using the GPU
from hdbscan import HDBSCAN    # or cuml.cluster.HDBSCAN when using the GPU
from bertopic import BERTopic
from bertopic.vectorizers import ClassTfidfTransformer
from bertopic.dimensionality import BaseDimensionalityReduction

df = pd.read_csv(data_dir, usecols=["text_unclean", "full_text_lower_lemma", "created_at"])
text_unclean = df['text_unclean'].astype(str)

# ST_unclean: the precomputed sentence-transformer embeddings (created elsewhere); reduce them to 5 dimensions up front
reducer_5d_15n = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine')
embeddings_ST_unclean_UMAP_5d_15n = reducer_5d_15n.fit_transform(ST_unclean)

# The embeddings are already reduced, so skip dimensionality reduction inside BERTopic
empty_dimensionality_model = BaseDimensionalityReduction()
hdbscan_model = HDBSCAN(min_cluster_size=500, metric='euclidean', cluster_selection_method='eom', prediction_data=False)
ctfidf_model = ClassTfidfTransformer(reduce_frequent_words=True)

topic_model_ST_unclean_default = BERTopic(umap_model=empty_dimensionality_model, hdbscan_model=hdbscan_model, ctfidf_model=ctfidf_model, top_n_words=15, verbose=True)
topic_model_ST_unclean_default_topics, topic_model_ST_unclean_default_probs = topic_model_ST_unclean_default.fit_transform(documents=text_unclean, embeddings=embeddings_ST_unclean_UMAP_5d_15n)
Since I pass pre-reduced embeddings, it crashes at the clustering step (the verbose logging confirms that "the embeddings are reduced").
I tried reducing the topics to 500, but my runtime crashed here because of memory issues. I have 51 GB of RAM in Colab, which apparently was not enough for .reduce_topics.
Thank you for the other options! I will look into those as well.
Are you using cuML's HDBSCAN? If not, then that might be a good solution to the problem you are facing. It scales much better than the original implementation if you have a GPU.
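A sketch of what plugging in cuML's GPU-accelerated UMAP and HDBSCAN could look like (parameter values carried over from the discussion above):

```python
from cuml.cluster import HDBSCAN
from cuml.manifold import UMAP
from bertopic import BERTopic

# Both dimensionality reduction and clustering run on the GPU
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0)
hdbscan_model = HDBSCAN(min_cluster_size=500)

topic_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model, verbose=True)
```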
NOTE: I changed your message a bit such that it is easier to read the code.
Yes, I used cuML's HDBSCAN. It works really well with a smaller min_topic_size: it only takes around 20 minutes to cluster and does not immediately exhaust the system RAM or GPU RAM, so I don't know why it does not work with a larger min_topic_size.
In that case, it might be worthwhile to open up an issue at their repo as I am not entirely sure what might be causing it in their solution. They have much more expertise when it comes to GPU-accelerated HDBSCAN. Also, as a side note, I believe the v23.04 version of cuML should also improve upon a number of things, so it might be worthwhile to try that.
In general, increasing min_cluster_size will increase the memory required to fit the HDBSCAN model (on CPU or GPU). At large values of min_cluster_size, cuML's HDBSCAN currently requires more memory than the CPU version (see https://github.com/rapidsai/cuml/issues/5357 for an example), which is likely causing your out-of-memory issue when using 500 as the parameter value. We'll look into this behavior.
For now, in addition to Maarten's suggestions above, if you don't need calculate_probabilities=True, you could use cuML's UMAP (the CPU version will be very slow with millions of records) but use CPU HDBSCAN (see this blog for more information), as the HDBSCAN fit is not usually the bottleneck without calculate_probabilities=True.
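A sketch of that mixed setup, reusing the parameter values from the code above (the core_dist_n_jobs setting is just a suggestion to use all CPU cores):

```python
from cuml.manifold import UMAP   # GPU UMAP for the expensive dimensionality reduction
from hdbscan import HDBSCAN      # CPU HDBSCAN for clustering
from bertopic import BERTopic

umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0)
hdbscan_model = HDBSCAN(min_cluster_size=500, metric='euclidean',
                        cluster_selection_method='eom', prediction_data=False,
                        core_dist_n_jobs=-1)

# calculate_probabilities stays at its default (False), keeping the CPU HDBSCAN fit manageable
topic_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model, verbose=True)
```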
Alternatively, you could rent a larger GPU in the cloud (if you're using the free-tier GPU on Colab). The A100 GPUs available from cloud service providers have 80 GB of memory, which may be sufficient depending on your min_cluster_size value and the dataset.
Closing this due to inactivity. Let me know if I need to re-open the issue!
Hello,
I am working with a very large corpus of around 3M documents. Thus, I wanted to increase the min_cluster_size in HDBSCAN to 500 to decrease the number of topics. Moreover, small topics with only a few documents have no value in my research (I am looking for trending Twitter topics); topics only matter if there are about 500 documents related to them.
However, when I ran .fit_transform on these documents after setting the min_cluster_size of the HDBSCAN algorithm to 500, my runtime crashed in Google Colab. I have no clue what happened: the runtime crashes instantly, and once it has crashed I cannot see what happened to the resources.
I know it has to be because the resources, either the GPU RAM or the system RAM, are fully used. However, I do not have enough knowledge to understand why that happens when increasing min_cluster_size.
Next, I tried decreasing it from 500 to 100 and got the same error. When setting it to 50, it did work. As parameters for the algorithms, I used all the defaults for UMAP and HDBSCAN (except the min_cluster_size).
The issue is that most of the topics have fewer than 500 documents. Out of 4553 topics, 4124 have fewer than 500 docs, so only 429 topics are "useful". This is a good thing because I don't want 4553 topics (that's far too many), but I don't know what to do next. Is it a good idea to use .reduce_topics and set it to around 429, or will this not solve the problem? I could also keep only the 429 "big" topics and discard the rest, but then I think I would lose quite a lot of information.
One last question: out of the 3,014,471 tweets, 1,677,106 were assigned to topic -1. I know HDBSCAN takes noise into account, but this looks like a lot of noise. Tweets are short, so maybe the problem lies there, but is there a way to reduce the noise without removing it entirely? I know I could use other algorithms like k-means, but I think the incorporation of noise is quite useful in my analysis, so I was wondering whether there is a "middle way".
Thank you in advance!!