MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

Model unable to finish if calculate_probabilities=True #1174

Closed · econinomista closed this issue 1 year ago

econinomista commented 1 year ago

Dear all, I am facing major issues working with BERTopic. I have a dataset of around 1 million tweets. First, I want to train my model on 50 percent of the dataset; in a second step, I want to use transform to make probability predictions for the whole dataset. However, the fit step already does not finish. I run my code on a virtual machine with 256 GB RAM and use verbose=True, so I can see that the model reaches 100 percent. However, it is not quite clear to me what causes the problem afterwards. The step of transforming documents to embeddings seems to work, but the model crashes when reducing dimensionality. Crashing means it takes a very long time but never finishes the calculations to go to the next step of clustering. I tried using cuML's UMAP, but I cannot install cuML since it only runs on Windows 11, which I do not have. Is there any possibility to resolve this? I do need the probabilities to make the predictions for every topic in transform.

Code:

```python
from bertopic import BERTopic
from umap import UMAP
from hdbscan import HDBSCAN
from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords

# min_size, gen_probs, and preprocessed_texts are defined elsewhere in the script
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', random_state=100)
hdbscan_model = HDBSCAN(min_cluster_size=min_size, metric='euclidean', cluster_selection_method='eom', prediction_data=True)

german_stop_words = stopwords.words('german')
vect = CountVectorizer(stop_words=german_stop_words)

# the custom HDBSCAN model is now also passed in; fit_transform returns a (topics, probabilities) tuple
topic_model_test = BERTopic(verbose=True, language="multilingual", calculate_probabilities=gen_probs, umap_model=umap_model, hdbscan_model=hdbscan_model, vectorizer_model=vect)
topics, probs = topic_model_test.fit_transform(preprocessed_texts)
```

MaartenGr commented 1 year ago

> Crashing means it takes a very long time but never finishes the calculations to go to the next step of clustering.

Just to be sure, what is the very last message you see in BERTopic's logging? I am asking because it is not entirely clear to me whether the model has finished the dimensionality reduction or not.

> Crashing means it takes a very long time but never finishes the calculations to go to the next step of clustering.

I can imagine not wanting to wait a long time before moving on to the clustering step. To be precise, how long did you wait? This helps me understand whether it is an error or whether the model simply takes a bit longer than expected.

> I do need the probabilities to make the predictions for every topic in transform.

The probabilities are not related to the dimensionality reduction, so there should be no problem there.

If the problem indeed lies within the dimensionality reduction, it might be worthwhile to decrease the number of documents that you train on a bit further. For example, 200_000 is already quite a large number of documents to train on. Do note that when you reach the clustering step, calculating probabilities can take even longer; in that case, it might be worthwhile to use cuML and see if you can get an environment that supports it.
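For illustration, a minimal sketch of that fit-on-a-subset approach, assuming `preprocessed_texts` holds the full list of tweets (the variable name comes from the code above; the subset size is just the example figure mentioned here):

```python
# Sketch: fit on a random subsample, then predict the remaining documents
# with .transform(). Assumes preprocessed_texts is the full list of tweets.
import random

from bertopic import BERTopic

random.seed(42)
sample = random.sample(preprocessed_texts, 200_000)

topic_model = BERTopic(calculate_probabilities=True, verbose=True)
topic_model.fit(sample)

# Topic assignments and soft-clustering probabilities for the full dataset
topics, probs = topic_model.transform(preprocessed_texts)
```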

econinomista commented 1 year ago

The last thing I see is the „Reducing dimensionality“ notification. The verbose progress bar reached 100 percent after about 4 hours, „Transforming embeddings“ then took 2 hours, and „Reducing dimensionality“ appeared; after letting that run for another 16 hours, still nothing had changed. Also, the RAM usage shows that the RAM is never fully used; it looks like processes are starting but then crashing again.

Also, I tried using smaller datasets before, but the results do not look good then. I use Twitter data, so with smaller datasets, around 5-10 of the top 100 topics are always described by stopwords. I am not sure why this is the case, because I already use a German stopword filter in the OnlineVectorizer. Moreover, many topics occur several times (for instance, 4 of the top 100 topics all appear to be about migration). Is there a way to improve that?

Thank you so much for your quick response!! Nikola


MaartenGr commented 1 year ago

> The last thing I see is the „Reducing dimensionality“ notification. The verbose progress bar reached 100 percent after about 4 hours, „Transforming embeddings“ then took 2 hours, and „Reducing dimensionality“ appeared; after letting that run for another 16 hours, still nothing had changed. Also, the RAM usage shows that the RAM is never fully used; it looks like processes are starting but then crashing again.

Did you see "Reducing dimensionality" or "Reduced dimensionality"? If it is the latter, then that means that dimensionality reduction was actually successful and that the main problem lies with soft-clustering with HDBSCAN. In that case, I believe either not using calculate_probabilities or using cuML's HDBSCAN would be the solution here. If using cuML, then it is important to also use a GPU.
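For reference, a sketch of what swapping in cuML's implementations looks like, following BERTopic's GPU acceleration documentation (this assumes a RAPIDS installation, which requires a supported Linux environment and an NVIDIA GPU):

```python
from bertopic import BERTopic
from cuml.cluster import HDBSCAN
from cuml.manifold import UMAP

# GPU-accelerated drop-in replacements for the CPU UMAP and HDBSCAN
umap_model = UMAP(n_components=5, n_neighbors=15, min_dist=0.0)
hdbscan_model = HDBSCAN(min_samples=10, gen_min_span_tree=True, prediction_data=True)

topic_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model)
```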

> I am not sure why this is the case, because I already use a German stopword filter in the OnlineVectorizer. Moreover, many topics occur several times (for instance, 4 of the top 100 topics all appear to be about migration). Is there a way to improve that?

I might be mistaken but I do not see the OnlineVectorizer in your code. Do you mean the CountVectorizer? Also, can you show some of the stopwords that should have been removed and their exact representations in the vectorizer?

econinomista commented 1 year ago

Unfortunately, I do not have the possibility to use a GPU; I also considered this for cuML's UMAP, but it does not run on my system since I have Windows 2019. But you are right, it says „Reduced dimensionality“ in the end. Is there any possibility to improve HDBSCAN performance without a GPU? As I said, it is necessary to calculate the probabilities in my project, since I want the probability distribution for each single tweet.

And yes, I am sorry, I meant CountVectorizer. What I do is the following:

```python
vect = CountVectorizer(stop_words=german_stop_words)
topic_model_test = BERTopic(verbose=True, language="multilingual", calculate_probabilities=gen_probs, umap_model=umap_model, vectorizer_model=vect)
```

So I use a predefined list to filter German stopwords from the tweets.


MaartenGr commented 1 year ago

> But you are right, it says „Reduced dimensionality“ in the end. Is there any possibility to improve HDBSCAN performance without a GPU?

Not really, it is a compute-heavy function in HDBSCAN that I do not believe has any alternatives aside from cuML's HDBSCAN.

> As I said, it is necessary to calculate the probabilities in my project, since I want the probability distribution for each single tweet.

You could still use .approximate_distribution (https://maartengr.github.io/BERTopic/getting_started/distribution/distribution.html) to create topic distributions for each tweet using a variety of methods.
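A short sketch of that approach, assuming a fitted `topic_model` and the original list of documents in `docs`:

```python
# Approximate a topic distribution per document, without needing
# calculate_probabilities=True during training.
topic_distr, _ = topic_model.approximate_distribution(docs)

# topic_distr[i] is the distribution over topics for document i
print(topic_distr[0])
```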

> So I use a predefined list to filter German stopwords from the tweets.

Can you show some of the stopwords that should have been removed and their exact representations in the vectorizer? I am looking for an exact example to see what is happening.
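One hypothetical way to produce such an example, assuming a fitted `topic_model` and the `german_stop_words` list from earlier, is to check which stopwords still appear in the topic representations:

```python
# Hypothetical check: list any stopwords that survived into the top words
# of the first ten topics (get_topic returns (word, weight) pairs).
for topic_id in range(10):
    topic_words = topic_model.get_topic(topic_id) or []
    leftovers = [word for word, _ in topic_words if word in german_stop_words]
    if leftovers:
        print(topic_id, leftovers)
```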

econinomista commented 1 year ago

That is very unfortunate; so in practice, you cannot use BERT for accurate predictions on larger datasets? Moreover, another question came up. I have now fitted my model on 30 percent of the data and used transform to predict all tweets. Now I want to use hierarchical topics and other visualisations. However, it is not working and raises ValueError("All arrays must be of the same length"). Is it only possible to use hierarchical topics with fit_transform, i.e. on the training data, rather than with a training dataset and then a test dataset?


MaartenGr commented 1 year ago

> That is very unfortunate; so in practice, you cannot use BERT for accurate predictions on larger datasets?

If by BERT you mean the underlying language models, then yes, you can use them for accurate predictions with large datasets. The default language model in BERTopic is a sentence-transformers model, which can scale quite well to large datasets, especially if you batch the input.
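As a sketch of that batching idea, you could pre-compute the embeddings yourself and pass them to BERTopic, so the embedding step runs once, up front (the model name below is BERTopic's multilingual default):

```python
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer

# Embed the documents in batches, separately from topic modeling
embedding_model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
embeddings = embedding_model.encode(docs, batch_size=64, show_progress_bar=True)

# Pass the pre-computed embeddings so BERTopic skips the embedding step
topic_model = BERTopic(language="multilingual")
topics, probs = topic_model.fit_transform(docs, embeddings=embeddings)
```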

If you mean BERTopic itself, then there are quite a number of ways you can scale to larger datasets in practice. First, you can use GPU acceleration for most components within BERTopic, from the language models and dimensionality reduction to clustering and representation models. More specifically, you can use cuML to speed up UMAP and HDBSCAN quite a bit.

If, however, you do not have access to a GPU, then you need to consider algorithms that work well for larger datasets on a CPU. For instance, you can replace the language model with TF-IDF representations, dimensionality reduction with truncated SVD, and clustering with k-Means. In practice, it is up to you to decide which models work best for your use case. BERTopic is meant to be modular in as many aspects as possible; fortunately, this means that if you encounter a bottleneck somewhere, you can often replace that component with something that works better for that use case.
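A CPU-only sketch along those lines, using BERTopic's support for scikit-learn pipelines as the embedding backend and a clusterer with fit/predict in place of HDBSCAN (the SVD dimensionality and number of clusters below are arbitrary choices):

```python
from bertopic import BERTopic
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# TF-IDF + truncated SVD instead of a transformer-based language model
embedding_model = make_pipeline(TfidfVectorizer(), TruncatedSVD(100))

# k-Means instead of HDBSCAN; note that it produces no outlier topic
cluster_model = KMeans(n_clusters=50)

topic_model = BERTopic(embedding_model=embedding_model, hdbscan_model=cluster_model)
```

Note that k-Means does not provide HDBSCAN-style soft-clustering probabilities, so .approximate_distribution would be the way to get per-document topic distributions in this setup.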

> Now I want to use hierarchical topics and other visualisations. However, it is not working and raises ValueError("All arrays must be of the same length"). Is it only possible to use hierarchical topics with fit_transform, i.e. on the training data, rather than with a training dataset and then a test dataset?

It is indeed only possible to perform hierarchical topic modeling on the data the model was trained on. The reason for this is that the training data describes BERTopic's internal topics, and as such, to perform hierarchical topic modeling we want to use the data that best describes the original topics.
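A minimal sketch, assuming `train_docs` is exactly the set of documents passed to fit or fit_transform:

```python
# Hierarchical topics must be computed from the training documents,
# not from documents that were only passed through .transform().
hierarchical_topics = topic_model.hierarchical_topics(train_docs)
fig = topic_model.visualize_hierarchy(hierarchical_topics=hierarchical_topics)
```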

jburdo1 commented 1 year ago

> Now I want to use hierarchical topics and other visualisations. However, it is not working and raises ValueError("All arrays must be of the same length"). Is it only possible to use hierarchical topics with fit_transform, i.e. on the training data, rather than with a training dataset and then a test dataset?

> It is indeed only possible to perform hierarchical topic modeling on the data the model was trained on. The reason for this is that the training data describes BERTopic's internal topics, and as such, to perform hierarchical topic modeling we want to use the data that best describes the original topics.

Maarten, is this also the case for using .get_document_info to display document-topic probabilities? Is there any way to obtain these probabilities using only .transform, or must we use fit_transform on any data for which we want to capture document-topic probabilities?

MaartenGr commented 1 year ago

@jburdo1 If you have set calculate_probabilities=True and you are using HDBSCAN (not cuML's, although support for that has been added in the main branch), then you can get the probabilities using .transform.
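A small sketch of what that looks like, assuming the model was fitted with calculate_probabilities=True and the CPU HDBSCAN:

```python
# .transform() then returns both topic assignments and a full
# document-topic probability matrix of shape (n_documents, n_topics).
topics, probs = topic_model.transform(new_docs)
```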

MaartenGr commented 1 year ago

Due to inactivity, I'll be closing this issue. Let me know if you want me to re-open the issue!