MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

RAM load shoots up at the end of .fit_transform() call #129

Closed · sokol11 closed this issue 3 years ago

sokol11 commented 3 years ago

Hi. I tried running BERTopic on Google Colab cloud GPUs. The embedding is blazingly fast, compared to what I have been getting on my CPU server.

Unfortunately, at the very end of the .fit_transform() call, the algorithm overflows server RAM (25 GB) and crashes. The crash happens after the HDBSCAN clustering is complete, so I imagine it is during the c-TF-IDF step. Memory load stays around 6-8 GB during all prior steps but then shoots up rapidly after HDBSCAN.

I imagine this is expected behavior for c-TF-IDF, and I just need a high-memory server. Still, I wanted to double-check whether it is something that can be rectified with the algorithm settings, or if it could be a bug. Btw, I tried setting calculate_probabilities=False, and it did not make a difference.

Any guidance would be hugely appreciated. Thank you!

MaartenGr commented 3 years ago

Glad to hear that extracting the embeddings is much faster now!

Although it could be due to the c-TF-IDF, there are a few more things that could cause this issue. First, could you share the code that you used? That helps in identifying the issue by looking at your hyperparameters, order of code, etc. Second, how many documents do you have, and how did you pass them to the fit_transform() call? A large number of documents could be the cause, but so could documents that are pre-tokenized or passed in the wrong format.

sokol11 commented 3 years ago

Ok. Here's how I instantiate the model:

from bertopic import BERTopic
from flair.embeddings import TransformerDocumentEmbeddings
from sklearn.feature_extraction.text import CountVectorizer

embeddings = TransformerDocumentEmbeddings('microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext')
embeddings.tokenizer.model_max_length = 512

vectorizer_model = CountVectorizer(
    max_df = .5,
    ngram_range=(1, 3)
)

topic_model = BERTopic(
    embedding_model=embeddings,
    min_topic_size=10,
    vectorizer_model=vectorizer_model,
    calculate_probabilities=False,
    verbose=True
)

As for the documents, these are scientific publication abstracts. I have them pre-loaded in a DataFrame column, so I just convert that column to a list and remove some HTML tags from the strings, like so:

docs = df['abstract'].astype(str).tolist()
docs = [i.replace('<sup>', '') for i in docs]
docs = [i.replace('<i>', '') for i in docs]

Originally, I also had this line there: docs = [i.rstrip('.').replace('.', ' [SEP]') for i in docs]. But then I noticed that the model I use (microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext) starts treating "sep" as a regular word token, so it pops up in the topics. So now I just let the model's pre-trained tokenizer do its thing, which I believe adds [SEP] only at the end of the document, so the whole abstract is treated as one long sentence. I imagine this is the intended behavior, but I am not entirely sure.
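(A quick way to verify this is to tokenize one abstract and inspect the resulting tokens. This is just a minimal sketch, assuming the `embeddings` and `docs` variables from the snippets above; the flair wrapper exposes the underlying Hugging Face tokenizer as `embeddings.tokenizer`.)

```python
# Sanity check of the tokenizer (sketch; assumes `embeddings` and `docs` from above).
ids = embeddings.tokenizer(docs[0], truncation=True)["input_ids"]
tokens = embeddings.tokenizer.convert_ids_to_tokens(ids)
print(tokens[:5])   # should start with [CLS]
print(tokens[-5:])  # [SEP] should appear only once, at the very end
```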

Finally, I run the model, like so:

topics, probs = topic_model.fit_transform(docs)

Here's the log from the last time I ran it before it crashed:

29439it [13:13, 37.09it/s]
2021-05-28 02:06:00,132 - BERTopic - Transformed documents to Embeddings
2021-05-28 02:06:33,885 - BERTopic - Reduced dimensionality with UMAP
2021-05-28 02:06:36,556 - BERTopic - Clustered UMAP embeddings with HDBSCAN

I tried running it both on my entire data set (~50,000 docs) and on a reduced data set (~30,000 docs). It still crashed on the reduced data set.

Any thoughts? Thank you!

MaartenGr commented 3 years ago

Thank you for the extensive response! I am quite sure that the c-TF-IDF matrix has become too large to keep in memory. The main cause is the n_gram_range of (1, 3). Even with relatively few documents, n-grams of size 3 can create an enormous vocabulary, since many combinations are possible. To prevent this, simply set min_df in the CountVectorizer to a value larger than 1. Most of the 3-grams that get created appear only once or twice across all documents, and setting min_df to a larger value keeps them out of the c-TF-IDF matrix. Try setting it to 5 and see what happens!
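For example, a minimal sketch of the adjusted vectorizer, reusing the max_df and ngram_range values from the snippet above (the value 5 is just a starting point to tune):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Keep the trigrams, but drop n-grams that occur in fewer than 5 documents
# so the c-TF-IDF vocabulary stays manageable.
vectorizer_model = CountVectorizer(
    max_df=.5,
    min_df=5,           # ignore terms that appear in fewer than 5 documents
    ngram_range=(1, 3)
)
```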

sokol11 commented 3 years ago

Thanks, Maarten! This appears to have been the exact issue. After I set min_df=5, RAM load is no longer a problem.

Quick follow-up question: are the topic clusters affected by the n_gram_range or other vectorizer settings? Or do they only affect the topic words/representation?

For example, suppose my vectorizer_model = CountVectorizer(ngram_range=(1, 3), min_df=5). If I run .fit_transform() with default parameters, followed by .update_topics(vectorizer_model=vectorizer_model), would the final topics be the same as if I passed that vectorizer to BERTopic up front and ran a single .fit_transform() call, without update_topics()?

Thank you, really appreciate your help!

MaartenGr commented 3 years ago

The CountVectorizer should not affect the creation of the topic clusters. It is merely meant to create a nicely interpretable topic representation, which you can control after the topic model has been created. It might affect topic reduction, though, since merging similar topics is based, in part, on their topic representations.

In theory, yes, the topics should be the same. However, since UMAP is stochastic, you will notice differences between those outputs. But the process is exactly the same!
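(A minimal sketch of the two workflows being compared, reusing `docs` from earlier in the thread; note that the exact update_topics signature varies by BERTopic version, and releases from around this time also take the topic assignments as an argument.)

```python
from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer

vectorizer_model = CountVectorizer(ngram_range=(1, 3), min_df=5)

# Workflow A: fit with the default vectorizer, then only refresh the
# topic representations afterwards.
model_a = BERTopic(calculate_probabilities=False, verbose=True)
topics_a, _ = model_a.fit_transform(docs)
model_a.update_topics(docs, topics_a, vectorizer_model=vectorizer_model)

# Workflow B: pass the vectorizer up front, so the same representation
# is used during the single fit_transform() call.
model_b = BERTopic(vectorizer_model=vectorizer_model,
                   calculate_probabilities=False, verbose=True)
topics_b, _ = model_b.fit_transform(docs)

# The cluster assignments (topics_a vs. topics_b) only differ because UMAP
# is stochastic; the vectorizer itself does not change the clustering.
```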

sokol11 commented 3 years ago

Understood, thank you!