Glad to hear that extracting the embeddings is much faster now!
Although it could be due to the c-TF-IDF, there are a few more things that could cause this issue.
First, could you share the code that you used? This helps in identifying the issue by looking at your hyperparameters, order of code, etc.
Second, how many documents do you have, and how did you pass them to the fit_transform() call? A large number of documents could be the cause, but so could documents that are tokenized or passed incorrectly.
Ok. Here's how I instantiate the model:
from bertopic import BERTopic
from flair.embeddings import TransformerDocumentEmbeddings
from sklearn.feature_extraction.text import CountVectorizer
embeddings = TransformerDocumentEmbeddings('microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext')
embeddings.tokenizer.model_max_length = 512
vectorizer_model = CountVectorizer(
    max_df=0.5,
    ngram_range=(1, 3)
)
topic_model = BERTopic(
    embedding_model=embeddings,
    min_topic_size=10,
    vectorizer_model=vectorizer_model,
    calculate_probabilities=False,
    verbose=True
)
As for the documents, these are scientific publication abstracts. I have them pre-loaded in a df column, so I just convert that column to a list and remove some HTML tags from the strings, like so:
docs = df['abstract'].astype(str).tolist()
docs = [i.replace('<sup>', '') for i in docs]
docs = [i.replace('<i>', '') for i in docs]
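(If more tags turn up later, a single regex pass would cover them all; just a small variant of the same cleanup, assuming only simple tags like these need stripping:)
import re
# Strip both opening and closing <sup>/<i> tags in one pass instead of chained replace() calls.
docs = [re.sub(r'</?(sup|i)>', '', d) for d in docs]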
Originally, I also had this line there: docs = [i.rstrip('.').replace('.', ' [SEP]') for i in docs]. But then I noticed that the model I use (microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext) starts treating "sep" as a word token, so it pops up in the topics. So now I just let the model's pre-trained tokenizer do its thing, which I believe adds [SEP] only at the end of the document, so it treats the whole doc as one long sentence. I imagine this is intended behavior, but I'm not entirely sure.
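(For what it's worth, a quick way to confirm that the tokenizer adds the special tokens by itself is to inspect it directly; a standalone check, separate from the pipeline above:)
from transformers import AutoTokenizer

# Load the same tokenizer that Flair wraps and look at the tokens it produces.
tokenizer = AutoTokenizer.from_pretrained('microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext')
encoded = tokenizer("First sentence. Second sentence.")
# Prints something like ['[CLS]', 'first', 'sentence', '.', 'second', 'sentence', '.', '[SEP]']
# (exact word pieces depend on the vocab), i.e. [CLS]/[SEP] are added automatically.
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))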
Finally, I run the model, like so:
topics, probs = topic_model.fit_transform(docs)
Here's the log from the last time I ran it before it crashed:
29439it [13:13, 37.09it/s]
2021-05-28 02:06:00,132 - BERTopic - Transformed documents to Embeddings
2021-05-28 02:06:33,885 - BERTopic - Reduced dimensionality with UMAP
2021-05-28 02:06:36,556 - BERTopic - Clustered UMAP embeddings with HDBSCAN
I tried running it both on my entire data set (~50,000 docs) and a reduced data set (~30,000 docs). It still crashed on the reduced data set.
Any thoughts? Thank you!
Thank you for the extensive response! I am quite sure that the c-TF-IDF matrix has become too large for you to keep in memory. The main cause for this is the ngram_range of (1, 3). Even with few documents, n-grams of size 3 can create an enormous number of words since many combinations are possible. To prevent this, simply set min_df in the CountVectorizer to a value larger than 1. Most of the n-grams of size 3 that are created appear only once or twice across all documents, and setting min_df to a larger value prevents those from being used in the c-TF-IDF matrix. Try setting it to 5 and see what happens!
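Concretely, with your other settings kept as they are, that would look something like this (only the min_df value is new here; tune it to your data):
from sklearn.feature_extraction.text import CountVectorizer

# Same settings as before, but ignore n-grams that occur in fewer than 5 documents,
# which keeps the vocabulary (and with it the c-TF-IDF matrix) much smaller.
vectorizer_model = CountVectorizer(
    max_df=0.5,
    min_df=5,
    ngram_range=(1, 3)
)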
Thanks, Maarten! This appears to have been the exact issue. After I set min_df=5, RAM load is no longer a problem.
Quick follow-up question: are the topic clusters affected by the ngram_range or other vectorizer settings? Or do those settings only affect the topic words/representation?
For example, suppose my vectorizer_model = CountVectorizer(ngram_range=(1, 3), min_df=5). If I run .fit_transform() with default parameters, followed by .update_topics(vectorizer_model=vectorizer_model), would the final topics be the same as if I instead passed that vectorizer to BERTopic up front and made a single .fit_transform() call, without update_topics()?
Thank you, really appreciate your help!
The CountVectorizer should not affect the creation of the topic clusters. It is merely meant to create a nicely interpretable topic representation that you can control after the topic model has been created. It might affect topic reduction, though, since merging similar topics is based, in part, on their topic representations.
In theory, that should be the case. However, since UMAP is stochastic, you will notice differences between those outputs. But the process is exactly the same!
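To make the comparison concrete, the two routes look roughly like this (just a sketch; depending on your BERTopic version, update_topics may also need the topics returned by fit_transform, as shown here):
from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer

vectorizer_model = CountVectorizer(ngram_range=(1, 3), min_df=5)

# Route 1: pass the vectorizer up front and fit once.
model_a = BERTopic(vectorizer_model=vectorizer_model)
topics_a, probs_a = model_a.fit_transform(docs)  # docs: the same list of abstracts as above

# Route 2: fit with the default vectorizer, then only rebuild the topic
# representations afterwards; the underlying clusters stay the same.
model_b = BERTopic()
topics_b, probs_b = model_b.fit_transform(docs)
model_b.update_topics(docs, topics=topics_b, vectorizer_model=vectorizer_model)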
Understood, thank you!
Hi. I tried running BERTopic on Google Colab cloud GPUs. The embedding is blazingly fast, compared to what I have been getting on my CPU server.
Unfortunately, at the very end of the .fit_transform() call, the algorithm overflows the server's RAM (25 GB) and crashes. The crash happens after the HDBSCAN clustering is complete, so I imagine it occurs during the c-TF-IDF step. Memory load stays around 6-8 GB during all prior steps but then shoots up rapidly after HDBSCAN.
I imagine this is expected behavior for c-TF-IDF, and I just need a high-memory server. Still, I wanted to double-check whether it is something that can be rectified with the algorithm settings, or if it could be a bug. Btw, I tried setting calculate_probabilities=False, but it did not make a difference. Any guidance would be hugely appreciated. Thank you!