MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

BERTopic.load() takes forever. #557

Closed Cspellz closed 1 year ago

Cspellz commented 2 years ago

Hi, Thanks in advance for your time and response. :)

I had saved a model that is around 8.5 GB. When using BERTopic.load(), I waited for over 5 hours and it still hadn't loaded. It took around 4-5 hours to train the model, so why does loading take the same amount of time or more?

Also, since there is no verbosity, it just looks like it is stuck. Is there some way to turn on verbosity?

I am doing this on my 6-core, 32 GB RAM laptop.

Is there a way to make it faster, or am I doing something wrong?

Thanks and Regards, Alex

MaartenGr commented 2 years ago

With such a large model, it can indeed take a while longer for the model to load. However, I would not have expected it to take that long. Could you tell me a bit more about the training procedure? How many documents did you train on? How did you initialize BERTopic? And how did you save the model? Also, which version of BERTopic did you use?

Cspellz commented 2 years ago

With such a large model, it can indeed take a while longer for the model to load. However, I would not have expected it to take that long. Could you tell me a bit more about the training procedure? How many documents did you train on? How did you initialize BERTopic? And how did you save the model? Also, which version of BERTopic did you use?

Thank you for your reply @MaartenGr !

Each source document is a 3- or 4-word phrase, like skill sets and job titles, and there are around 1.3 million rows.

It was the default initialization using sentence_transformers. The fit_transform() took around 4-5 hours.

from bertopic import BERTopic
from sentence_transformers import SentenceTransformer

sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
topic_model = BERTopic(embedding_model=sentence_model, verbose=True, nr_topics=100)
topics, probabilities = topic_model.fit_transform(df)

Thanks Again!! :)

MaartenGr commented 2 years ago

Nothing in particular stands out to me other than the relatively large dataset you are training on, which should be fine. Some optimization could be done there, however, with respect to the resulting c-TF-IDF matrix:

from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer

# Train BERTopic with a custom CountVectorizer
vectorizer_model = CountVectorizer(min_df=10)
topic_model = BERTopic(vectorizer_model=vectorizer_model)
topics, probs = topic_model.fit_transform(docs)

Specifying the min_df parameter reduces the vocabulary and thereby the size of the c-TF-IDF matrix, which in turn reduces the size of the model and could improve the loading time.

Having said that, it is strange to me that it does not load at all. Even a model of several GBs should not take longer than a few minutes at most. Could you try loading it in a fresh environment with the same package versions to see if it works there?

Also, how did you save the model and which version of BERTopic did you use?

Cspellz commented 2 years ago

@MaartenGr, I created a fresh environment and used the latest BERTopic, and it loaded within a minute. Thanks.

I had another question: to apply the topics to my original dataset, I am using topic_model.find_topics() within an apply-lambda. Using Swifter, it takes around 3-3.5 hours. Is there an easier way to do this?

For example, passing a Pandas Series and getting back a Pandas Series with the topics and/or the probabilities?

Thanks Again !!

MaartenGr commented 2 years ago

Glad to hear that it is working now!

I am not entirely sure if I understand your question but to assign a document to a topic, you can follow this pipeline:

from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups
import pandas as pd

# Train the model
docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)

# Assign each document to its corresponding topic
results = pd.DataFrame({"Doc": docs, "Topic": topics})

In the above example, you can see that each document in docs corresponds to the topic in topics that shares the same index.

Cspellz commented 2 years ago

Thanks Again! @MaartenGr

Apologies for not making it clear. I got the topics and probs from fit_transform, but it only returns the first topic (highest probability) and its probability.

I require all possible topics, like the top 5 or top 10 topics, similar to the output from .find_topics(). Would that be possible via some setting of arguments?

It was to get these lists of topics and probabilities that I was manually running .find_topics() via a Swifter apply-lambda.

Edit: I just checked; the topics returned from fit_transform() and the topics I get from .find_topics() are totally different. I assumed fit_transform() returned the topic with the highest probability, but it appears to return mostly random topics (some match), and a lot of documents are assigned to -1. If I run the same document manually with .find_topics(), I get a different topic list that appears to be an accurate match to the topic. In the image below, "topic" is what I got from fit_transform(), while "topic_list" is what I got using .find_topics(); as you can see, some match while some do not.

[image: comparison of the "topic" column from fit_transform() with the "topic_list" column from .find_topics()]

MaartenGr commented 2 years ago

I require all possible topics like top 5 or top 10 topics similar to the output from .find_topics(), would that be possible via setting of arguments?

You can get the topic-document probability matrix by setting calculate_probabilities=True when instantiating BERTopic. The resulting probs variable will then contain, for each document, the probability of it belonging to each topic. The .find_topics method, which quickly finds topics by keyword, is not suited to calculating this topic-document probability matrix.
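Once you have that matrix, extracting the top-N topics per document is a plain NumPy operation. A minimal sketch, where the random probs array is a hypothetical stand-in for the real matrix returned by fit_transform with calculate_probabilities=True (one row per document, one column per non-outlier topic):

```python
import numpy as np

# Hypothetical stand-in for the (n_docs, n_topics) probability matrix
rng = np.random.default_rng(42)
probs = rng.random((3, 10))
probs /= probs.sum(axis=1, keepdims=True)

top_n = 5
# Sort each row in descending order and keep the first top_n topic indices
top_topics = np.argsort(-probs, axis=1)[:, :top_n]
# Look up the corresponding probabilities per document
top_probs = np.take_along_axis(probs, top_topics, axis=1)

print(top_topics.shape)  # (3, 5)
```

This replaces the per-row .find_topics() calls entirely, since the whole matrix is produced in one pass.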

Cspellz commented 2 years ago


@MaartenGr, any updates on this? The topics we get during training mostly get assigned to -1, while if we run the same documents manually using .find_topics(), we get different topics other than -1.

MaartenGr commented 2 years ago

@Cspellz The topics that you get from .fit_transform and .find_topics are indeed different. .fit_transform uses the main underlying algorithm of BERTopic, whereas .find_topics performs a keyword-based search by comparing the embedding of the keyword with the topic embeddings. .find_topics is not meant to replace .transform; it is merely a quick way to find topics. Moreover, .find_topics does not consider -1 to be a topic and by default does not return it at all.

MaartenGr commented 2 years ago

Also, a quick note: I do not get a notification when someone updates a message, which means that I generally do not see additions and edits to messages you make. If you want to add something, I would advise either creating a new comment or simply pinging me.

Cspellz commented 2 years ago

@MaartenGr, thanks for the explanation. I understand what you mean, but I can't use .fit_transform against the whole dataset; it's quite huge (1.5 million distinct and over 5 million individual documents).

I tried running it against the full 1.5 million and 5 million datasets separately, and both runs ended with the Python kernel restarting with OOM errors. So what I did was take a sample of about 30-40% of the unique set (400-500k), train the model, and then apply the model to the 5 million documents using .find_topics. This is still slow, but I was able to do it locally within 2-3 days (sometimes the machine goes into sleep mode). I will be trying a distributed approach next using Databricks/PySpark, so I'm hoping for the best.

It would be great if you have any recommendations for handling a large number of documents (>5 million).

MaartenGr commented 2 years ago

The general idea here is indeed to first use either .fit or .fit_transform. After that, however, it is advised to use .transform instead, since .find_topics works differently from how the model was fitted.
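That fit-on-a-sample, transform-the-rest pipeline can be sketched as follows. This is not from the thread: it assumes a fitted model exposing a BERTopic-style .transform(docs) that returns (topics, probs), and the batch size is an arbitrary choice to keep peak memory bounded:

```python
def batched_transform(model, docs, batch_size=100_000):
    """Assign topics to a large document list in memory-bounded chunks."""
    all_topics, all_probs = [], []
    for start in range(0, len(docs), batch_size):
        # model.transform returns (topics, probs) for this batch only,
        # so only one batch's worth of documents is processed at a time
        topics, probs = model.transform(docs[start:start + batch_size])
        all_topics.extend(topics)
        all_probs.extend(probs)
    return all_topics, all_probs
```

After fitting on the 400-500k sample, something like `topics, probs = batched_transform(topic_model, all_docs)` would then cover the full document set without holding intermediate results for all 5 million documents at once.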

It would be great if you have any recommendations for handling a large number of documents (>5 million).

There are quite a number of ways to reduce the memory needed to train your model; you can find an overview of them in the documentation here. Moreover, it might also be helpful to use GPU-accelerated HDBSCAN and UMAP models, as mentioned here. You could also use different sub-models, like PCA, which potentially scale a bit better.
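As a configuration sketch of that last suggestion (a sketch under assumptions, not a tuned recommendation: n_components=5 is an arbitrary choice, and any estimator with fit/transform can be passed as the dimensionality-reduction step):

```python
from bertopic import BERTopic
from sklearn.decomposition import PCA

# Swap the default UMAP step for PCA, which scales more predictably
# in memory; low_memory=True additionally trades speed for footprint.
dim_model = PCA(n_components=5)  # n_components is an assumption, tune it
topic_model = BERTopic(umap_model=dim_model, low_memory=True, verbose=True)
```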

MaartenGr commented 1 year ago

Due to inactivity, I'll be closing this for now. If you have any questions or want to continue the discussion, I'll make sure to re-open the issue!