RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling cublasCreate(handle)

I am using sentence-transformers as a backend. Not sure which information you would need to debug this.
Could you create a reproducible example? As in, which code are you using and could you share the versions of packages that you are using? Also, are you working locally or on a server?
Thanks for coming back to me so quickly. I am using:

sentence-transformers==2.1.0
bertopic==0.9.1
numpy==1.21.2
CUDA build: cuda_11.4.r11.4/compiler.30300941_0
OS: Ubuntu 20.04 LTS
As to a reproducible example, I am not sure. I have a database with a couple of million tweets. The error only occurs when I do batch processing (loading about 60,000 tweets or more), like:
bertopic_model = BERTopic(calculate_probabilities=False).load(BERTOPIC_MODEL_LOCATION, embedding_model="sentence-transformers/all-mpnet-base-v2")
bertopic_model.transform(tweet_texts)
It does NOT occur if I go:
for text in tweet_texts:
    bertopic_model.transform(text)
I could give you access to the database if you have a DFN account.
There might be an issue with how you loaded the topic model. It should be loaded directly from the class, without instantiating it first. So like this:
bertopic_model = BERTopic.load(BERTOPIC_MODEL_LOCATION, embedding_model="sentence-transformers/all-mpnet-base-v2")
This might already alleviate some of the issues that you are having. If that does not work, it is likely that you are simply running out of GPU memory when trying to transform the text. Especially if you are trying to transform a million tweets, batch processing seems to be the way to go.
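As a rough illustration, batch-wise inference could look something like the sketch below. This is only a sketch, not tested against your setup; the chunk size of 10,000 is an arbitrary choice, and tweet_texts and BERTOPIC_MODEL_LOCATION are taken from your snippets above.

bertopic_model = BERTopic.load(BERTOPIC_MODEL_LOCATION,
                               embedding_model="sentence-transformers/all-mpnet-base-v2")

chunk_size = 10000  # arbitrary; pick whatever fits into your GPU memory
all_topics = []
for start in range(0, len(tweet_texts), chunk_size):
    chunk = tweet_texts[start:start + chunk_size]
    # transform returns (topics, probabilities); probabilities is None
    # when the model was created with calculate_probabilities=False
    topics, _ = bertopic_model.transform(chunk)
    all_topics.extend(topics)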
Thank you for your help so far! The documentation is not very clear on batch processing, or there is a bug there. If I load a model, fit_transform more data, and then save it, will it update the stored model, or will it write a new file containing only the data that has just been fitted? To rephrase the question: does fit_transform change the state of the model (i.e. improve the model with new data), or does it compute a new model? In the former case I could get away with something like this:
import os

count = 0
# batch() is my own helper that yields successive chunks of 1000 sentences
for trainings_batch in batch(corpus_for_fitting_sentences, 1000):
    # reload the stored model if it exists, otherwise start a fresh one
    if os.path.isfile(BERTOPIC_MODEL_LOCATION):
        bertopic_model = BERTopic.load(BERTOPIC_MODEL_LOCATION,
                                       embedding_model="sentence-transformers/all-mpnet-base-v2")
    else:
        bertopic_model = BERTopic(embedding_model="sentence-transformers/all-mpnet-base-v2", verbose=True,
                                  calculate_probabilities=False)
    count += 1
    try:
        bertopic_model.fit_transform(trainings_batch)
        bertopic_model.save(BERTOPIC_MODEL_LOCATION)
    except KeyError:
        print("could not transform batch number {}".format(count))
If I don't store the model in between, there is a KeyError on each batch except the first.
I think the memory issue, however, is due not to BERTopic or my usage but to my setup with CUDA. Thus, the original issue could be closed.
With respect to your question on fit_transform: this method simply trains the model from scratch. So if you run fit_transform once and then run it again with new data, it will only fit on the new data and forget all about the original data. Thus, a new model will be created.
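In other words (a hypothetical two-batch illustration; docs_batch_1 and docs_batch_2 are placeholder variables):

model = BERTopic(embedding_model="sentence-transformers/all-mpnet-base-v2")

model.fit_transform(docs_batch_1)  # model now reflects batch 1 only
model.fit_transform(docs_batch_2)  # retrains from scratch on batch 2;
                                   # everything learned from batch 1 is discarded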
Thank you for the swift response! In that case I don't know how I can implement batch processing for a larger dataset. If I do it in one go, I run out of memory at some point (after 60k batches or so). There seems to be an upper bound in the library. I am still investigating what you wrote about a newer version of hdbscan, but as I understood it, that only mitigates the problem; it does not allow for unlimited input size, if I got this right.
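One workaround worth sketching here, under the assumption that the memory pressure comes from computing the embeddings on the GPU: both fit_transform and transform accept precomputed embeddings, so you can encode the documents yourself with sentence-transformers (which processes them in small GPU batches via batch_size) and keep the resulting array in CPU memory:

from sentence_transformers import SentenceTransformer
from bertopic import BERTopic

embedder = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")
# encode() runs the texts through the GPU in batches of batch_size and
# returns a numpy array that lives in CPU memory
embeddings = embedder.encode(tweet_texts, batch_size=32, show_progress_bar=True)

bertopic_model = BERTopic.load(BERTOPIC_MODEL_LOCATION,
                               embedding_model="sentence-transformers/all-mpnet-base-v2")
topics, _ = bertopic_model.transform(tweet_texts, embeddings=embeddings)

Whether this resolves the CUBLAS allocation error in your setup is an open question, but it decouples the GPU-bound embedding step from the CPU-bound clustering step.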
Closing as this got far away from the original problem in the headline.