MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

If calculate_probabilities is set to False, a CUDA error occurs #322

Closed · juliandehne closed this issue 3 years ago

juliandehne commented 3 years ago

RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling cublasCreate(handle)

I am using sentence-transformers as a backend. I am not sure which information you would need to debug this.

MaartenGr commented 3 years ago

Could you create a reproducible example? As in, which code are you using and could you share the versions of packages that you are using? Also, are you working locally or on a server?

juliandehne commented 3 years ago

Thanks for coming back to me so quickly. I am using:

sentence-transformers==2.1.0
bertopic==0.9.1
numpy==1.21.2
CUDA build: cuda_11.4.r11.4/compiler.30300941_0
OS: Ubuntu 20 LTS

As for a reproducible example, I am not sure I can provide one. I have a database with a couple of million tweets. The error only occurs when I do batch processing (loading about 60,000 tweets or more), like:

bertopic_model = BERTopic(calculate_probabilities=False).load(BERTOPIC_MODEL_LOCATION, embedding_model="sentence-transformers/all-mpnet-base-v2")
bertopic_model.transform(tweet_texts)

It does NOT occur if I do:

for text in tweet_texts:
    bertopic_model.transform(text)

I could give you access to the database if you have a DFN account.

MaartenGr commented 3 years ago

There might be an issue with how you loaded the topic model. It should be loaded directly from the class, without instantiating it first. Like this:

bertopic_model = BERTopic.load(BERTOPIC_MODEL_LOCATION, embedding_model="sentence-transformers/all-mpnet-base-v2")

This might already alleviate some of the issues that you are having. If that does not work, it is likely that you are simply running out of GPU memory when trying to transform the text. Especially if you are trying to transform a million tweets, batch processing seems to be the way to go.
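
As a rough sketch (untested on your data; tweet_texts and chunk_size are placeholder names), batching the transform calls yourself could look something like this:

from bertopic import BERTopic

bertopic_model = BERTopic.load(BERTOPIC_MODEL_LOCATION,
                               embedding_model="sentence-transformers/all-mpnet-base-v2")

# Transform in chunks so only chunk_size documents hit the GPU at once.
chunk_size = 10_000  # placeholder value; pick whatever fits in GPU memory
all_topics = []
for i in range(0, len(tweet_texts), chunk_size):
    topics, _ = bertopic_model.transform(tweet_texts[i:i + chunk_size])
    all_topics.extend(topics)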

juliandehne commented 3 years ago

Thank you for your help so far! The documentation is not very clear on batch processing, or there is a bug there. If I load a model, fit_transform more data, and then save it, will it update the stored model, or will it write a new file containing only the data that has just been fitted? To rephrase the question: does fit_transform change the state of the model (i.e., improve the model with new data), or does it compute a new model from scratch? If it updates the state, I could get away with something like this:

import os
from bertopic import BERTopic

# `batch` is my own helper that yields chunks of the given size.
count = 0
for trainings_batch in batch(corpus_for_fitting_sentences, 1000):
    if os.path.isfile(BERTOPIC_MODEL_LOCATION):
        bertopic_model = BERTopic.load(BERTOPIC_MODEL_LOCATION,
                                       embedding_model="sentence-transformers/all-mpnet-base-v2")
    else:
        bertopic_model = BERTopic(embedding_model="sentence-transformers/all-mpnet-base-v2", verbose=True,
                                  calculate_probabilities=False)
    count += 1
    try:
        bertopic_model.fit_transform(trainings_batch)
        bertopic_model.save(BERTOPIC_MODEL_LOCATION)
    except KeyError:
        print("could not transform batch number {}".format(count))

If I don't save and reload the model in between, there is a KeyError on every batch except the first.

I think the memory issue, however, is not due to BERTopic or my usage but to my CUDA setup. Thus, the original issue could be closed.

MaartenGr commented 3 years ago

With respect to your question on fit_transform: this method trains the model from scratch. So if you run fit_transform once and then run it again with new data, it will only fit on the new data and forget all about the original data. Thus, a new model is created each time.
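
For what it's worth, a common workaround (a sketch, not an official recipe; sample_texts and remaining_texts are placeholder names) is to fit once on a manageable sample and then only transform, never refit, the rest:

from bertopic import BERTopic

# Fit once on a representative sample; any later fit or fit_transform call
# would retrain from scratch and discard this model's state.
bertopic_model = BERTopic(embedding_model="sentence-transformers/all-mpnet-base-v2",
                          calculate_probabilities=False, verbose=True)
bertopic_model.fit(sample_texts)
bertopic_model.save(BERTOPIC_MODEL_LOCATION)

# Assign topics to everything else without refitting.
topics = []
for i in range(0, len(remaining_texts), 10_000):
    batch_topics, _ = bertopic_model.transform(remaining_texts[i:i + 10_000])
    topics.extend(batch_topics)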

juliandehne commented 3 years ago

Thank you for the swift response! In that case I don't know how I can implement batch processing for a larger dataset. If I do it in one go, I run out of memory at some point (after 60k tweets or so), so there seems to be an upper bound for the library. I am still investigating something you wrote about a newer version of hdbscan, but as I understood it, that only mitigates the problem; it does not allow for an unlimited amount of input data, if I got this right.

juliandehne commented 3 years ago

Closing as this got far away from the original problem in the headline.