MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/

Parallelizing `extract_embeddings()` #26

Closed armandidandeh closed 3 years ago

armandidandeh commented 3 years ago

For large corpora of documents, extracting BERT embeddings will take a long time.

Parallelizing it would be a sweet feature.

MaartenGr commented 3 years ago

Parallelizing inference is something that is currently being looked at. However, it might bring a number of issues with it, since parallelization could be done on both CPU and GPU. Fortunately, compared to many other options, sentence-transformers is already quite fast in itself due to an efficient DataLoader.

NicoleNisbett commented 2 years ago

Hi Maarten,

We are having a similar issue when trying to compute the embeddings for a very large dataset. Have there been any updates on the parallelizing feature, or do you have any other suggestions for computing the embeddings for large data? I am running the model on my university's high-performance computing cluster, but it has a 48-hour runtime limit, which our model has already exceeded.

from sentence_transformers import SentenceTransformer

# Encode the full corpus of tweets with a pretrained model
sentence_model = SentenceTransformer("all-mpnet-base-v2")
embeddings = sentence_model.encode(alltweets)

I have found a suggestion for batch processing, but I do not know whether it is applicable to BERTopic.

Many thanks!

MaartenGr commented 2 years ago

@NicoleNisbett Just to be sure, you are using a GPU in that cluster, right?

There are some parameters that you could look at here, including batch_size, which might speed up the process. Also, make sure to set show_progress_bar=True when calling .encode() to get a feeling of how long the encoding is going to take.
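
A minimal sketch of how those parameters might be passed; the batch_size of 64 is an illustrative value to tune against your GPU memory:

from sentence_transformers import SentenceTransformer

sentence_model = SentenceTransformer("all-mpnet-base-v2")
# Larger batches usually improve GPU throughput; the progress bar
# gives a running ETA for the full encoding pass.
embeddings = sentence_model.encode(
    alltweets,
    batch_size=64,
    show_progress_bar=True,
)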

Lastly, if you have many millions of datapoints, it might be helpful to simply sample a subset of your data. Typically, 1 million datapoints should give you more than enough data, provided the sampling procedure generates a representative subset. Then you can fit on the subset and apply transform on the entire dataset to get topics for all documents.
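
A minimal sketch of that sample-then-transform workflow, assuming docs holds the full corpus; the sample size and seed are illustrative:

import random

from bertopic import BERTopic

# Fit on a representative random sample that fits the time budget...
random.seed(42)
sample_docs = random.sample(docs, k=1_000_000)
topic_model = BERTopic().fit(sample_docs)

# ...then assign topics to every document in the full corpus.
topics, probs = topic_model.transform(docs)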

NicoleNisbett commented 2 years ago

Hi Maarten,

Yes, I'm using a GPU in that cluster. Thank you for the suggestion about the batch_size parameter; I will try that first and get back to you.

I've also come across this code for encoding the sentences in parallel across multiple GPU nodes, which I will also try.
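
For anyone landing here later, the multi-GPU pattern in sentence-transformers looks roughly like this (a sketch; alltweets is the document list from above):

from sentence_transformers import SentenceTransformer

sentence_model = SentenceTransformer("all-mpnet-base-v2")

# Spawn one worker process per available GPU (a device list can also be passed).
pool = sentence_model.start_multi_process_pool()
embeddings = sentence_model.encode_multi_process(alltweets, pool)
sentence_model.stop_multi_process_pool(pool)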

ericchagnon15 commented 1 year ago

Hi, I am undertaking a similar experiment to the one @NicoleNisbett mentioned earlier in this post and was wondering whether the steps here are still considered the best way to go about this. I have a dataset of approximately 1 million articles and am using models from the SentenceTransformer library as well as from the transformers library.

From what I have looked at, it seems the optimal way to do this is to train the embedding model via parallelization and then call transform on the docs afterwards. Does this seem like a valid approach, or would you recommend something different?

MaartenGr commented 1 year ago

@ericchagnon15 With ~1 million documents, it might be worthwhile to use GPU-accelerated versions of UMAP and HDBSCAN to speed up both training and inference.
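
For example, cuML's UMAP and HDBSCAN can be passed straight into BERTopic as drop-in replacements (a sketch; the hyperparameters mirror common defaults and may need tuning):

from bertopic import BERTopic
from cuml.cluster import HDBSCAN
from cuml.manifold import UMAP

# GPU-accelerated replacements for the default umap-learn and hdbscan models
umap_model = UMAP(n_components=5, n_neighbors=15, min_dist=0.0)
hdbscan_model = HDBSCAN(min_samples=10, gen_min_span_tree=True, prediction_data=True)

topic_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model)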

ericchagnon15 commented 1 year ago

The combination of the data size and the hardware I have could lead to memory issues. I tried using BERTopic's partial_fit so I could train the model in batches, but these accelerated methods do not seem to be compatible, as they do not have partial_fit methods themselves. I was wondering if you had experienced this before and had a suggestion on how to proceed.

MaartenGr commented 1 year ago

@ericchagnon15 In that case it might be worthwhile to train on a subset of the data, as much as you can hold in memory, and then predict topics for all other documents.

ericchagnon15 commented 1 year ago

Would you recommend using a non-accelerated clustering algorithm, like one of the scikit-learn ones, together with cuML's UMAP? My dataset is just over 2 million documents, and I am planning on training on a subset of 1 million with batching to fix the memory issues. I am just unsure which clustering algorithm best supports the online learning that is done with partial_fit. Thanks!

MaartenGr commented 1 year ago

It depends on the size of your batches, as UMAP will need a significant number of documents as well. Personally, I would stick with a non-online-learning approach if you plan on using cuML's HDBSCAN; other than that, it would require some experimentation to see what works best for your use case.
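
For completeness, the online-learning setup swaps every sub-model for one that implements partial_fit; a sketch following the documented pattern, with doc_batches standing in for your batched corpus:

from bertopic import BERTopic
from bertopic.vectorizers import OnlineCountVectorizer
from sklearn.cluster import MiniBatchKMeans
from sklearn.decomposition import IncrementalPCA

# UMAP and HDBSCAN lack partial_fit, so they are replaced with
# incremental equivalents from scikit-learn.
umap_model = IncrementalPCA(n_components=5)
cluster_model = MiniBatchKMeans(n_clusters=50, random_state=0)
vectorizer_model = OnlineCountVectorizer(stop_words="english")

topic_model = BERTopic(
    umap_model=umap_model,
    hdbscan_model=cluster_model,
    vectorizer_model=vectorizer_model,
)

# Incrementally train on batches small enough to hold in memory.
for batch in doc_batches:
    topic_model.partial_fit(batch)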

beckernick commented 1 year ago

As a note, the memory requirements for cuML's UMAP should be fairly small. However, HDBSCAN can have higher memory requirements when used with BERTopic(calculate_probabilities=True) and the default HDBSCAN parameters, if your data ends up with many clusters. We have a work-in-progress PR to reduce the memory requirements, but there are some workarounds listed in this issue for now.
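
One simple mitigation implied above, for reference: leave calculate_probabilities at its default of False so the full document-topic probability matrix is never materialized (a sketch, with docs as your corpus):

from bertopic import BERTopic

# The n_docs x n_topics probability matrix drives the memory cost;
# with calculate_probabilities=False only hard topic assignments are kept.
topic_model = BERTopic(calculate_probabilities=False)
topics, _ = topic_model.fit_transform(docs)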