Parallelizing inference is something that is currently being looked at. However, this might bring with it a bunch of issues as parallelization could be done on both CPU and GPU. Fortunately, compared to many other options, `sentence-transformers` is in itself already quite fast due to an efficient `DataLoader`.
Hi Maarten,
We are having a similar issue when trying to compute the embeddings for a very large dataset. Have there been any updates on the parallelizing feature, or do you have any other suggestions for computing the embeddings for large data? I am running this on my university's high-performance computing cluster, but it has a 48-hour runtime limit, which our model has already exceeded.
```python
sentence_model = SentenceTransformer("all-mpnet-base-v2")
embeddings = sentence_model.encode(alltweets)
```
I have found a suggestion for batch processing, but I do not know if this is applicable to BERTopic.
Many thanks!
@NicoleNisbett Just to be sure, you are using a GPU in that cluster, right?
There are some parameters that you could look at here, including a `batch_size` that might speed up the process. Also, make sure to set `verbose=True` in SentenceTransformer to get a feeling for how long the encoding is going to take.
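For example, something along these lines (the exact `batch_size` is just a starting point to tune, and `show_progress_bar` is the `encode` flag that shows how far along the encoding is):

```python
from sentence_transformers import SentenceTransformer

sentence_model = SentenceTransformer("all-mpnet-base-v2")

# Larger batches typically speed up encoding on a GPU; tune to fit GPU memory.
embeddings = sentence_model.encode(
    alltweets,
    batch_size=64,
    show_progress_bar=True,
)
```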
Lastly, if you have many millions of datapoints, it might be helpful to simply sample a subset from your data. Typically, you should have more than enough data with 1 million datapoints if you can make sure that the sampling procedure generates a representative subset. Then, you can `fit` on the subset and apply `transform` on the entire dataset to get all the topics.
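In code, that sample-then-transform workflow looks roughly like this (`docs` and the sample size are placeholders for your own data):

```python
import random

from bertopic import BERTopic

# Fit on a representative sample that fits your time/memory budget ...
sample_docs = random.sample(docs, 1_000_000)

topic_model = BERTopic(verbose=True)
topic_model.fit(sample_docs)

# ... then assign topics to the full corpus.
topics, probs = topic_model.transform(docs)
```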
Hi Maarten,
Yes, I'm using a GPU in that cluster. Thank you for your suggestion about the `batch_size` parameter; I will try that first and get back to you.
I've also come across this code to encode the sentences in parallel by using multiple GPU nodes which I will also try.
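For reference, the pattern I found is roughly the multi-process pool that sentence-transformers ships (this is my own sketch, not the exact code from that link; the device ids are just an example):

```python
from sentence_transformers import SentenceTransformer

sentence_model = SentenceTransformer("all-mpnet-base-v2")

# Spread the encoding work across several GPUs.
pool = sentence_model.start_multi_process_pool(target_devices=["cuda:0", "cuda:1"])
embeddings = sentence_model.encode_multi_process(alltweets, pool, batch_size=64)
sentence_model.stop_multi_process_pool(pool)
```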
Hi, I am undertaking a similar experiment to the one @NicoleNisbett mentioned in this post earlier and was wondering if the steps here are still considered the best way to go about this. I have a dataset of approximately 1 million articles and am using models from the `SentenceTransformer` library as well as models from the `transformers` library.
From what I have looked at, it seems the optimal way to do this is to train the embedding model via parallelization and then call transform on the docs afterwards. Does this seem like a valid approach, or would you recommend something different?
@ericchagnon15 With ~1 million documents, it might be worthwhile to use GPU-accelerated versions of UMAP and HDBSCAN to speed up both training and inference.
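Roughly, that would look like the following (assuming cuML is installed; `docs` and `embeddings` are placeholders, and the parameters are only a typical starting point):

```python
from bertopic import BERTopic
from cuml.cluster import HDBSCAN
from cuml.manifold import UMAP

# GPU-accelerated drop-in sub-models for BERTopic
umap_model = UMAP(n_components=5, n_neighbors=15, min_dist=0.0)
hdbscan_model = HDBSCAN(min_samples=10, gen_min_span_tree=True, prediction_data=True)

topic_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model, verbose=True)
topics, probs = topic_model.fit_transform(docs, embeddings)
```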
The combination of the data and the hardware I have could lead to memory issues. I tried using BERTopic's `partial_fit` so I could train the model in batches, but these accelerated methods do not seem to be compatible, as they do not have `partial_fit` methods themselves. I was wondering if you had experienced this before and had a suggestion on how to proceed.
@ericchagnon15 In that case it might be worthwhile to train on a subset of the data, as much as you can hold in memory, and then predict all other documents.
Would you recommend using a non-accelerated clustering algorithm like one of the sklearn ones, together with the `cuml` UMAP? My dataset is just over 2 million documents and I am planning on training on a subset of 1 million with batching to fix the memory issues. I am just unsure which clustering algorithm best supports the online learning that is done with `partial_fit`. Thanks!
It depends on the size of your batches, as UMAP will need a significant number of documents as well. Personally, I would stick with a non-online learning approach if you plan on using cuML's HDBSCAN; other than that, it would require some experimentation to see what works best for your use case.
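If you do want to experiment with the online route, every sub-model needs its own `partial_fit`; a minimal sketch (using sklearn's IncrementalPCA and MiniBatchKMeans as stand-ins for UMAP and HDBSCAN, with `doc_batches` as a placeholder for your own chunks) would be:

```python
from bertopic import BERTopic
from bertopic.vectorizers import OnlineCountVectorizer
from sklearn.cluster import MiniBatchKMeans
from sklearn.decomposition import IncrementalPCA

# Sub-models that expose partial_fit, so BERTopic can learn incrementally.
umap_model = IncrementalPCA(n_components=5)
cluster_model = MiniBatchKMeans(n_clusters=50, random_state=0)
vectorizer_model = OnlineCountVectorizer(stop_words="english")

topic_model = BERTopic(umap_model=umap_model,
                       hdbscan_model=cluster_model,
                       vectorizer_model=vectorizer_model)

# Feed the corpus in chunks that fit in memory.
for batch in doc_batches:
    topic_model.partial_fit(batch)
```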
As a note, the memory requirements for cuML's UMAP should be fairly small. However, HDBSCAN can have higher memory requirements when used with `BERTopic(calculate_probabilities=True)` and the default HDBSCAN parameters if your data ends up with many clusters. We have a work-in-progress PR to reduce the memory requirements, but there are some workarounds listed in this issue for now.
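In the meantime, the simplest mitigation is usually to not request the full document-topic probability matrix at all (it is off by default):

```python
from bertopic import BERTopic

# calculate_probabilities=False (the default) avoids building the large
# document-topic probability matrix that drives HDBSCAN's memory usage.
topic_model = BERTopic(calculate_probabilities=False)
```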
For large corpora of documents, extracting BERT embeddings will take a long time. Parallelizing it would be a sweet feature.