Closed yotammarton closed 2 years ago
With respect to UMAP, I believe there is an n_jobs parameter that is set to -1 by default, so all cores should be used, at least for training. I think this would also affect inference, but I am not entirely sure. The same applies to HDBSCAN, where the equivalent parameter (core_dist_n_jobs) is set to 4 by default.
Do you see multiple cores being used during training?
Indeed, I saw these parameters of both UMAP and HDBSCAN.
Unfortunately, HDBSCAN caused training problems with n_jobs > 1 on large datasets (about 3.5M documents), so I set it to 1. I haven't tried increasing it for inference, as the source code of approximate_predict does not appear to use multiprocessing:
https://github.com/scikit-learn-contrib/hdbscan/blob/4052692af994610adc9f72486a47c905dd527c94/hdbscan/prediction.py#L396
As for UMAP, I use it with the default n_jobs parameter, which is set to -1. During training I do see multi-core usage, but not during inference.
Reading through some code in both packages, it does seem that inference, especially with HDBSCAN, is significantly slower than training. We can either raise this on each package's issue page or do some parallelization ourselves until the related packages are updated. I am a big fan of the former rather than the latter, as it fixes the issue at the source rather than creating a very temporary fix (which would need to be removed once one of these packages is updated).
Having said that, we can find a middle ground by trying to parallelize BERTopic's inference step outside of the .transform() step. You could generate the embeddings first and then parallelize .transform(), as it contains roughly three steps:
However, I am not entirely sure this will work without any issues, as we would be parallelizing numba.jit, which feels a bit counterintuitive.
With the v0.10 release, we can now use the GPU-accelerated versions of UMAP and HDBSCAN which hopefully should speed things up quite a bit. For now, I'll be closing this issue but if anyone else has ideas for faster inference, please let me know!
Apologies in advance for a possibly dumb question.
How do I enable GPU acceleration for reducing embeddings and clustering?
With version 0.11.0, my GPU is only used in the first stage (embedding documents).
Thanks in advance.
@NateTheGreat001 Good question! The GPU acceleration for reducing embeddings and clustering is not enabled by default as it requires a specific set of dependencies/packages that allow for this acceleration. Namely, you will need to install cuML and use their models in BERTopic as described here in order to use them.
Thank you for your quick reply. Over the last days I gained an awful lot of respect for you and your work.
I spent the entire Monday trying to install cuml together with bertopic on wsl2. Had no success.
I am getting the following error:
Batches: 0%| | 0/81800 [00:00<?, ?it/s]Traceback (most recent call last):
File "/home/natethegreat/bertopic/bertopic_model_cuml.py", line 14, in
The entire history of my actions is attached in my_attempt.txt
The summary is as follows:
1. Install RAPIDS with conda create -n rapids -c rapidsai -c nvidia -c conda-forge rapids=22.06 python=3.9 cudatoolkit=11.5. That is the only way of installing cuml that works for me; a plain conda install -c rapidsai cuml gives an error.
2. Install hdbscan with conda install -c conda-forge hdbscan. I do this because pip install bertopic "fails to build wheels for hdbscan" every time.
3. Install bertopic with pip install bertopic. Please note that, no matter the OS or whether cuml is used, any attempt to use bertopic at this stage results in the above-mentioned error, likely because pip install bertopic brings in only torch and torchvision and no compatible cudatoolkit. To use bertopic with GPU-accelerated embeddings, the next step is to uninstall the torch and torchvision that came with the original bertopic installation and do a fresh installation of torch with a compatible CUDA toolkit.
The problem, as I understand it, is that cuml and torch use different versions of CUDA without overlap. So in order to make torch work I install, for example, cudatoolkit=11.6. This version, like any other version used by torch, is incompatible with cuml.
I hope you can make some sense out of this mess and hopefully point me in the right direction.
Thank you in advance
Cheers, Nate
In your provided output, PyTorch is throwing an error saying that you’re using an unsupported setup (for which there could be several potential reasons). Assuming you're using a fairly recent GPU with your recent driver, the most likely reason is that PyTorch requires specific versions of CUDA toolkit and you’re selecting one for which it doesn’t provide a compatible binary (11.5), as it isn’t compiled with CUDA Compatibility in at least some versions. cuML is built with CUDA Compatibility, so you should select the minor version of cudatoolkit consistent with PyTorch and let cuML “just work” (as long as it’s the same major version). It sounds like you tried something similar and saw some issues, but it might be worth trying again with a fresh environment similar to the example below (if you were installing/removing things in the same environment it can get messy).
Within this environment, when you pip install bertopic, pip needs to bring in all the dependencies, including ones that may need to be compiled (such as hdbscan). As pip is not a wider system package manager (in comparison to Conda), it will throw an error if necessary parts of the dependency chain are missing (such as things you might need to successfully compile a needed package). Using the conda packages often allows you to avoid this by getting the pre-compiled binary. If you can’t install the necessary packages, you can try the conda package for bertopic.
On my Linux machine, the following example works without issue (though I haven't tested it on WSL). Note that I pinned pytorch to 1.11 to explicitly avoid this issue:
mamba create -n torchrapids -c rapidsai-nightly -c nvidia -c conda-forge -c pytorch cuml=22.08 python=3.9 cudatoolkit=11.3 pytorch=1.11 torchvision torchaudio bertopic
conda activate torchrapids
(torchrapids) nicholasb@nicholasb-HP-Z8-G4-Workstation:~$ python
Python 3.9.13 | packaged by conda-forge | (main, May 27 2022, 16:56:21)
[GCC 10.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from bertopic import BERTopic
>>> from sklearn.datasets import fetch_20newsgroups
>>> from cuml.cluster import HDBSCAN
>>> from cuml.manifold import UMAP
>>> import torch
>>>
>>> print(torch.cuda.is_available())
True
>>> docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']
>>>
>>> # Create instances of GPU-accelerated UMAP and HDBSCAN
>>> umap_model = UMAP(n_components=5, n_neighbors=15, min_dist=0.0)
>>> hdbscan_model = HDBSCAN(min_samples=10, gen_min_span_tree=True)
>>>
>>> # Pass the above models to be used in BERTopic
>>> topic_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model)
>>> topics, probs = topic_model.fit_transform(docs)
>>>
Let me know if the above helps provide some direction.
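When an environment like the one above misbehaves, it can help to first record which package versions were actually installed. The following is a minimal sketch using only the standard library. The helper name installed_versions and the package list are illustrative assumptions, and distribution names may differ between pip and conda builds.

```python
from importlib import metadata
from typing import Dict, Iterable, Optional


def installed_versions(packages: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if absent."""
    versions: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Hypothetical package list; adjust to the names used in your environment.
print(installed_versions(["torch", "cuml", "bertopic", "hdbscan", "umap-learn"]))

# If torch is installed, torch.version.cuda and torch.cuda.is_available()
# additionally show which CUDA build PyTorch was compiled against and
# whether it can see the GPU.
```

A None entry means the distribution was not found under that name, which is worth checking before suspecting a CUDA mismatch.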
That worked!! Thanks a lot! You guys are the best!
Would probably make sense to include this installation line in the docs
Spoke much too soon.
I am trying to make a model on a relatively large sample (approx 2.5m documents)
Here is the code:
from bertopic import BERTopic
from cuml.cluster import HDBSCAN
from cuml.manifold import UMAP
import pickle

docs = pickle.load(open("docs.pkl", "rb"))  # <--- 330 MB, approx. 2.5M samples

umap_model = UMAP(n_components=5, n_neighbors=15, min_dist=0.0)
hdbscan_model = HDBSCAN(min_samples=10, gen_min_span_tree=True)

topic_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model, verbose=True, calculate_probabilities=False, low_memory=True)
topics = topic_model.fit_transform(docs)

pickle.dump(topic_model, open("bert_model_2.0.pkl", "wb"))
It worked fine without cuml acceleration, but now gives out this error:
2022-07-26 11:34:16,106 - BERTopic - Transformed documents to Embeddings
Traceback (most recent call last):
File "/home/natethegreat/bertopic/bertopic_model_cuml.py", line 14, in
My understanding is that my GPU memory is the problem. I have just 8 GB of it versus 128 GB of regular RAM.
Would it solve my problem if I split the large sample into smaller ones and ran fit_transform on the smaller samples in a loop?
Adding vectorizer_model = CountVectorizer(ngram_range=(1, 2), stop_words="english", min_df=10000) did not help.
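On the question of looping over smaller samples: calling fit_transform repeatedly refits the model on each subset rather than accumulating across them. One possible alternative, sketched below under assumptions, is to fit once on a subset that fits in GPU memory and then assign topics to all documents in batches with transform. The helper names (iter_batches, transform_in_batches), the subset size, and the batch size are hypothetical. The only thing assumed about the model is a transform(docs) method returning a (topics, probabilities) pair, as BERTopic's does. Whether this avoids the memory error will depend on the subset size and the GPU.

```python
from typing import Iterator, List, Sequence


def iter_batches(items: Sequence, batch_size: int) -> Iterator[Sequence]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def transform_in_batches(model, docs: Sequence[str], batch_size: int = 100_000) -> List:
    """Assign topics batch by batch with an already fitted model.

    `model` is any object exposing transform(docs) -> (topics, probabilities).
    """
    topics: List = []
    for batch in iter_batches(docs, batch_size):
        batch_topics, _ = model.transform(batch)
        topics.extend(batch_topics)
    return topics


# Hypothetical usage with a BERTopic instance (not executed here):
#   import random
#   topic_model.fit(random.sample(docs, 500_000))   # subset size is an assumption
#   topics = transform_in_batches(topic_model, docs)
```

The topics found this way reflect only what was present in the fitted subset, so the subset should be a representative sample of the full collection.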
I wonder if it is possible to run fit_transform on multiple GPU.
@summer5e55 That will depend on the underlying models that you use and whether they support multi-GPU. By default, UMAP and HDBSCAN use no GPU at all so you would have to use cuML instead.
Yes, I have cuML, and I found in the cuML API documentation that there is a multi-node multi-GPU implementation of UMAP: from cuml.manifold import UMAP. Can I pass this as the UMAP model? The example given on the BERTopic website is for a single GPU.
@summer5e55 Sure, if it follows the same class structure as mentioned in the documentation then you should be good to go!
In case it's relevant, I'm cross-linking this comment from another issue regarding the potential for multi-GPU UMAP
Hey, I was wondering about shortening the time it takes for UMAP and HDBSCAN to run inference on a multi-core machine (with GPU).
Current situation
With a trained (fitted) BERTopic model, when running BERTopic.transform() during inference, after the texts have been embedded, UMAP and HDBSCAN work on a single CPU core. Digging through the UMAP and HDBSCAN repos and issues shows nothing significant I can rely on for multiprocessing.
UMAP
Diving into the umap.UMAP.transform source code, I have no idea how to tackle that.
HDBSCAN
https://github.com/MaartenGr/BERTopic/blob/cd98fc8d22ab1eba593c518278ce479d2879c372/bertopic/_bertopic.py#L379
umap_embeddings is an ndarray. Can we split it into N chunks, run each chunk on a single core using multiprocessing, and combine the predictions and probabilities back together?
I will be glad for any help and to hear about your experience in speeding up the inference.
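The chunk-and-combine idea can be sketched as follows. This is a minimal illustration, not something taken from BERTopic or hdbscan: the helper names (split_into_chunks, predict_in_chunks) are hypothetical, and the default thread pool is an assumption that only pays off if the underlying prediction code releases the GIL. Swapping in ProcessPoolExecutor is possible but requires the model and the function to be picklable.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, Tuple


def split_into_chunks(rows: Sequence, n_chunks: int) -> List[Sequence]:
    """Split rows into at most n_chunks contiguous, near-equal slices."""
    if n_chunks < 1:
        raise ValueError("n_chunks must be >= 1")
    size, extra = divmod(len(rows), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        stop = start + size + (1 if i < extra else 0)
        if stop > start:  # skip empty slices when len(rows) < n_chunks
            chunks.append(rows[start:stop])
        start = stop
    return chunks


def predict_in_chunks(
    predict_fn: Callable[[Sequence], Tuple[Sequence, Sequence]],
    rows: Sequence,
    n_chunks: int = 4,
    executor_cls=ThreadPoolExecutor,
) -> Tuple[List, List]:
    """Run predict_fn on each chunk concurrently and stitch the results together.

    predict_fn takes one chunk and returns (labels, probabilities) for it.
    """
    labels: List = []
    probs: List = []
    with executor_cls(max_workers=n_chunks) as pool:
        # map() yields results in the order the chunks were submitted,
        # so the combined output lines up with the input rows.
        for chunk_labels, chunk_probs in pool.map(predict_fn, split_into_chunks(rows, n_chunks)):
            labels.extend(chunk_labels)
            probs.extend(chunk_probs)
    return labels, probs


# Hypothetical usage with a fitted hdbscan model (not executed here):
#   import hdbscan
#   fn = lambda chunk: hdbscan.approximate_predict(hdbscan_model, chunk)
#   predictions, probabilities = predict_in_chunks(fn, umap_embeddings, n_chunks=8)
```

Because approximate_predict scores each new point against the already fitted model, the per-point results should not depend on how the points are grouped, but that is worth verifying on a small sample before relying on it.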