MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

Thoughts on multiprocessing UMAP, HDBSCAN for faster inference #381

Closed yotammarton closed 2 years ago

yotammarton commented 2 years ago

Hey, I was wondering about shortening the time it takes for UMAP and HDBSCAN to run inference on a multi-core machine (with a GPU).

Current situation: With a trained (fitted) BERTopic model, running BERTopic.transform() during inference, UMAP and HDBSCAN work on a single CPU core after the texts have been embedded. Digging into the UMAP and HDBSCAN repos and issues shows nothing significant I can rely on for multiprocessing.

UMAP: Diving into the umap.UMAP.transform source code, I have no idea how to tackle that.

HDBSCAN: https://github.com/MaartenGr/BERTopic/blob/cd98fc8d22ab1eba593c518278ce479d2879c372/bertopic/_bertopic.py#L379 umap_embeddings is an ndarray; can we split it into N chunks, run each chunk on a single core using multiprocessing, and combine the predictions and probabilities back together?
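Roughly what I have in mind, as a sketch only: it assumes the fitted HDBSCAN model was built with prediction_data=True and that approximate_predict is what runs under the hood during inference; the helper names and the chunk/worker counts are made up for illustration.

from multiprocessing import Pool

import numpy as np
from hdbscan import approximate_predict

def _predict_chunk(args):
    # Run the standard single-core prediction on one chunk of points.
    clusterer, chunk = args
    return approximate_predict(clusterer, chunk)

def parallel_approximate_predict(clusterer, umap_embeddings, n_chunks=8):
    # Split the reduced embeddings, score each chunk on its own core,
    # and stitch the labels and probabilities back together in order.
    chunks = np.array_split(umap_embeddings, n_chunks)
    with Pool(processes=n_chunks) as pool:
        results = pool.map(_predict_chunk, [(clusterer, c) for c in chunks])
    predictions = np.concatenate([labels for labels, _ in results])
    probabilities = np.concatenate([probs for _, probs in results])
    return predictions, probabilities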

I'd be glad for any help and for your experience in speeding up inference.

MaartenGr commented 2 years ago

With respect to UMAP, I believe there is an n_jobs parameter that defaults to -1, so all cores should be used at least during training. I think this would also affect inference, but I am not entirely sure. Something similar applies to HDBSCAN, where the corresponding parameter defaults to 4.
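For reference, the knobs I mean look roughly like this; note that in the hdbscan package the parallelism parameter is actually named core_dist_n_jobs and mainly affects the core-distance computation during training.

from umap import UMAP
from hdbscan import HDBSCAN

# UMAP exposes n_jobs directly (default -1, i.e. all cores);
# hdbscan's equivalent knob is core_dist_n_jobs (default 4).
umap_model = UMAP(n_jobs=-1)
hdbscan_model = HDBSCAN(core_dist_n_jobs=4)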

Do you see multiple cores being used during training?

yotammarton commented 2 years ago

Indeed, I saw these parameters for both UMAP and HDBSCAN. Unfortunately, HDBSCAN caused training problems with n_jobs > 1 (on large datasets, ±3.5M documents), so I set it to 1. I haven't tried increasing it for inference, as the source code of approximate_predict doesn't appear to use multiprocessing: https://github.com/scikit-learn-contrib/hdbscan/blob/4052692af994610adc9f72486a47c905dd527c94/hdbscan/prediction.py#L396

As for UMAP, I use it with the default n_jobs parameter, which is set to -1. During training I do see multi-core usage, but not during inference.

MaartenGr commented 2 years ago

Reading through some code in both packages, it does indeed seem that HDBSCAN in particular is significantly slower during inference than during training. We can either raise this on the packages' respective issue pages or do some parallelization ourselves until the related packages are updated. I am a big fan of the former rather than the latter, as it fixes the issue at the source rather than creating a very temporary fix (which would need to be removed once one of these packages is updated).

Having said that, we can find a middle ground by parallelizing BERTopic's inference outside of the .transform() step. You could generate the embeddings first and then parallelize .transform(), as it contains roughly three steps:

  1. Embedding documents (Language Model)
  2. Reducing embeddings (UMAP)
  3. Clustering reduced embeddings (HDBSCAN)

However, I am not entirely sure this will work without issues, as we would be parallelizing numba.jit-compiled code, which feels a bit counterintuitive.
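To make that middle ground concrete, here is a rough, untested sketch: it assumes you already have a fitted topic_model and precomputed embeddings, and the helper name, chunk count, and worker count are arbitrary choices rather than part of BERTopic's API.

import numpy as np
from joblib import Parallel, delayed

def parallel_transform(topic_model, docs, embeddings, n_chunks=8, n_jobs=4):
    # Split the documents and their precomputed embeddings into chunks,
    # run .transform() on each chunk in a separate worker, then merge.
    doc_chunks = np.array_split(np.asarray(docs, dtype=object), n_chunks)
    emb_chunks = np.array_split(embeddings, n_chunks)

    results = Parallel(n_jobs=n_jobs)(
        delayed(topic_model.transform)(list(d), e)
        for d, e in zip(doc_chunks, emb_chunks)
    )

    topics = np.concatenate([np.asarray(t) for t, _ in results])
    # Probabilities may be None per chunk depending on the model settings.
    probabilities = [p for _, p in results]
    return topics, probabilities

Here only steps 2 and 3 run inside the workers; step 1 (embedding the documents) is done once up front and the embeddings are passed in.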

MaartenGr commented 2 years ago

With the v0.10 release, we can now use the GPU-accelerated versions of UMAP and HDBSCAN which hopefully should speed things up quite a bit. For now, I'll be closing this issue but if anyone else has ideas for faster inference, please let me know!

ghost commented 2 years ago

Apologies for a possibly dumb question in advance.

How do I enable GPU acceleration for reducing embeddings and clustering?

With version 0.11.0, my GPU is only used in the first stage (embedding documents).

Thanks in advance.

MaartenGr commented 2 years ago

@NateTheGreat001 Good question! GPU acceleration for reducing embeddings and clustering is not enabled by default, as it requires a specific set of dependencies/packages. Namely, you will need to install cuML and use its models in BERTopic as described here.
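In short, it comes down to swapping in cuML's UMAP and HDBSCAN; a sketch along the lines of the documented approach (the exact parameters are up to you):

from bertopic import BERTopic
from cuml.cluster import HDBSCAN
from cuml.manifold import UMAP

# GPU-accelerated drop-ins for the default UMAP/HDBSCAN models
umap_model = UMAP(n_components=5, n_neighbors=15, min_dist=0.0)
hdbscan_model = HDBSCAN(min_samples=10, gen_min_span_tree=True)

topic_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model)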

ghost commented 2 years ago

Thank you for your quick reply. Over the last few days I have gained an awful lot of respect for you and your work.

I spent all of Monday trying to install cuML together with BERTopic on WSL2, with no success.

I am getting the following error:

Batches: 0%| | 0/81800 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/natethegreat/bertopic/bertopic_model_cuml.py", line 14, in <module>
    topics = topic_model.fit_transform(docs)
  File "/home/natethegreat/miniconda3/envs/rapids/lib/python3.9/site-packages/bertopic/_bertopic.py", line 301, in fit_transform
    embeddings = self._extract_embeddings(documents.Document,
  File "/home/natethegreat/miniconda3/envs/rapids/lib/python3.9/site-packages/bertopic/_bertopic.py", line 2035, in _extract_embeddings
    embeddings = self.embedding_model.embed_documents(documents, verbose)
  File "/home/natethegreat/miniconda3/envs/rapids/lib/python3.9/site-packages/bertopic/backend/_base.py", line 69, in embed_documents
    return self.embed(document, verbose)
  File "/home/natethegreat/miniconda3/envs/rapids/lib/python3.9/site-packages/bertopic/backend/_sentencetransformers.py", line 63, in embed
    embeddings = self.embedding_model.encode(documents, show_progress_bar=verbose)
  File "/home/natethegreat/miniconda3/envs/rapids/lib/python3.9/site-packages/sentence_transformers/SentenceTransformer.py", line 165, in encode
    out_features = self.forward(features)
  File "/home/natethegreat/miniconda3/envs/rapids/lib/python3.9/site-packages/torch/nn/modules/container.py", line 139, in forward
    input = module(input)
  File "/home/natethegreat/miniconda3/envs/rapids/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/natethegreat/miniconda3/envs/rapids/lib/python3.9/site-packages/sentence_transformers/models/Transformer.py", line 66, in forward
    output_states = self.auto_model(**trans_features, return_dict=False)
  File "/home/natethegreat/miniconda3/envs/rapids/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/natethegreat/miniconda3/envs/rapids/lib/python3.9/site-packages/transformers/models/bert/modeling_bert.py", line 991, in forward
    extended_attention_mask: torch.Tensor = self.get_extended_attention_mask(attention_mask, input_shape)
  File "/home/natethegreat/miniconda3/envs/rapids/lib/python3.9/site-packages/transformers/modeling_utils.py", line 839, in get_extended_attention_mask
    extended_attention_mask = extended_attention_mask.to(dtype=self.dtype)  # fp16 compatibility
RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

my_attempt.txt

The entire history of my actions is attached in my_attempt.txt

The summary is as follows:

  1. Install RAPIDS with conda create -n rapids -c rapidsai -c nvidia -c conda-forge rapids=22.06 python=3.9 cudatoolkit=11.5. That is the only way to install cuML that worked for me; a plain conda install -c rapidsai cuml gives an error.

  2. Install hdbscan with conda install -c conda-forge hdbscan. I do this because pip install bertopic "fails to build wheels for hdbscan" every time.

  3. Install bertopic with pip install bertopic. Note that, regardless of the OS or whether cuML is used, any attempt to use BERTopic at this stage results in the above-mentioned error, likely because pip install bertopic brings in only torch and torchvision and no compatible cudatoolkit. To successfully use BERTopic with GPU-accelerated embeddings, the next step is to uninstall the torch and torchvision included with the original installation of bertopic and do a fresh installation of torch with a compatible CUDA toolkit.

  4. The problem, as I understand it, is that cuML and torch use different CUDA versions without overlap. So, to make torch work, I install for example cudatoolkit=11.6; this version, like any other version used by torch, is incompatible with cuML.

I hope you can make some sense out of this mess and hopefully point me in the right direction.

Thank you in advance

Cheers, Nate

beckernick commented 2 years ago

In your provided output, PyTorch is throwing an error saying that you're using an unsupported setup (for which there could be several potential reasons). Assuming you're using a fairly recent GPU and a recent driver, the most likely reason is that PyTorch requires specific versions of the CUDA toolkit and you're selecting one for which it doesn't provide a compatible binary (11.5), as it isn't compiled with CUDA Compatibility in at least some versions. cuML is built with CUDA Compatibility, so you should select the minor version of cudatoolkit consistent with PyTorch and let cuML "just work" (as long as it's the same major version). It sounds like you tried something similar and ran into issues, but it might be worth trying again with a fresh environment similar to the example below (installing and removing things in the same environment can get messy).

Within this environment, when you pip install bertopic, pip needs to bring in all the dependencies, including ones that may need to be compiled (such as hdbscan). As pip is not a wider system package manager (in contrast to conda), it will throw an error if necessary parts of the dependency chain are missing (such as things you might need to successfully compile a package). Using the conda packages often lets you avoid this by getting a pre-compiled binary. If you can't install the necessary packages, you can try the conda package for bertopic.
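For example, assuming the conda-forge channel is available in your setup, the pre-built package can be installed with:

conda install -c conda-forge bertopic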

On my Linux machine, the following example works without issue (though I haven't tested it on WSL). Note that I pinned pytorch to 1.11 to explicitly avoid this issue.

mamba create -n torchrapids -c rapidsai-nightly -c nvidia -c conda-forge -c pytorch cuml=22.08 python=3.9 cudatoolkit=11.3 pytorch=1.11 torchvision torchaudio bertopic

conda activate torchrapids

(torchrapids) nicholasb@nicholasb-HP-Z8-G4-Workstation:~$ python
Python 3.9.13 | packaged by conda-forge | (main, May 27 2022, 16:56:21) 
[GCC 10.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from bertopic import BERTopic
>>> from sklearn.datasets import fetch_20newsgroups
>>> from cuml.cluster import HDBSCAN
>>> from cuml.manifold import UMAP
>>> import torch
>>> 
>>> print(torch.cuda.is_available())
True
>>> docs = fetch_20newsgroups(subset='all',  remove=('headers', 'footers', 'quotes'))['data']
>>> 
>>> # Create instances of GPU-accelerated UMAP and HDBSCAN
>>> umap_model = UMAP(n_components=5, n_neighbors=15, min_dist=0.0)
>>> hdbscan_model = HDBSCAN(min_samples=10, gen_min_span_tree=True)
>>> 
>>> # Pass the above models to be used in BERTopic
>>> topic_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model)
>>> topics, probs = topic_model.fit_transform(docs)
>>> 

Let me know if the above helps provide some direction.

ghost commented 2 years ago

That worked!! Thanks a lot! You guys are the best!

It would probably make sense to include this installation line in the docs.

ghost commented 2 years ago

Spoke much too soon.

I am trying to fit a model on a relatively large sample (approx. 2.5M documents).

Here is the code:

from bertopic import BERTopic
from cuml.cluster import HDBSCAN
from cuml.manifold import UMAP
import pickle

docs = pickle.load(open("docs.pkl", "rb"))  # <--- 330 MB, approx. 2.5M samples

umap_model = UMAP(n_components=5, n_neighbors=15, min_dist=0.0)
hdbscan_model = HDBSCAN(min_samples=10, gen_min_span_tree=True)

topic_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model, verbose=True, calculate_probabilities=False, low_memory=True)

topics = topic_model.fit_transform(docs)

pickle.dump(topic_model, open("bert_model_2.0.pkl", "wb"))

It worked fine without cuML acceleration, but now it produces this error:

2022-07-26 11:34:16,106 - BERTopic - Transformed documents to Embeddings
Traceback (most recent call last):
  File "/home/natethegreat/bertopic/bertopic_model_cuml.py", line 14, in <module>
    topics = topic_model.fit_transform(docs)
  File "/home/natethegreat/miniconda3/envs/torchrapids/lib/python3.9/site-packages/bertopic/_bertopic.py", line 313, in fit_transform
    umap_embeddings = self._reduce_dimensionality(embeddings, y)
  File "/home/natethegreat/miniconda3/envs/torchrapids/lib/python3.9/site-packages/bertopic/_bertopic.py", line 2070, in _reduce_dimensionality
    umap_embeddings = self.umap_model.transform(embeddings)
  File "/home/natethegreat/miniconda3/envs/torchrapids/lib/python3.9/site-packages/cuml/internals/api_decorators.py", line 586, in inner_get
    ret_val = func(*args, **kwargs)
  File "cuml/manifold/umap.pyx", line 674, in cuml.manifold.umap.UMAP.transform
  File "/home/natethegreat/miniconda3/envs/torchrapids/lib/python3.9/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/home/natethegreat/miniconda3/envs/torchrapids/lib/python3.9/site-packages/cuml/internals/api_decorators.py", line 360, in inner
    return func(*args, **kwargs)
  File "/home/natethegreat/miniconda3/envs/torchrapids/lib/python3.9/site-packages/cuml/common/input_utils.py", line 367, in input_to_cuml_array
    X = cp.array(X)
  File "/home/natethegreat/miniconda3/envs/torchrapids/lib/python3.9/site-packages/cupy/_creation/from_data.py", line 46, in array
    return _core.array(obj, dtype, copy, order, subok, ndmin)
  File "cupy/_core/core.pyx", line 2266, in cupy._core.core.array
  File "cupy/_core/core.pyx", line 2290, in cupy._core.core.array
  File "cupy/_core/core.pyx", line 2422, in cupy._core.core._array_default
  File "cupy/_core/core.pyx", line 171, in cupy._core.core.ndarray.__init__
  File "cupy/cuda/memory.pyx", line 698, in cupy.cuda.memory.alloc
  File "/home/natethegreat/miniconda3/envs/torchrapids/lib/python3.9/site-packages/rmm/rmm.py", line 232, in rmm_cupy_allocator
    buf = librmm.device_buffer.DeviceBuffer(size=nbytes, stream=stream)
  File "device_buffer.pyx", line 88, in rmm._lib.device_buffer.DeviceBuffer.__cinit__
MemoryError: std::bad_alloc: out_of_memory: CUDA error at: /home/natethegreat/miniconda3/envs/torchrapids/include/rmm/mr/device/cuda_memory_resource.hpp
Segmentation fault
(torchrapids) natethegreat@MAYBACH:~/bertopic$

My understanding is that my GPU memory is the problem: I have just 8 GB of GPU memory vs 128 GB of regular RAM.

Would it solve my problem if I split the large sample into smaller ones and ran fit_transform on the smaller samples in a loop?

Adding vectorizer_model = CountVectorizer(ngram_range=(1, 2), stop_words="english", min_df=10000) did not help.

summer5e55 commented 11 months ago

I wonder if it is possible to run fit_transform on multiple GPUs.

MaartenGr commented 11 months ago

@summer5e55 That will depend on the underlying models that you use and whether they support multi-GPU. By default, UMAP and HDBSCAN use no GPU at all, so you would have to use cuML instead.

summer5e55 commented 11 months ago

Yes, I have cuML and found in the cuML API documentation that there is a multi-node, multi-GPU implementation of UMAP (from cuml.manifold import UMAP). Can I pass this as the umap_model? The example given on the BERTopic website is for a single GPU.

MaartenGr commented 11 months ago

@summer5e55 Sure, if it follows the same class structure as mentioned in the documentation then you should be good to go!
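For reference, "the same class structure" essentially means an object exposing fit and transform methods; a minimal illustrative stand-in (the class itself is made up here purely to show the interface):

import numpy as np

class MyReducer:
    # Any dimensionality-reduction model with this interface can be
    # passed as BERTopic(umap_model=...).
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # Replace this no-op with the actual reduction step.
        return np.asarray(X)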

beckernick commented 11 months ago

In case it's relevant, I'm cross-linking this comment from another issue regarding the potential for multi-GPU UMAP.