Closed p-dre closed 1 year ago
There is currently a GPU-accelerated implementation by rapidsai that you can find here and try out. I have not tried it myself yet, but from what I have heard the speed-up is quite big!
cc @vibhujawa
@p-dre A few days ago, I released BERTopic v0.10.0 which allows you to use different models for HDBSCAN and UMAP. This also allows you to use the GPU-accelerated version of HDBSCAN and UMAP developed by cuML. After installing cuML, you can run it with BERTopic as follows:
from bertopic import BERTopic
from cuml.cluster import HDBSCAN
from cuml.manifold import UMAP
# Create instances of GPU-accelerated UMAP and HDBSCAN
umap_model = UMAP(n_components=5, min_dist=0.0)
hdbscan_model = HDBSCAN(min_samples=20, gen_min_span_tree=True)
# Pass the above models to be used in BERTopic
topic_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model)
topics, probs = topic_model.fit_transform(docs)
It should speed up BERTopic quite a bit! Also, since you can now replace HDBSCAN and UMAP, you could swap them for other algorithms, such as PCA and k-Means, which might be a bit faster. That could hurt the quality of the resulting topics though, so some experimentation might be necessary.
@MaartenGr, thanks a lot. It's great to learn that it is now possible to use different models for HDBSCAN and UMAP.
From a benchmark perspective, we saw the following speedups on an end-to-end BERTopic workflow (check out the full blog here):
UMAP: 2718 s (CPU) to 98 s (GPU)
HDBSCAN: 382 s (CPU) to 92 s (GPU)
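To put those timings in perspective, the speedup factors can be worked out directly. This is only arithmetic on the numbers quoted above; it assumes the HDBSCAN CPU figure is in seconds like the others, and the "combined" figure covers only these two steps, not the full pipeline.

```python
# Timings quoted in the comment above, in seconds.
# Assumption: the HDBSCAN CPU figure (382.00) is also in seconds.
timings = {
    "UMAP": {"cpu": 2718.0, "gpu": 98.0},
    "HDBSCAN": {"cpu": 382.0, "gpu": 92.0},
}


def speedup(cpu_seconds, gpu_seconds):
    """Return how many times faster the GPU run was."""
    return cpu_seconds / gpu_seconds


per_step = {name: speedup(t["cpu"], t["gpu"]) for name, t in timings.items()}

# Combined speedup for the two steps taken together
total_cpu = sum(t["cpu"] for t in timings.values())
total_gpu = sum(t["gpu"] for t in timings.values())
combined = speedup(total_cpu, total_gpu)

for name, factor in per_step.items():
    print(f"{name}: {factor:.1f}x")
print(f"UMAP + HDBSCAN combined: {combined:.1f}x")
```

That works out to roughly 27.7x for UMAP, 4.2x for HDBSCAN, and 16.3x for the two steps combined.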
@MaartenGr amazing, Thank you very much!!!
@MaartenGr As cuml.cluster.HDBSCAN is not an instance of hdbscan.HDBSCAN, the isinstance checks in lines 388, 1431 and 1548 return False, resulting in the probabilities (hdbscan_model.probabilities_) being ignored, although the cuml implementation does provide them.
I'm also wondering whether hdbscan.HDBSCAN could be initialized with the result from cuml.cluster.HDBSCAN, so that the HDBSCAN.membership_vector method could be used when BERTopic is called with calculate_probabilities=True?
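One way to sidestep the isinstance problem described above is to look for the fitted attribute rather than the class. The following is a minimal sketch, not BERTopic's actual code: get_cluster_probabilities and the two stand-in classes are hypothetical names used only for illustration.

```python
def get_cluster_probabilities(cluster_model):
    """Return the fitted model's probabilities_ if it exposes them, else None.

    Checks for the attribute instead of isinstance(model, hdbscan.HDBSCAN),
    so any backend that provides probabilities_ is handled the same way.
    """
    return getattr(cluster_model, "probabilities_", None)


class FakeGpuHDBSCAN:
    """Stand-in for a fitted GPU HDBSCAN model (not the real cuML class)."""

    def __init__(self):
        self.labels_ = [0, 0, -1]
        self.probabilities_ = [0.9, 0.8, 0.0]


class FakeKMeans:
    """Stand-in for a fitted model that has labels but no probabilities."""

    def __init__(self):
        self.labels_ = [0, 1, 1]


print(get_cluster_probabilities(FakeGpuHDBSCAN()))  # probabilities are picked up
print(get_cluster_probabilities(FakeKMeans()))      # falls back to None
```

The design choice here is to treat "has probabilities_" as the contract, which also covers clustering models that are not HDBSCAN at all.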
@kuchenrolle After using the cuml.cluster.HDBSCAN model, you can access the probabilities with topic_model.hdbscan_model.probabilities_. I am not entirely sure though whether we can use the membership_vector in cuml through the original method.
As a note, membership_vector and all_points_membership_vectors are on our radar for cuML's HDBSCAN.
Perhaps this might be an opportunity to define something like is_hdbscan_like in the spirit of scikit-learn's is_classifier and is_regressor? We use this pattern in Dask quite a bit for duck-typing based checks to support multiple backends via dispatching. (Perhaps explicit dispatching might be of interest here, too).
For example:
def is_dataframe_like(df) -> bool:
    """Looks like a Pandas DataFrame"""
    if (df.__class__.__module__, df.__class__.__name__) == (
        "pandas.core.frame",
        "DataFrame",
    ):
        # fast exec for most likely input
        return True
    typ = df.__class__
    return (
        all(hasattr(typ, name) for name in ("groupby", "head", "merge", "mean"))
        and all(hasattr(df, name) for name in ("dtypes", "columns"))
        and not any(hasattr(typ, name) for name in ("name", "dtype"))
    )
The AutoML library TPOT did something similar when they added support for cuML and defined _is_selector and _is_transformer. They used this pattern again when they later added _is_resampler to include support for the scikit-learn-contrib project imbalanced-learn.
def _is_selector(estimator):
    selector_attributes = [
        "get_support",
        "transform",
        "inverse_transform",
        "fit_transform",
    ]
    return all(hasattr(estimator, attr) for attr in selector_attributes)
I'd be happy to participate in a discussion on this topic if there is interest.
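Applied to the case in this thread, the is_hdbscan_like idea suggested above might look roughly like the sketch below. The attribute lists are an illustrative assumption about what an HDBSCAN-style estimator exposes, not an agreed protocol, and the stand-in classes exist only to exercise the check.

```python
def is_hdbscan_like(estimator) -> bool:
    """Duck-typed check for an HDBSCAN-style estimator.

    The required names below are an assumption made for illustration.
    """
    required_methods = ("fit", "fit_predict")
    required_params = ("min_cluster_size", "min_samples")
    has_methods = all(
        callable(getattr(estimator, name, None)) for name in required_methods
    )
    has_params = all(hasattr(estimator, name) for name in required_params)
    return has_methods and has_params


class FakeHDBSCAN:
    """Stand-in with the expected surface; not a real clustering model."""

    def __init__(self, min_cluster_size=5, min_samples=None):
        self.min_cluster_size = min_cluster_size
        self.min_samples = min_samples

    def fit(self, X):
        return self

    def fit_predict(self, X):
        return [0 for _ in X]


class FakeKMeans:
    """Stand-in for a clusterer without HDBSCAN's parameters."""

    def __init__(self, n_clusters=8):
        self.n_clusters = n_clusters

    def fit(self, X):
        return self

    def fit_predict(self, X):
        return [0 for _ in X]


print(is_hdbscan_like(FakeHDBSCAN()))
print(is_hdbscan_like(FakeKMeans()))
```

As with the Dask and TPOT helpers, the check never imports either backend, so it works whether or not cuML is installed.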
I would also be very interested in the "all_points_membership_vectors" functionality via cuML HDBSCAN. In some use cases this offers a good way to reduce the -1 clusters considerably without significant quality loss. However, with the use of the hdbscan.HDBSCAN implementation and large datasets (several millions of records) it suffers greatly in terms of efficiency.
@beckernick Interesting! Haven't seen such a pattern before but it definitely seems like it would fit nicely with the use cases described here.
Assuming the goal is to have a 1:1 mapping of functionality between the original HDBSCAN and cuML HDBSCAN, a few functions are missing, like .membership_vector and I believe .approximate_predict, that are necessary to reach the same functionality. Would it make sense to first wait until those are developed before creating an is_hdbscan_like function?
We're big fans of these duck typing based utilities. I think whether it makes sense to wait depends on the nature of the integration you'd be interested in supporting. We do plan to expand our HDBSCAN support.
At the moment (if folks didn't want to wait), I suspect we could resolve the "missing probabilities" issue noted above with some duck typing or light special casing around here (and the equivalent in the transform codepath):
Having thought a bit more about the duck typing approach, because functions like all_points_membership_vectors, approximate_predict, and membership_vector are in the top-level module namespace, it's more challenging to rely on pure duck typing alone instead of including some kind of explicit dispatch/delegation process. Protocol-based dispatch mechanisms are elegant (such as NEP-18 and NEP-35 in NumPy), but I don't think there's clarity on such a protocol in this scenario.
A basic dispatch procedure based on explicitly supported types/backends could be appealing, as it's conceptually quite similar to the Embedder backends you've built already but oriented for hdbscan dispatch rather than embedders. We do something similar in cuML to enable a variety of input and output data types that we've opted to support.
If BERTopic doesn't expect an explosion of many HDBSCAN backends beyond hdbscan and cuML (like the NumPy/SciPy community does and has for different kinds of arrays), the explicit backend approach you've done for Embedders and the equivalent dispatch approach we took in cuML could work well and be quite lightweight here. Perhaps some kind of dispatching mechanism for module-level functions vaguely like the following might be of interest (but for approximate_predict, all_points_membership_vectors, and membership_vector in hdbscan/cuml)?
import numpy as np

SUPPORTED_FUNCTIONS = {
    "arange",
    "empty",
}

def _has_cupy():  # has_cuml
    try:
        import cupy  # noqa: F401
        return True
    except ImportError:
        return False

def delegator(obj, func):
    # Return the implementation of `func` from the module that owns `obj`
    if func not in SUPPORTED_FUNCTIONS:
        raise AttributeError("Unsupported function")
    if isinstance(obj, np.ndarray):
        return getattr(np, func)
    if _has_cupy():
        import cupy
        if isinstance(obj, cupy.ndarray):
            return getattr(cupy, func)
    # Neither a NumPy array nor (when CuPy is installed) a CuPy array
    raise TypeError("Unsupported backend")

import cupy as cp  # assume cupy is available at runtime for some users
delegator(np.array([0,1]), "arange"), delegator(cp.array([0,1]), "empty")
(<function numpy.arange>,
 <function cupy._creation.basic.empty(shape, dtype=<class 'float'>, order='C')>)
This would potentially enable something like:
To become:
if is_supported_hdbscan(self.hdbscan_model):
    predictions, probabilities = approximate_predict_dispatch(self.hdbscan_model, umap_embeddings)
And handle both backends.
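For concreteness, the two helper names used above could be sketched along these lines. This is a hypothetical illustration rather than code from BERTopic or cuML: BACKEND_MODULES, backend_module_name and delegate are made-up names, and it assumes each backend exports the supported functions from the listed module, which may depend on the installed version.

```python
import importlib

# Top-level package of the model's class -> module assumed to export the functions
BACKEND_MODULES = {
    "hdbscan": "hdbscan",
    "cuml": "cuml.cluster",
}

SUPPORTED_FUNCTIONS = {
    "approximate_predict",
    "all_points_membership_vectors",
    "membership_vector",
}


def _top_level_package(model):
    """Name of the package that defines the model's class, e.g. 'cuml'."""
    return type(model).__module__.split(".")[0]


def is_supported_hdbscan(model):
    return _top_level_package(model) in BACKEND_MODULES


def backend_module_name(model):
    package = _top_level_package(model)
    if package not in BACKEND_MODULES:
        raise TypeError(f"Unsupported HDBSCAN backend: {package!r}")
    return BACKEND_MODULES[package]


def delegate(model, func_name):
    """Look up a module-level function in the backend that owns the model."""
    if func_name not in SUPPORTED_FUNCTIONS:
        raise AttributeError(f"Unsupported function: {func_name!r}")
    # The backend is imported lazily, so neither library is a hard dependency
    module = importlib.import_module(backend_module_name(model))
    return getattr(module, func_name)


def approximate_predict_dispatch(model, embeddings):
    return delegate(model, "approximate_predict")(model, embeddings)
```

Keying on the class's defining package avoids importing cuML just to run an isinstance check, at the cost of not recognizing subclasses defined in user code.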
I'm attempting to install RAPIDS on Colab using the RAPIDS notebook (rapids-colab-template). It installs and then I install BERTopic (pip install bertopic). However, when I run "from bertopic import BERTopic" I get:
DistributionNotFound: The 'pynndescent' distribution was not found and is required by the application
This happens when BERTopic imports UMAP. pynndescent shows as being installed (ver 0.5.7). Has anyone successfully used RAPIDS with BERTopic on Colab? If so, how are you doing the install?
cuML and RAPIDS generally follow the NumPy Deprecation Policy and as a result dropped support for Python 3.7 after December 2021.
Colab doesn't support Python 3.8+. This means that RAPIDS libraries on Colab are tied to the 21.12 release. It's possible something in the environment (perhaps cuML but potentially another package) is inconsistent with the pynndescent that pip is trying to install. You can try SageMaker Studio Lab as a Colab replacement, but note that it can take a few tries to get a GPU due to demand. I was able to get a GPU after a few attempts within 3-5 minutes.
If you'd like to try RAPIDS on SageMaker Studio Lab, I recommend using the RAPIDS start page and clicking "Open in Studio Lab", as it provides a getting started notebook.
I was able to use cuML + BERTopic after creating the following environment at the terminal in Studio Lab:
mamba create -n rapids-22.04 -c rapidsai -c nvidia -c conda-forge rapids=22.04 python=3.9 cudatoolkit=11.4
conda activate rapids-22.04
pip install bertopic
(rapids-22.04) studio-lab-user@default:~$ ipython
Python 3.9.13 | packaged by conda-forge | (main, May 27 2022, 16:56:21)
Type 'copyright', 'credits' or 'license' for more information
IPython 8.4.0 -- An enhanced Interactive Python. Type '?' for help.
In [1]: from bertopic import BERTopic
...: from cuml.cluster import HDBSCAN
...: from cuml.manifold import UMAP
...: from sklearn.datasets import fetch_20newsgroups
...:
...: docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']
...:
...: # Create instances of GPU-accelerated UMAP and HDBSCAN
...: umap_model = UMAP(n_components=5, min_dist=0.0)
...: hdbscan_model = HDBSCAN(min_samples=20, gen_min_span_tree=True)
...:
...: # Pass the above models to be used in BERTopic
...: topic_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model)
...: topics, probs = topic_model.fit_transform(docs)
...:
Downloading: 100%|████████████████████████████████████████████████████████████████████| 1.18k/1.18k [00:00<00:00, 1.31MB/s]
Downloading: 100%|█████████████████████████████████████████████████████████████████████████| 190/190 [00:00<00:00, 211kB/s]
Downloading: 100%|████████████████████████████████████████████████████████████████████| 10.2k/10.2k [00:00<00:00, 9.32MB/s]
Downloading: 100%|█████████████████████████████████████████████████████████████████████████| 612/612 [00:00<00:00, 667kB/s]
Downloading: 100%|█████████████████████████████████████████████████████████████████████████| 116/116 [00:00<00:00, 114kB/s]
Downloading: 100%|████████████████████████████████████████████████████████████████████| 39.3k/39.3k [00:00<00:00, 3.49MB/s]
Downloading: 100%|█████████████████████████████████████████████████████████████████████████| 349/349 [00:00<00:00, 396kB/s]
Downloading: 100%|████████████████████████████████████████████████████████████████████| 90.9M/90.9M [00:01<00:00, 85.7MB/s]
Downloading: 100%|██████████████████████████████████████████████████████████████████████| 53.0/53.0 [00:00<00:00, 59.4kB/s]
Downloading: 100%|█████████████████████████████████████████████████████████████████████████| 112/112 [00:00<00:00, 164kB/s]
Downloading: 100%|██████████████████████████████████████████████████████████████████████| 466k/466k [00:00<00:00, 8.28MB/s]
Downloading: 100%|█████████████████████████████████████████████████████████████████████████| 350/350 [00:00<00:00, 343kB/s]
Downloading: 100%|████████████████████████████████████████████████████████████████████| 13.2k/13.2k [00:00<00:00, 13.8MB/s]
Downloading: 100%|██████████████████████████████████████████████████████████████████████| 232k/232k [00:00<00:00, 5.07MB/s]
Label prop iterations: 23
Label prop iterations: 6
Label prop iterations: 5
Label prop iterations: 4
Label prop iterations: 2
Iterations: 5
1592,148,632,24,235,1116
Label prop iterations: 2
Iterations: 1
329,45,115,9,44,83
Super! Thanks so much for taking the time.
I could make RAPIDS work on Colab simply by installing BERTopic before running the rapidsai-csp-utils scripts.
Alternatively, you could patch _bertopic.py and plotting/_topics.py by changing the import "from umap import UMAP" to "from cuml.manifold import UMAP". Not elegant but it works :)
Hi, I have a follow-up question. I downloaded and ran the rapidsai-csp-utils scripts after installing BERTopic, but I have issues importing BERTopic because of a cffi version mismatch. BERTopic requires version 1.15.0 but rapidsai requires version 1.15.1. I tried (un-)installing the 1.15.0 version but I still got an error. Did you encounter similar issues, or do you know how I could fix this?
Exception when importing BERTopic: Exception: Version mismatch: this is the 'cffi' package version 1.15.1, located in '/usr/local/lib/python3.7/dist-packages/cffi/api.py'. When we import the top-level '_cffi_backend' extension module, we get version 1.15.0, located in '/usr/local/lib/python3.7/dist-packages/_cffi_backend.cpython-37m-x86_64-linux-gnu.so'. The two versions should be equal; check your installation.
It should work if you run pip uninstall -y cffi followed by pip install cffi. But don't forget to restart the runtime before importing BERTopic.
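When chasing this kind of environment mismatch, it can help to print what the package metadata reports for the distributions involved. The sketch below uses only the standard library and assumes Python 3.8+ (importlib.metadata is not in the 3.7 standard library); installed_versions is a hypothetical helper name and the package list is just an example.

```python
from importlib import metadata


def installed_versions(package_names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in package_names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Example list of distributions mentioned in this thread
for name, version in installed_versions(
    ["cffi", "pynndescent", "umap-learn", "bertopic"]
).items():
    print(f"{name}: {version or 'not installed'}")
```

Note that this only reflects the installed metadata. A compiled extension already loaded into a running session can still be stale, which is why the runtime restart matters.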
Hi @MaartenGr,
Is it possible to run merge_topics on the cuML implementation?
For one thing, probs is missing from the model using cuML HDBSCAN, and I got the following error:
AttributeError Traceback (most recent call last)
Input In [32], in <cell line: 1>()
----> 1 topics= topic_model._map_predictions(topic_model.hdbscan_model.labels)
2 probs = hdbscan.all_points_membership_vectors(topic_model.hdbscan_model)
3 probs = topic_model._map_probabilities(probs, original_topics=True)
File base.pyx:269, in cuml.common.base.Base.__getattr__()
AttributeError: labels
Thank you.
@PeggyFan In BERTopic v0.12 the merge_topics function should work with other models besides the default CPU-based HDBSCAN model. The code that you shared seems to be custom code, so I cannot say much about what is happening there.
The speedup from using cuML for umap and hdbscan is fantastic! However, I was having an issue predicting new instances. An error was thrown when using the .transform function after instantiating with the cuML hdbscan.
This is because the cuML hdbscan does not have a 'predict' function nor is it an instance of hdbscan.HDBSCAN (as pointed out by @beckernick).
Code that causes the issue in .transform: https://github.com/MaartenGr/BERTopic/blob/09c1732997f838050c263ad00ad3c9474e816863/bertopic/_bertopic.py#L427-L437
It seems that an approximate_predict function was recently added to cuml.cluster. https://github.com/rapidsai/cuml/commit/cb2d681640c734f7a2fa07b3f0f2370d988b4df1. So, I was able to hack around this by creating a custom HDBSCAN class as follows:
from cuml.cluster import HDBSCAN, approximate_predict

class GPUHDBSCAN(HDBSCAN):
    def predict(self, umap_embeddings):
        predictions, probabilities = approximate_predict(self, umap_embeddings)
        return predictions
This gives a predict function and seems to circumvent the issue (as long as you don't need probabilities of the predictions).
... Hopefully this helps anyone experiencing the same problem.
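If the probabilities are needed as well, one option is to keep them around after each predict call. The following is a rough sketch of that idea using a wrapper instead of a subclass; PredictAdapter and last_probabilities_ are hypothetical names, the prediction function is passed in so the example runs without a GPU, and it has not been tested against BERTopic itself.

```python
class PredictAdapter:
    """Wrap a fitted clustering model and give it a predict method.

    approximate_predict is any callable taking (model, embeddings) and
    returning (labels, probabilities), such as the one imported above.
    """

    def __init__(self, model, approximate_predict):
        self._model = model
        self._approximate_predict = approximate_predict
        self.last_probabilities_ = None

    def __getattr__(self, name):
        # Only reached when normal lookup fails; forward public names
        # (fit, labels_, probabilities_, ...) to the wrapped model.
        if name.startswith("_"):
            raise AttributeError(name)
        return getattr(self._model, name)

    def predict(self, embeddings):
        labels, probabilities = self._approximate_predict(self._model, embeddings)
        # Keep the probabilities instead of discarding them
        self.last_probabilities_ = probabilities
        return labels


class FakeModel:
    """Stand-in for a fitted model; not a real HDBSCAN implementation."""

    labels_ = [0, 1]


def fake_approximate_predict(model, embeddings):
    """Stand-in that assigns every point to cluster 0 with probability 0.5."""
    return [0] * len(embeddings), [0.5] * len(embeddings)


adapter = PredictAdapter(FakeModel(), fake_approximate_predict)
print(adapter.predict([[0.0], [1.0]]))
print(adapter.last_probabilities_)
```

With a real model, the second argument would be the backend's own approximate_predict function rather than the stand-in used here.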
It looks like cuML's latest release implemented both approximate_predict and all_points_membership_vectors. I'm not sure if it is possible yet but it would be great to see seamless cuML integration into BERTopic!
@ldsands Thank you for mentioning this. I am indeed already exploring this implementation within BERTopic. There are a few other features that I am currently working on, but I'll let you know as soon as a first draft is online!
A few days ago, v0.13 of BERTopic was released. It adds support for cuML's new features and should work nicely. I'll keep this page open for all other updates regarding cuML.
@MaartenGr Thank you very much. Do you plan to update the conda version as well? We had problems installing bertopic with pip on an HPC cluster, but it worked well with conda.
@p-dre My apologies, I keep forgetting to update the conda version! I just merged the updated feedstock, so it should be released soon. If it does not work out, please let me know!
Since cuML is now fully supported in BERTopic, I'll close this issue.
In my experience, UMAP and HDBSCAN are the most computationally intensive parts of BERTopic. However, in their original form, the packages are only partially parallelized and cannot run on a GPU.
The NVIDIA RAPIDS cuML library (https://github.com/rapidsai/cuml), however, includes GPU implementations of both models, which would significantly speed up the computation: https://developer.nvidia.com/blog/gpu-accelerated-hierarchical-dbscan-with-rapids-cuml-lets-get-back-to-the-future/ Would such an integration be conceivable?