MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License
5.99k stars 752 forks

Implementation of cuML in BERTopic #495

Closed · p-dre closed this issue 1 year ago

p-dre commented 2 years ago

In my experience, UMAP and HDBSCAN are the most computationally intensive parts of BERTopic. However, in their original form, these packages are only partially parallelized and cannot run on a GPU.

However, NVIDIA's RAPIDS cuML library (https://github.com/rapidsai/cuml) includes GPU implementations of both models, which would significantly speed up the computation (see https://developer.nvidia.com/blog/gpu-accelerated-hierarchical-dbscan-with-rapids-cuml-lets-get-back-to-the-future/). Is an implementation conceivable?

MaartenGr commented 2 years ago

There currently is a GPU-accelerated implementation by rapidsai that you can try out. I have yet to try it myself, but from what I have heard there is quite a big speed-up!

beckernick commented 2 years ago

cc @vibhujawa

MaartenGr commented 2 years ago

@p-dre A few days ago, I released BERTopic v0.10.0, which allows you to use different models for HDBSCAN and UMAP. This also allows you to use the GPU-accelerated versions of HDBSCAN and UMAP developed in cuML. After installing cuML, you can run it with BERTopic as follows:

from bertopic import BERTopic
from cuml.cluster import HDBSCAN
from cuml.manifold import UMAP

# Create instances of GPU-accelerated UMAP and HDBSCAN
umap_model = UMAP(n_components=5, min_dist=0.0)
hdbscan_model = HDBSCAN(min_samples=20, gen_min_span_tree=True)

# Pass the above models to be used in BERTopic
topic_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model)
topics, probs = topic_model.fit_transform(docs)

It should speed up BERTopic quite a bit! Also, since you can now replace HDBSCAN and UMAP, you could swap them out for faster algorithms, such as PCA and k-Means. That could hurt the quality of the resulting topics, though, so some experimentation might be necessary.
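
For example, a minimal sketch with scikit-learn models (the parameter values are illustrative, not tuned, and docs is assumed to be your list of documents):

from bertopic import BERTopic
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# PCA for dimensionality reduction, k-Means for clustering
dim_model = PCA(n_components=5)
cluster_model = KMeans(n_clusters=50)

topic_model = BERTopic(umap_model=dim_model, hdbscan_model=cluster_model)
topics, probs = topic_model.fit_transform(docs)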

VibhuJawa commented 2 years ago

@MaartenGr, thanks a lot. It's great to learn that it is now possible to use different models for HDBSCAN and UMAP.

From a benchmarking perspective, we saw the following speedups on an end-to-end BERTopic workflow (check out the full blog post here):

UMAP: 2718 s (CPU) to 98 s (GPU)
HDBSCAN: 382 s (CPU) to 92 s (GPU)

p-dre commented 2 years ago

@MaartenGr Amazing, thank you very much!!!

kuchenrolle commented 2 years ago

@MaartenGr As cuml.cluster.HDBSCAN is not an instance of hdbscan.HDBSCAN, the isinstance checks on lines 388, 1431, and 1548 return False, so the probabilities (hdbscan_model.probabilities_) are ignored even though the cuML implementation does provide them.
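
A quick illustration of the failing check (a sketch, assuming both packages are installed):

import hdbscan
from cuml.cluster import HDBSCAN as cuMLHDBSCAN

# The cuML model fails the isinstance check, so its probabilities_ are ignored
print(isinstance(cuMLHDBSCAN(), hdbscan.HDBSCAN))  # False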

I'm also wondering whether hdbscan.HDBSCAN could be initialized with the result from cuml.cluster.HDBSCAN, so that the HDBSCAN.membership_vector method could be used when BERTopic is called with calculate_probabilities=True?

MaartenGr commented 2 years ago

@kuchenrolle After using the cuml.cluster.HDBSCAN model, you can access the probabilities with topic_model.hdbscan_model.probabilities_. I am not entirely sure, though, whether membership_vector can be used with cuML through the original method.

beckernick commented 2 years ago

As a note, membership_vector and all_points_membership_vectors are on our radar for cuML's HDBSCAN.

Perhaps this might be an opportunity to define something like is_hdbscan_like in the spirit of scikit-learn's is_classifier and is_regressor? We use this pattern in Dask quite a bit for duck-typing based checks to support multiple backends via dispatching. (Perhaps explicit dispatching might be of interest here, too).

For example:

def is_dataframe_like(df) -> bool:
    """Looks like a Pandas DataFrame"""
    if (df.__class__.__module__, df.__class__.__name__) == (
        "pandas.core.frame",
        "DataFrame",
    ):
        # fast exec for most likely input
        return True
    typ = df.__class__
    return (
        all(hasattr(typ, name) for name in ("groupby", "head", "merge", "mean"))
        and all(hasattr(df, name) for name in ("dtypes", "columns"))
        and not any(hasattr(typ, name) for name in ("name", "dtype"))
    )

The AutoML library TPOT did something similar when they added support for cuML and defined _is_selector and _is_transformer. They used this pattern again when they later added _is_resampler to include support for the scikit-learn-contrib project imbalanced-learn.

def _is_selector(estimator):
    selector_attributes = [
        "get_support",
        "transform",
        "inverse_transform",
        "fit_transform",
    ]
    return all(hasattr(estimator, attr) for attr in selector_attributes)
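
In that spirit, a hypothetical is_hdbscan_like for the models discussed in this thread might look like the following sketch (the probed attributes are an assumption, not an agreed contract):

def is_hdbscan_like(estimator) -> bool:
    """Duck-type check: looks like an HDBSCAN model (hdbscan, cuML, ...)."""
    typ = estimator.__class__
    return all(
        hasattr(typ, name) for name in ("fit", "fit_predict")
    ) and all(
        hasattr(estimator, name) for name in ("min_cluster_size", "min_samples")
    )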

I'd be happy to participate in a discussion on this topic if there is interest.

nilsblessing commented 2 years ago

I would also be very interested in the all_points_membership_vectors functionality via cuML HDBSCAN. In some use cases it offers a good way to considerably reduce the number of outlier (-1) documents without significant quality loss. However, with the hdbscan.HDBSCAN implementation and large datasets (several million records), it suffers greatly in terms of efficiency.
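
For reference, the usual CPU-side pattern looks roughly like this (a sketch; umap_embeddings and the parameters are placeholders, and the model must be fit with prediction_data=True):

import numpy as np
import hdbscan

clusterer = hdbscan.HDBSCAN(min_cluster_size=15, prediction_data=True)
clusterer.fit(umap_embeddings)

# Soft cluster memberships for every point, including outliers
soft_memberships = hdbscan.all_points_membership_vectors(clusterer)

# Reassign outliers (-1) to their most probable cluster
labels = clusterer.labels_.copy()
outliers = labels == -1
labels[outliers] = np.argmax(soft_memberships[outliers], axis=1)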

MaartenGr commented 2 years ago

@beckernick Interesting! Haven't seen such a pattern before but it definitely seems like it would fit nicely with the use cases described here.

Assuming the goal is to have a 1:1 mapping of functionality between the original HDBSCAN and cuML's HDBSCAN, a few functions are still missing, such as .membership_vector and, I believe, .approximate_predict, that are necessary to reach feature parity. Would it make sense to first wait until those are developed before creating an is_hdbscan_like function?

beckernick commented 2 years ago

We're a big fan of these duck typing based utilities. I think whether it makes sense to wait depends on the nature of the integration you'd be interested in supporting. We do plan to expand our HDBSCAN support.

At the moment (if folks didn't want to wait), I suspect we could resolve the "missing probabilities" issue noted above with some duck typing or light special casing around here (and the equivalent in the transform codepath):

https://github.com/MaartenGr/BERTopic/blob/407fd4fdf2e05e80019c1c217972bf3314a41040/bertopic/_bertopic.py#L1431-L1437

Having thought a bit more about the duck typing approach: because functions like all_points_membership_vectors, approximate_predict, and membership_vector live in the top-level module namespace, it's challenging to rely on pure duck typing alone rather than including some kind of explicit dispatch/delegation process. Protocol-based dispatch mechanisms are elegant (such as NEP-18 and NEP-35 in NumPy), but I don't think there's clarity on such a protocol in this scenario.

A basic dispatch procedure based on explicitly supported types/backends could be appealing, as it's conceptually quite similar to the Embedder backends you've built already but oriented for hdbscan dispatch rather than embedders. We do something similar in cuML to enable a variety of input and output data types that we've opted to support.

If BERTopic doesn't expect an explosion of many HDBSCAN backends beyond hdbscan and cuML (as the NumPy/SciPy community does and has for different kinds of arrays), the explicit backend approach you've taken for Embedders and the equivalent dispatch approach we took in cuML could work well and be quite lightweight here. Perhaps some kind of dispatching mechanism for module-level functions, vaguely like the following but for approximate_predict, all_points_membership_vectors, and membership_vector in hdbscan/cuML, might be of interest?

import numpy as np

SUPPORTED_FUNCTIONS = {
    "arange",
    "empty",
}

def _has_cupy():  # stand-in for a _has_cuml check
    try:
        import cupy  # noqa: F401
        return True
    except ImportError:
        return False

def delegator(obj, func):
    """Return the backend-appropriate implementation of func for obj."""
    if func not in SUPPORTED_FUNCTIONS:
        raise AttributeError("Unsupported function")

    if isinstance(obj, np.ndarray):
        return getattr(np, func)
    if _has_cupy():
        import cupy
        if isinstance(obj, cupy.ndarray):
            return getattr(cupy, func)
    raise TypeError("Unsupported backend")

import cupy as cp  # assume cupy is available at runtime for some users
delegator(np.array([0, 1]), "arange"), delegator(cp.array([0, 1]), "empty")
(<function numpy.arange>,
 <function cupy._creation.basic.empty(shape, dtype=<class 'float'>, order='C')>)

This would potentially enable something like:

https://github.com/MaartenGr/BERTopic/blob/407fd4fdf2e05e80019c1c217972bf3314a41040/bertopic/_bertopic.py#L388-L389

To become:

if is_supported_hdbscan(self.hdbscan_model):
    predictions, probabilities = approximate_predict_dispatch(self.hdbscan_model, umap_embeddings)

And handle both backends.
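
A minimal sketch of what those two helpers could look like once both backends expose approximate_predict (both names are hypothetical, taken from the snippet above):

def is_supported_hdbscan(model) -> bool:
    """Check whether the model comes from a supported HDBSCAN backend."""
    module = model.__class__.__module__
    return module.startswith("hdbscan") or module.startswith("cuml")

def approximate_predict_dispatch(model, embeddings):
    """Call the approximate_predict that matches the model's backend."""
    if model.__class__.__module__.startswith("cuml"):
        from cuml.cluster import approximate_predict  # GPU backend
    else:
        from hdbscan import approximate_predict  # CPU backend
    return approximate_predict(model, embeddings)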

drob-xx commented 2 years ago

I'm attempting to install RAPIDS on Colab using the RAPIDS notebook (rapids-colab-template). It installs, and then I install BERTopic (pip install bertopic). However, when I run from bertopic import BERTopic, I get:

DistributionNotFound: The 'pynndescent' distribution was not found and is required by the application

The error occurs when BERTopic imports UMAP, even though pynndescent shows as installed (version 0.5.7). Has anyone successfully used RAPIDS with BERTopic on Colab? If so, how are you doing the install?

beckernick commented 2 years ago

cuML and RAPIDS generally follow the NumPy Deprecation Policy and as a result dropped support for Python 3.7 after December 2021.

Colab doesn't support Python 3.8+. This means that RAPIDS libraries on Colab are tied to the 21.12 release. It's possible something in the environment (perhaps cuML but potentially another package) is inconsistent with the pynndescent that pip is trying to install. You can try SageMaker Studio Lab as a Colab replacement, but note that it can take a few tries to get a GPU due to demand. I was able to get a GPU after a few attempts within 3-5 minutes.

If you'd like to try RAPIDS on SageMaker Studio Lab, I recommend using the RAPIDS start page and clicking "Open in Studio Lab", as it provides a getting started notebook.

[Screenshot: RAPIDS start page with the "Open in Studio Lab" button]

I was able to use cuML + BERTopic after creating the following environment at the terminal in Studio Lab:

mamba create -n rapids-22.04 -c rapidsai -c nvidia -c conda-forge rapids=22.04 python=3.9 cudatoolkit=11.4
conda activate rapids-22.04
pip install bertopic
(rapids-22.04) studio-lab-user@default:~$ ipython
Python 3.9.13 | packaged by conda-forge | (main, May 27 2022, 16:56:21) 
Type 'copyright', 'credits' or 'license' for more information
IPython 8.4.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: from bertopic import BERTopic
   ...: from cuml.cluster import HDBSCAN
   ...: from cuml.manifold import UMAP
   ...: from sklearn.datasets import fetch_20newsgroups
   ...: 
   ...: docs = fetch_20newsgroups(subset='all',  remove=('headers', 'footers', 'quotes'))['data']
   ...: 
   ...: # Create instances of GPU-accelerated UMAP and HDBSCAN
   ...: umap_model = UMAP(n_components=5, min_dist=0.0)
   ...: hdbscan_model = HDBSCAN(min_samples=20, gen_min_span_tree=True)
   ...: 
   ...: # Pass the above models to be used in BERTopic
   ...: topic_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model)
   ...: topics, probs = topic_model.fit_transform(docs)
   ...: 
[Download progress bars for the default sentence-transformers model files omitted]
Label prop iterations: 23
Label prop iterations: 6
Label prop iterations: 5
Label prop iterations: 4
Label prop iterations: 2
Iterations: 5
1592,148,632,24,235,1116
Label prop iterations: 2
Iterations: 1
329,45,115,9,44,83
drob-xx commented 2 years ago

Super! Thanks so much for taking the time.

thefonseca commented 2 years ago

> I'm attempting to install RAPIDS on Colab using the RAPIDS notebook (rapids-colab-template). [...] Has anyone successfully used RAPIDS with BERTopic on Colab? If so, how are you doing the install?

I could make RAPIDS work on Colab simply by installing BERTopic before running the rapidsai-csp-utils scripts.

Alternatively, you could patch _bertopic.py and plotting/_topics.py by changing the import from umap import UMAP to from cuml.manifold import UMAP. Not elegant, but it works :)
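
If you'd rather not edit installed files, here is an untested monkey-patch sketch along the same lines (it assumes cuml is importable and must run before BERTopic is imported):

import sys
import cuml.manifold

# Make "from umap import UMAP" resolve to cuML's UMAP instead of umap-learn,
# so the broken pynndescent dependency is never imported
sys.modules["umap"] = cuml.manifold

from bertopic import BERTopic  # now picks up cuML's UMAP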

esettouf commented 2 years ago

Hi, I have a follow-up question. I downloaded and ran the rapidsai-csp-utils scripts after installing BERTopic, but I have issues importing BERTopic because of a cffi version mismatch: BERTopic requires version 1.15.0, while rapidsai requires version 1.15.1. I tried (un)installing the 1.15.0 version but still got an error. Did you encounter similar issues, or do you know how I could fix this?

Exception when importing BERTopic: Exception: Version mismatch: this is the 'cffi' package version 1.15.1, located in '/usr/local/lib/python3.7/dist-packages/cffi/api.py'. When we import the top-level '_cffi_backend' extension module, we get version 1.15.0, located in '/usr/local/lib/python3.7/dist-packages/_cffi_backend.cpython-37m-x86_64-linux-gnu.so'. The two versions should be equal; check your installation.

thefonseca commented 2 years ago

It should work if you run pip uninstall -y cffi followed by pip install cffi. But don't forget to restart the runtime before importing BERTopic.

PeggyFan commented 1 year ago

Hi @MaartenGr,

Is it possible to run merge_topics on the cuML implementation? For one thing, probs is missing from the model when using cuML HDBSCAN, and I got the following error:

AttributeError                            Traceback (most recent call last)
Input In [32], in <cell line: 1>()
----> 1 topics= topic_model._map_predictions(topic_model.hdbscan_model.labels)
      2 probs = hdbscan.all_points_membership_vectors(topic_model.hdbscan_model)
      3 probs = topic_model._map_probabilities(probs, original_topics=True)

File base.pyx:269, in cuml.common.base.Base.__getattr__()

AttributeError: labels

Thank you.

MaartenGr commented 1 year ago

@PeggyFan In BERTopic v0.12, the merge_topics function should work with models other than the default CPU-based HDBSCAN model. The code that you shared seems to be custom code, so I cannot say much about what is happening there.

emarsc commented 1 year ago

The speedup from using cuML for UMAP and HDBSCAN is fantastic! However, I was having an issue predicting new instances: an error was thrown when using the .transform function after instantiating with the cuML HDBSCAN.

This is because the cuML HDBSCAN does not have a predict function, nor is it an instance of hdbscan.HDBSCAN (as pointed out by @beckernick).

Code that causes the issue in .transform: https://github.com/MaartenGr/BERTopic/blob/09c1732997f838050c263ad00ad3c9474e816863/bertopic/_bertopic.py#L427-L437

It seems that an approximate_predict function was recently added to cuml.cluster (see https://github.com/rapidsai/cuml/commit/cb2d681640c734f7a2fa07b3f0f2370d988b4df1), so I was able to hack around this by creating a custom HDBSCAN class as follows:

from cuml.cluster import HDBSCAN, approximate_predict

class GPUHDBSCAN(HDBSCAN):
    def predict(self, umap_embeddings):
        # Delegate to cuML's approximate_predict; BERTopic's .transform only
        # needs the labels here, so the probabilities are discarded
        predictions, probabilities = approximate_predict(self, umap_embeddings)
        return predictions

This gives a predict function and seems to circumvent the issue (as long as you don't need probabilities of the predictions).
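
A hypothetical usage sketch (parameter values are illustrative, and train_docs/new_docs are placeholders; prediction_data=True is assumed to be supported by your cuML version and required for approximate_predict, as with the CPU implementation):

from bertopic import BERTopic
from cuml.manifold import UMAP

umap_model = UMAP(n_components=5, min_dist=0.0)
hdbscan_model = GPUHDBSCAN(min_samples=20, prediction_data=True)

topic_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model)
topics, _ = topic_model.fit_transform(train_docs)  # train_docs: training corpus
new_topics, _ = topic_model.transform(new_docs)    # new_docs: unseen documents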

... Hopefully this helps anyone experiencing the same problem.

ldsands commented 1 year ago

It looks like cuML's latest release implemented both approximate_predict and all_points_membership_vectors. I'm not sure if it is possible yet, but it would be great to see seamless cuML integration in BERTopic!

MaartenGr commented 1 year ago

@ldsands Thank you for mentioning this. I am indeed already exploring this implementation within BERTopic. There are a few other features that I am currently working on, but I'll let you know as soon as a first draft is online!

MaartenGr commented 1 year ago

A few days ago, v0.13 of BERTopic was released. It implements support for cuML's new features and should work nicely. I'll keep this page open for all other updates regarding cuML.
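
For completeness, a sketch of what the v0.13 usage could look like, based on the documentation (docs is assumed to be your corpus):

from bertopic import BERTopic
from cuml.cluster import HDBSCAN
from cuml.manifold import UMAP

# prediction_data=True lets BERTopic use cuML's approximate_predict
# and all_points_membership_vectors under the hood
umap_model = UMAP(n_components=5, n_neighbors=15, min_dist=0.0)
hdbscan_model = HDBSCAN(min_samples=10, gen_min_span_tree=True, prediction_data=True)

topic_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model)
topics, probs = topic_model.fit_transform(docs)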

p-dre commented 1 year ago

@MaartenGr Thank you very much. Do you plan to update the conda version as well? We had problems installing bertopic via pip on an HPC cluster, but it worked well with conda.

MaartenGr commented 1 year ago

@p-dre My apologies, I keep forgetting to update the conda version! I just merged the updated feedstock, so it should be released soon. If it does not work out, please let me know!

MaartenGr commented 1 year ago

Since cuML is now fully supported in BERTopic, I'll close this issue.