MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

Ways to scale topic prediction to billions of unseen documents #577

Open cedivad opened 2 years ago

cedivad commented 2 years ago

Hello! I'm mostly just looking for a sanity check :-)

I would like to train BERTopic over a small subset (1M) of the documents I have and then have it predict on the entire dataset (1B). I already have the embeddings, so I'm just looking for ways to turn them into topics as fast as possible.

1. The cuML implementation is still missing an `approximate_predict()` function, so we can't use its HDBSCAN model for inference.
2. We can, however, use cuML's UMAP along with the CPU-based (scikit-learn-contrib) HDBSCAN.

With this mixed approach (cuML UMAP + CPU HDBSCAN) I can train the model on 600k documents in 13 minutes (~770 docs/s) and then infer on the same batch in 6 minutes (~1,600 docs/s).
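For reference, here is a minimal sketch of that mixed setup (parameter values are illustrative; `docs` is assumed to be the list of documents and `embeddings` the precomputed vectors):

```python
from bertopic import BERTopic
from cuml.manifold import UMAP as cuUMAP  # GPU-accelerated UMAP
from hdbscan import HDBSCAN               # CPU (scikit-learn-contrib) HDBSCAN

# GPU UMAP for the dimensionality-reduction step
umap_model = cuUMAP(n_components=5, n_neighbors=15, min_dist=0.0)

# CPU HDBSCAN; prediction_data=True is what later enables approximate_predict()
hdbscan_model = HDBSCAN(min_cluster_size=15, prediction_data=True)

topic_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model)
topics, probs = topic_model.fit_transform(docs, embeddings=embeddings)
```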

cuML's UMAP only uses the GPU for a short time to process the embeddings (think 20 seconds), and everything else is single-threaded, so it should scale OK-ish on multi-core machines with a single GPU. I believe there is no reason a single machine running multiple BERTopic instances shouldn't do 10k+ documents per second, which is close to a billion per day.

What I'm wondering is: are there better ways of doing this? For example, should I use the topic predictions from fitting BERTopic to train a classifier on top of my own embeddings?
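For what it's worth, a sketch of that classifier idea, assuming `embeddings` and the `topics` returned by `fit_transform` on the subset; LogisticRegression is just a stand-in for whatever classifier suits the data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

topics = np.array(topics)
mask = topics != -1  # optionally drop outlier documents from training

clf = LogisticRegression(max_iter=1000)
clf.fit(embeddings[mask], topics[mask])

# Batch-predict topics for the next chunk of the ~1B precomputed vectors
# (unseen_embeddings is a placeholder for that batch)
new_topics = clf.predict(unseen_embeddings)
```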

If none of this sounds extremely silly, I'll go ahead and try to package BERTopic with Triton as an inference server and see how many errors I get :-)

MaartenGr commented 2 years ago

I would like to train BERTopic over a small subset (1M) of the documents I have and then have it predict on the entire dataset (1B). I already have the embeddings, so I'm just looking for ways to turn them into topics as fast as possible.

That is a rather big dataset, cool! I agree with your approach: training on a relatively small subset and predicting all others. The main thing to look out for is how you initialize that small subset. If you have metadata, make sure that its distribution in the smaller dataset matches that of the entire dataset. That way, you can prevent any bias from training on the smaller subset.
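One way to draw such a subset, assuming a categorical metadata field exists for each document, is stratified sampling, e.g. with sklearn (a sketch, not BERTopic functionality):

```python
from sklearn.model_selection import train_test_split

# docs, embeddings, and metadata are parallel arrays; stratifying on the
# metadata keeps its distribution intact in the training subset
subset_docs, _, subset_embeddings, _ = train_test_split(
    docs, embeddings, train_size=1_000_000,
    stratify=metadata, random_state=42,
)
```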

What I'm wondering is: are there better ways of doing this? For example, should I use the topic predictions from fitting BERTopic to train a classifier on top of my own embeddings?

At that size of data, I am not entirely sure what the main bottleneck would be. During inference, which steps of the algorithm typically take the longest? Perhaps there is something there that we can optimize. I can imagine HDBSCAN being quite slow, since you are using the scikit-learn-contrib version of it. Although not ideal, perhaps it would be interesting to use a different clustering algorithm that is a bit faster at that scale, like cuML's k-Means implementation. However, I would not expect the same performance as what you would normally get with HDBSCAN.
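As a sketch of that swap (BERTopic accepts a cluster model with a fit/predict interface in place of HDBSCAN; the number of clusters here is illustrative):

```python
from bertopic import BERTopic
from cuml.cluster import KMeans

# k-Means gives much faster, GPU-accelerated assignments, but loses
# HDBSCAN's density-based clusters and outlier detection
cluster_model = KMeans(n_clusters=500)
topic_model = BERTopic(hdbscan_model=cluster_model)
```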

Very curious to see how it will work for you, let me know! 😄

cedivad commented 2 years ago

A small update, as requested :)

After training I ended up with something in the region of 12k topics using 12-dimensional embeddings following UMAP reduction (on a 4M sample of the dataset). Yes, I know, this might as well be profanity, but I was quite happy with the quality of the micro-clusters.

I did use Triton for serving as planned, but I had to split the UMAP and HDBSCAN jobs across different servers: thanks to the huge number of clusters and their high input dimensionality, I needed a metric ton of CPU power. I ended up with a single GPU server for UMAP (2x 3090) and many CPU servers (8x 5950X). The bottleneck in this setup appears to be the GPUs, while the CPUs are running at maybe 30% usage.

All this to run at barely 4k inferences per second 😅


For packaging the model for serving, I simply pickled the UMAP and HDBSCAN models off the BERTopic class into a file and wrote my own small inference function. That way I can turn a 20GB saved model into ~2GB, which helps with memory consumption when running multiple copies in parallel.
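In code, that packing step looks roughly like the following (a sketch; `umap_model` and `hdbscan_model` are the attributes under which BERTopic stores its fitted sub-models, and `approximate_predict()` requires HDBSCAN to have been fitted with `prediction_data=True`):

```python
import pickle
from hdbscan import approximate_predict

# Strip the two fitted sub-models off the full BERTopic object
with open("umap.pkl", "wb") as f:
    pickle.dump(topic_model.umap_model, f)
with open("hdbscan.pkl", "wb") as f:
    pickle.dump(topic_model.hdbscan_model, f)

# Minimal inference: reduce the embeddings, then assign clusters
def predict_topics(embeddings, umap_model, hdbscan_model):
    reduced = umap_model.transform(embeddings)
    topics, probabilities = approximate_predict(hdbscan_model, reduced)
    return topics, probabilities
```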

MaartenGr commented 2 years ago

@cedivad Thank you for sharing this with us! Really interesting to see the steps that you needed to take in order to get to that 4k inferences per second.

After training I ended up with something in the region of 12k topics using 12-dimensional embeddings following UMAP reduction (on a 4M sample of the dataset). Yes, I know, this might as well be profanity, but I was quite happy with the quality of the micro-clusters.

To me, definitely not profanity. Especially with 4 million documents, it would not be unexpected to have thousands of topics. I can imagine that lowering that to a couple of hundred topics would result in overly abstract and vague topics.

I did use Triton for serving as planned, but I had to split the UMAP and HDBSCAN jobs across different servers: thanks to the huge number of clusters and their high input dimensionality, I needed a metric ton of CPU power. I ended up with a single GPU server for UMAP (2x 3090) and many CPU servers (8x 5950X). The bottleneck in this setup appears to be the GPUs, while the CPUs are running at maybe 30% usage.

All this to run at barely 4k inferences per second 😅 For packaging the model for serving, I simply pickled the UMAP and HDBSCAN models off the BERTopic class into a file and wrote my own small inference function. That way I can turn a 20GB saved model into ~2GB, which helps with memory consumption when running multiple copies in parallel.

I am seeing more and more users who want fast inference (and potentially also online learning), so it is nice to see the steps you have taken to get to that speed. Seeing as the bottlenecks are mostly on the GPU side, at least with respect to UMAP and HDBSCAN, are there things you would like to see in BERTopic that might speed things up a bit?

cedivad commented 2 years ago

I think BERTopic is surprisingly fast out of the box (at least when you only care about the two sub-models and discard all of the c-TF-IDF data, etc.). Speed depends mostly on the decisions you make during training: you can simply decide against a humongous model when your processing power doesn't allow for it, trading a bit of accuracy for actually being able to run the model at the scale you require.

Above, with my original, much simpler mock model, I mentioned 600k documents being processed by cuML's UMAP in 20 seconds – that's 30k per second on a single GPU. I forget the details, but it was a reasonable model with maybe 1,000 topics, so very applicable to most people; I don't think the GPU is the bottleneck in general. It can become one in some setups, but those should be rare. I think even reversing my decision to use an n_components of 12 (which I chose because it seemed to produce slightly better topics) would have improved speed a ton on both devices.

I know NVIDIA will eventually release a cuML version of approximate_predict(), but I'm beginning to wonder if it's truly needed. Remember, their cuML HDBSCAN is only ~3x faster than the CPU one, and you can parallelise CPU inference across multiple cores, while GPUs are more expensive. I guess we will see when it's released; maybe the gains are better for more complex HDBSCAN models.

I also think I got lucky in having the CPU power available for HDBSCAN, or it would have been an eternal process. I was using 30% average CPU on that cluster, which adds up to maybe 150-200k PassMark points. There are multi-core monsters that fast, but the simplest option remains reducing your model's complexity.

Btw, I will have to retrain, because only ~30% of my dataset was assigned a topic, while it got up to ~55% on the 4M subsample 🤕

cedivad commented 2 years ago

Here is what I worked on; it should help others get started on the way to a billion BERTopic inferences :)

https://github.com/cedivad/BERTopic-deploy

MaartenGr commented 2 years ago

Great, thanks for sharing. This is very helpful to those who are bringing BERTopic into production, especially on such a large scale!

Vathsa28 commented 2 years ago

I too am trying to use BERTopic, in my case for fully qualified domain names. It would be really helpful if someone could help me with this: I get 20M records per day in my organisation, and it is really tough to distribute tasks on my GPU.

ericchagnon15 commented 1 year ago

@cedivad

How did you handle the large number of documents not being assigned to topics? I am working on a similarly large dataset and am running into a similar issue, where 70-80% of my documents are assigned to the outlier topic.