MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

How can I use the GPU when running BERTopic? #545

Closed sxxyxn closed 2 years ago

sxxyxn commented 2 years ago
import pandas as pd
from bertopic import BERTopic
from cuml.cluster import HDBSCAN
from cuml.manifold import UMAP
from konlpy.tag import Mecab
from sklearn.feature_extraction.text import CountVectorizer
from sentence_transformers import SentenceTransformer

df = pd.read_csv('/home/ysy/work/202101_청구내용_1차_전처리.csv', encoding='cp949')

# Tokenize Korean text into nouns with Mecab
def token(text):
    m = Mecab()
    words = m.nouns(text)
    return words

vectorizer = CountVectorizer(tokenizer=token)

# GPU-accelerated UMAP and HDBSCAN from cuML
umap_model = UMAP(n_components=5, n_neighbors=15, min_dist=0.0)
hdbscan_model = HDBSCAN(min_samples=10, gen_min_span_tree=True)
model = BERTopic(embedding_model=SentenceTransformer('sentence-transformers/xlm-r-100langs-bert-base-nli-stsb-mean-tokens'),
                 vectorizer_model=vectorizer,
                 nr_topics=50,
                 top_n_words=10,
                 calculate_probabilities=True,
                 umap_model=umap_model,
                 hdbscan_model=hdbscan_model)

topics, probs = model.fit_transform(df['청구내용'])

This is my code. I'm monitoring GPU usage while BERTopic runs, but the GPU never seems to be used... How can I use the GPU? I also tried without the UMAP and HDBSCAN code, but it behaves the same way... lol

MaartenGr commented 2 years ago

The SentenceTransformer model should automatically select the GPU if it can find one. To check whether a correct CUDA-enabled GPU can be found in your environment, it would be helpful to run the following:

>>> import torch

>>> torch.cuda.is_available()
True

>>> torch.cuda.device_count()
1

>>> torch.cuda.current_device()
0

>>> torch.cuda.device(0)
<torch.cuda.device at 0x7efce0b03be0>

>>> torch.cuda.get_device_name(0)
'GeForce GTX 950M'

This was extracted from this StackOverflow post, which provides a bit more detail.

It should be noted, though, that not all parts of BERTopic use the GPU; by default it is only used when embedding the documents. So it would not be surprising if you only see GPU usage in the early stages of fitting your model.
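If you want to be explicit about the device, you can also compute the embeddings yourself and pass them to BERTopic, which then skips its internal embedding step. A minimal sketch, reusing the df and model from your snippet and assuming a CUDA-enabled PyTorch install:

from bertopic import BERTopic
from sentence_transformers import SentenceTransformer

# Pin the embedding model to the GPU explicitly
sentence_model = SentenceTransformer('sentence-transformers/xlm-r-100langs-bert-base-nli-stsb-mean-tokens',
                                     device='cuda')

# Pre-compute the embeddings on the GPU, then hand them to BERTopic
docs = df['청구내용'].tolist()
embeddings = sentence_model.encode(docs, show_progress_bar=True)

topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs, embeddings)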

sxxyxn commented 2 years ago
>>> torch.cuda.get_device_name(0)
'NVIDIA GeForce RTX 3080'

It prints the GPU's name, but the GPU still doesn't seem to be used while BERTopic is running.. I also tried model.to(device), but it raised an AttributeError 😥 I bought the GPU for BERTopic T.T I really want to see the GPU working

MaartenGr commented 2 years ago

Could you also share the output from the other lines of code (e.g., torch.cuda.is_available, torch.cuda.current_device, etc.)?

It might be worthwhile to check the performance of sentence-transformers. To do so, please run the following:

from sklearn.datasets import fetch_20newsgroups
from sentence_transformers import SentenceTransformer

# Prepare embeddings
docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']
sentence_model = SentenceTransformer("all-MiniLM-L6-v2", device='cuda')
embeddings = sentence_model.encode(docs, show_progress_bar=True)

Could you share how long the encode part of the above takes? The show_progress_bar is set to True to give you an idea of the iterations per second; please also share that value.
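If it helps, one quick way to time just the encode call is to wrap it with time.perf_counter; a small illustration around the snippet above:

import time

start = time.perf_counter()
embeddings = sentence_model.encode(docs, show_progress_bar=True)
print(f"Encoding took {time.perf_counter() - start:.1f} seconds")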

sxxyxn commented 2 years ago
>>> torch.cuda.is_available()
True
>>> torch.cuda.device_count()
1
>>> torch.cuda.current_device()
0
>>> torch.cuda.device(0)
<torch.cuda.device at 0x7f670f52fb20>

and unfortunately my computer has no internet access 😥 so I cannot download fetch_20newsgroups.. but BERTopic is definitely faster with the cuML UMAP and HDBSCAN than without them. I'm just wondering why I can't see my GPU working ..

MaartenGr commented 2 years ago

I am not entirely sure what is happening, but if torch finds the device, it should be used automatically, seeing as CUDA is properly installed. Instead of using fetch_20newsgroups, it might be worthwhile to use a dataset of your own to get an understanding of the speed at which sentence-transformers is working.
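For example, something along these lines with the CSV from your first post; a minimal sketch that uses the multilingual model you already have cached locally and compares the GPU against the CPU (the 1,000-document slice is just to keep the test short):

import time
import pandas as pd
from sentence_transformers import SentenceTransformer

# Use your own documents instead of fetch_20newsgroups
df = pd.read_csv('/home/ysy/work/202101_청구내용_1차_전처리.csv', encoding='cp949')
docs = df['청구내용'].tolist()[:1000]

# Encode the same documents on the GPU and on the CPU; a large gap in
# wall-clock time confirms the GPU is actually being used
for device in ('cuda', 'cpu'):
    model = SentenceTransformer('sentence-transformers/xlm-r-100langs-bert-base-nli-stsb-mean-tokens',
                                device=device)
    start = time.perf_counter()
    model.encode(docs, show_progress_bar=True)
    print(f"{device}: {time.perf_counter() - start:.1f} seconds")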

Another option would be to create a completely fresh environment and install CUDA-enabled torch there. You can find the instructions for installing torch here.

MaartenGr commented 2 years ago

Due to inactivity, I'll be closing this for now. Let me know if you have any other questions related to this and I'll make sure to re-open the issue!