MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

Multi-GPU Utilisation #1837

Open ShabnamRA opened 6 months ago

ShabnamRA commented 6 months ago

Hi Maarten, I'm attempting to execute one of your examples in Google Colab for processing large-scale databases. Here are the specifications of my machine: 8 NVIDIA A100 cards and a 50TB SSD. However, when running the code, it appears to only utilize one of the GPUs. Could you advise on how I can distribute the workload across all 8 cards?

import numpy as np
from torch import cuda

# Detect the available GPUs. A torch device string names a single GPU,
# so only the first one is selected here.
num_gpus = cuda.device_count()
if num_gpus > 0:
    device_ids = list(range(num_gpus))  # GPUs are indexed from 0 to num_gpus - 1
    device = f'cuda:{device_ids[0]}'  # Set the device to the first GPU
    print("Available GPUs:", num_gpus)
    print("Using GPUs:", device_ids)
else:
    device = 'cpu'
    print("CUDA is not available. Using CPU.")

print("Device:", device)
##################
from itertools import islice
from datasets import load_dataset
# Extract the first 1 million records from the streamed dataset
lang = 'en'
data = load_dataset("Cohere/wikipedia-22-12", lang, split='train', streaming=True)
docs = [doc["text"] for doc in islice(data, 1_000_000)]
# Create embeddings (encode() runs on a single device by default)
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
embeddings = model.encode(docs, show_progress_bar=True)
import collections
from tqdm import tqdm
from sklearn.feature_extraction.text import CountVectorizer

# Extract vocab to be used in BERTopic
vocab = collections.Counter()
tokenizer = CountVectorizer().build_tokenizer()
for doc in tqdm(docs):
    vocab.update(tokenizer(doc))
vocab = [word for word, frequency in vocab.items() if frequency >= 15]
len(vocab)

# Train BERTopic
from cuml.manifold import UMAP
from cuml.cluster import HDBSCAN
from bertopic import BERTopic

# Prepare sub-models
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')
umap_model = UMAP(n_components=5, n_neighbors=50, random_state=42, metric="cosine", verbose=True)
hdbscan_model = HDBSCAN(min_samples=20, gen_min_span_tree=True, prediction_data=False, min_cluster_size=20,
                        verbose=True)
vectorizer_model = CountVectorizer(vocabulary=vocab, stop_words="english")

# Fit BERTopic on the precomputed embeddings
topic_model = BERTopic(
    embedding_model=embedding_model,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    vectorizer_model=vectorizer_model,
    verbose=True
).fit(docs, embeddings=embeddings)
MaartenGr commented 6 months ago

That depends on the underlying models you choose and whether they support multi-GPU execution. For instance, I believe cuML's UMAP has a multi-GPU implementation, although I'm not sure whether it covers both training and inference. You would have to check, for each underlying model, whether multi-GPU is supported.
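
As an illustration of checking one underlying model: the embedding step in the question runs through sentence-transformers, which ships a multi-process encoding API. The following is a minimal sketch, not a tested recipe for this setup. It assumes a sentence-transformers version that provides `start_multi_process_pool`, `encode_multi_process`, and `stop_multi_process_pool` (newer releases may prefer a different entry point), and the helper names `target_devices` and `encode_on_all_gpus` are hypothetical. It only spreads the embedding step across GPUs; UMAP and HDBSCAN are unaffected.

```python
def target_devices(num_gpus):
    """Return one device string per GPU, or ['cpu'] when no GPU is present.

    Hypothetical helper, introduced here for illustration only.
    """
    if num_gpus <= 0:
        return ["cpu"]
    return [f"cuda:{i}" for i in range(num_gpus)]


def encode_on_all_gpus(docs, model_name="sentence-transformers/all-MiniLM-L6-v2"):
    """Embed docs with one worker process per visible GPU (hypothetical helper).

    Assumes the installed sentence-transformers exposes the multi-process
    pool API used below.
    """
    # Imported lazily so target_devices() works without these packages.
    import torch
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    devices = target_devices(torch.cuda.device_count())

    # Each device in the list gets its own worker process.
    pool = model.start_multi_process_pool(target_devices=devices)
    try:
        # Documents are chunked and distributed across the workers.
        return model.encode_multi_process(docs, pool)
    finally:
        # Always shut the workers down, even if encoding fails.
        model.stop_multi_process_pool(pool)
```

Because this starts worker processes, a script should call `encode_on_all_gpus(docs)` from inside an `if __name__ == "__main__":` guard. The resulting array can then be passed to `.fit(docs, embeddings=embeddings)` exactly as in the question.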