Closed · KyleX42 closed this 1 year ago
Hmmm, it might have something to do with the `n_neighbors` parameter; I remember that being computationally a bit more expensive. Other than that, it might be worthwhile to also post this on the cuML repo, as they know much more about fine-tuning a cuML model.
Closing this due to inactivity. Let me know if I need to re-open the issue!
Hello Maarten,
I am using BERTopic for topic modeling on a corpus of 30 million sentences. The corpus is about 17 GB in CSV format and grows to roughly 60 GB after the embeddings are created and stored in a pkl file.
My PC has an NVIDIA RTX 3090, an AMD Ryzen 5950X, and 128 GB of RAM (with virtual memory set to three times that). The system is Windows 11 Professional, and I am running Python 3.11 in a VS Code Jupyter Notebook.
In the Windows environment I used the following code:
The problem I encountered is that NN descent in the UMAP model, set to 25 iterations, takes thousands of minutes; it cannot even get past the first iteration after 2,000 minutes. Below is the UMAP verbose output:
Setting `UMAP(low_memory=True)` does not help either; it also runs indefinitely.
Under Linux (WSL 2) with RAPIDS 23.04, I simply replaced HDBSCAN and UMAP with the equivalent functions from RAPIDS cuML. It returns the following errors:
With RAPIDS, it looks like the 24 GB of GPU memory cannot handle this dataset.
I was wondering whether there is an alternative method for the dimensionality-reduction step. Thanks a lot!
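For anyone hitting the same wall, one possible workaround (my suggestion, not something confirmed in this thread): BERTopic's `umap_model` parameter accepts any model exposing `fit`/`transform`, so scikit-learn's `IncrementalPCA`, which fits in mini-batches and never holds the full corpus in memory at once, can stand in for UMAP, at some cost in topic quality.

```python
# A hedged alternative to UMAP for the dimensionality-reduction step:
# IncrementalPCA fits on mini-batches, so memory use stays bounded
# regardless of corpus size. The embeddings below are a random stand-in.
import numpy as np
from sklearn.decomposition import IncrementalPCA

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(10_000, 384)).astype(np.float32)  # stand-in data

ipca = IncrementalPCA(n_components=5, batch_size=1_000)
for start in range(0, len(embeddings), 1_000):
    ipca.partial_fit(embeddings[start:start + 1_000])  # fit one batch at a time

# Transform in batches as well, then stack the reduced chunks.
reduced = np.vstack([
    ipca.transform(embeddings[start:start + 1_000])
    for start in range(0, len(embeddings), 1_000)
])
print(reduced.shape)  # (10000, 5)
```

Being linear, PCA will not separate clusters as sharply as UMAP, but it finishes in minutes rather than days at this scale.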