MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

Dimensionality Reduction Tech Issues: Large Dataset cannot go through the UMAP function #1283

Closed KyleX42 closed 1 year ago

KyleX42 commented 1 year ago

Hello Maarten,

I am using BERTopic for topic modeling on a corpus of 30 million sentences. The corpus is around 17GB in CSV format and around 60GB after creating embeddings and storing them in a pkl file.

My PC has an NVIDIA RTX 3090, an AMD Ryzen 5950X, and 128GB of RAM (virtual memory set to 3 times that). The system is Windows 11 Professional. I am using a Jupyter Notebook in VSCode with Python 3.11.

In the Windows environment I used the following code:

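from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN
from sklearn.feature_extraction.text import CountVectorizer
from bertopic import BERTopic
from bertopic.vectorizers import ClassTfidfTransformer
from bertopic.representation import KeyBERTInspired

# Sub-models passed to BERTopic; parameters are exactly as used in the failing run
sentence_model = SentenceTransformer("all-MiniLM-L6-v2", device='cuda')
umap_model = UMAP(n_neighbors=100, n_components=5, min_dist=0.0, metric='cosine', low_memory=False, verbose=True)
hdbscan_model = HDBSCAN(min_samples=10, gen_min_span_tree=True, prediction_data=True, verbose=True)
vectorizer_model = CountVectorizer(ngram_range=(1, 2), stop_words="english", min_df=5)
ctfidf_model = ClassTfidfTransformer()
representation_model = KeyBERTInspired()

topic_model = BERTopic(
  embedding_model=sentence_model,
  umap_model=umap_model,
  hdbscan_model=hdbscan_model,
  vectorizer_model=vectorizer_model,
  ctfidf_model=ctfidf_model,
  representation_model=representation_model,
  calculate_probabilities=False,
  verbose=True,
  low_memory=True
)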

The problem I encountered is that the NN descent step in the UMAP model (25 iterations) takes thousands of minutes. In fact, it does not even get past the first iteration after 2000 minutes. Below is what the UMAP verbose output shows:

UMAP(angular_rp_forest=True, low_memory=False, metric='cosine', min_dist=0.0, n_components=5, n_neighbors=100, verbose=True)
Fri May 19 10:05:41 2023 Construct fuzzy simplicial set
Fri May 19 10:06:16 2023 Finding Nearest Neighbors
Fri May 19 10:06:29 2023 Building RP forest with 64 trees
Fri May 19 10:56:31 2023 NN descent for 25 iterations
     1  /  25

If I set UMAP(low_memory=True) instead, it also runs without ever finishing.

In Linux (WSL 2) with RAPIDS 23.04, I just replaced HDBSCAN and UMAP with the equivalent classes from RAPIDS cuML. It returns the following error:

---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)
Cell In[4], line 3
      1 # Train our topic model using our pre-trained sentence-transformers embeddings
      2 #topics, probs = topic_model.fit_transform(stored_sentences, stored_embeddings)
----> 3 topics, _ = topic_model.fit_transform(stored_sentences, stored_embeddings)

File ~/anaconda3/envs/rapids-23.04/lib/python3.10/site-packages/bertopic/_bertopic.py:356, in BERTopic.fit_transform(self, documents, embeddings, y)
    354 if self.seed_topic_list is not None and self.embedding_model is not None:
    355     y, embeddings = self._guided_topic_modeling(embeddings)
--> 356 umap_embeddings = self._reduce_dimensionality(embeddings, y)
    358 # Cluster reduced embeddings
    359 documents, probabilities = self._cluster_embeddings(umap_embeddings, documents, y=y)

File ~/anaconda3/envs/rapids-23.04/lib/python3.10/site-packages/bertopic/_bertopic.py:2837, in BERTopic._reduce_dimensionality(self, embeddings, y, partial_fit)
   2834 # Regular fit
   2835 else:
   2836     try:
-> 2837         self.umap_model.fit(embeddings, y=y)
   2838     except TypeError:
   2839         logger.info("The dimensionality reduction algorithm did not contain the `y` parameter and"
   2840                     " therefore the `y` parameter was not used")

File ~/anaconda3/envs/rapids-23.04/lib/python3.10/site-packages/cuml/internals/api_decorators.py:188, in _make_decorator_function.<locals>.decorator_function.<locals>.decorator_closure.<locals>.wrapper(*args, **kwargs)
    185     set_api_output_dtype(output_dtype)
    187 if process_return:
--> 188     ret = func(*args, **kwargs)
    189 else:
    190     return func(*args, **kwargs)

File ~/anaconda3/envs/rapids-23.04/lib/python3.10/site-packages/cuml/internals/api_decorators.py:393, in enable_device_interop.<locals>.dispatch(self, *args, **kwargs)
    391 if hasattr(self, "dispatch_func"):
    392     func_name = gpu_func.__name__
--> 393     return self.dispatch_func(func_name, gpu_func, *args, **kwargs)
    394 else:
    395     return gpu_func(self, *args, **kwargs)

File ~/anaconda3/envs/rapids-23.04/lib/python3.10/site-packages/cuml/internals/api_decorators.py:190, in _make_decorator_function.<locals>.decorator_function.<locals>.decorator_closure.<locals>.wrapper(*args, **kwargs)
    188         ret = func(*args, **kwargs)
    189     else:
--> 190         return func(*args, **kwargs)
    192 return cm.process_return(ret)

File base.pyx:665, in cuml.internals.base.UniversalBase.dispatch_func()

File umap.pyx:545, in cuml.manifold.umap.UMAP.fit()

File ~/anaconda3/envs/rapids-23.04/lib/python3.10/site-packages/nvtx/nvtx.py:101, in annotate.__call__.<locals>.inner(*args, **kwargs)
     98 @wraps(func)
     99 def inner(*args, **kwargs):
    100     libnvtx_push_range(self.attributes, self.domain.handle)
--> 101     result = func(*args, **kwargs)
    102     libnvtx_pop_range(self.domain.handle)
    103     return result

File ~/anaconda3/envs/rapids-23.04/lib/python3.10/site-packages/cuml/internals/input_utils.py:369, in input_to_cuml_array(X, order, deepcopy, check_dtype, convert_to_dtype, check_mem_type, convert_to_mem_type, safe_dtype_conversion, check_cols, check_rows, fail_on_order, force_contiguous)
    281 @nvtx_annotate(
    282     message="common.input_utils.input_to_cuml_array",
    283     category="utils",
   (...)
    298     force_contiguous=True,
    299 ):
    300     """
    301     Convert input X to CumlArray.
    302 
   (...)
    367 
    368     """
--> 369     arr = CumlArray.from_input(
    370         X,
    371         order=order,
    372         deepcopy=deepcopy,
    373         check_dtype=check_dtype,
    374         convert_to_dtype=convert_to_dtype,
    375         check_mem_type=check_mem_type,
    376         convert_to_mem_type=convert_to_mem_type,
    377         safe_dtype_conversion=safe_dtype_conversion,
    378         check_cols=check_cols,
    379         check_rows=check_rows,
    380         fail_on_order=fail_on_order,
    381         force_contiguous=force_contiguous,
    382     )
    383     try:
    384         shape = arr.__cuda_array_interface__["shape"]

File ~/anaconda3/envs/rapids-23.04/lib/python3.10/site-packages/cuml/internals/memory_utils.py:87, in with_cupy_rmm.<locals>.cupy_rmm_wrapper(*args, **kwargs)
     85 if GPU_ENABLED:
     86     with cupy_using_allocator(rmm_cupy_allocator):
---> 87         return func(*args, **kwargs)
     88 return func(*args, **kwargs)

File ~/anaconda3/envs/rapids-23.04/lib/python3.10/site-packages/nvtx/nvtx.py:101, in annotate.__call__.<locals>.inner(*args, **kwargs)
     98 @wraps(func)
     99 def inner(*args, **kwargs):
    100     libnvtx_push_range(self.attributes, self.domain.handle)
--> 101     result = func(*args, **kwargs)
    102     libnvtx_pop_range(self.domain.handle)
    103     return result

File ~/anaconda3/envs/rapids-23.04/lib/python3.10/site-packages/cuml/internals/array.py:1117, in CumlArray.from_input(cls, X, order, deepcopy, check_dtype, convert_to_dtype, check_mem_type, convert_to_mem_type, safe_dtype_conversion, check_cols, check_rows, fail_on_order, force_contiguous)
   1109         if (
   1110             (X < target_dtype_range.min) | (X > target_dtype_range.max)
   1111         ).any():
   1112             raise TypeError(
   1113                 "Data type conversion on values outside"
   1114                 " representable range of target dtype"
   1115             )
   1116     arr = cls(
-> 1117         arr.to_output(
   1118             output_dtype=convert_to_dtype,
   1119             output_mem_type=convert_to_mem_type,
   1120         ),
   1121         order=requested_order,
   1122         index=index,
   1123         validate=False,
   1124     )
   1126 make_copy = force_contiguous and not arr.is_contiguous
   1128 if (
   1129     not fail_on_order and order != arr.order and order != "K"
   1130 ) or make_copy:

File ~/anaconda3/envs/rapids-23.04/lib/python3.10/site-packages/cuml/internals/memory_utils.py:87, in with_cupy_rmm.<locals>.cupy_rmm_wrapper(*args, **kwargs)
     85 if GPU_ENABLED:
     86     with cupy_using_allocator(rmm_cupy_allocator):
---> 87         return func(*args, **kwargs)
     88 return func(*args, **kwargs)

File ~/anaconda3/envs/rapids-23.04/lib/python3.10/site-packages/nvtx/nvtx.py:101, in annotate.__call__.<locals>.inner(*args, **kwargs)
     98 @wraps(func)
     99 def inner(*args, **kwargs):
    100     libnvtx_push_range(self.attributes, self.domain.handle)
--> 101     result = func(*args, **kwargs)
    102     libnvtx_pop_range(self.domain.handle)
    103     return result

File ~/anaconda3/envs/rapids-23.04/lib/python3.10/site-packages/cuml/internals/array.py:625, in CumlArray.to_output(self, output_type, output_dtype, output_mem_type)
    618             return np.asarray(
    619                 self, dtype=output_dtype, order=self.order
    620             )
    621         return cp.asnumpy(
    622             cp.asarray(self, dtype=output_dtype, order=self.order),
    623             order=self.order,
    624         )
--> 625     return output_mem_type.xpy.asarray(
    626         self, dtype=output_dtype, order=self.order
    627     )
    629 elif output_type == "numba":
    630     return cuda.as_cuda_array(
    631         cp.asarray(self, dtype=output_dtype, order=self.order)
    632     )

File ~/anaconda3/envs/rapids-23.04/lib/python3.10/site-packages/cupy/_creation/from_data.py:76, in asarray(a, dtype, order)
     49 def asarray(a, dtype=None, order=None):
     50     """Converts an object to array.
     51 
     52     This is equivalent to ``array(a, dtype, copy=False)``.
   (...)
     74 
     75     """
---> 76     return _core.array(a, dtype, False, order)

File cupy/_core/core.pyx:2360, in cupy._core.core.array()

File cupy/_core/core.pyx:2384, in cupy._core.core.array()

File cupy/_core/core.pyx:2516, in cupy._core.core._array_default()

File cupy/_core/core.pyx:136, in cupy._core.core.ndarray.__new__()

File cupy/_core/core.pyx:224, in cupy._core.core._ndarray_base._init()

File cupy/cuda/memory.pyx:742, in cupy.cuda.memory.alloc()

File ~/anaconda3/envs/rapids-23.04/lib/python3.10/site-packages/rmm/allocators/cupy.py:37, in rmm_cupy_allocator(nbytes)
     34     raise ModuleNotFoundError("No module named 'cupy'")
     36 stream = Stream(obj=cupy.cuda.get_current_stream())
---> 37 buf = librmm.device_buffer.DeviceBuffer(size=nbytes, stream=stream)
     38 dev_id = -1 if buf.ptr else cupy.cuda.device.get_device_id()
     39 mem = cupy.cuda.UnownedMemory(
     40     ptr=buf.ptr, size=buf.size, owner=buf, device_id=dev_id
     41 )

File device_buffer.pyx:85, in rmm._lib.device_buffer.DeviceBuffer.__cinit__()

MemoryError: std::bad_alloc: out_of_memory: CUDA error at: /home/kylex42/anaconda3/envs/rapids-23.04/include/rmm/mr/device/cuda_memory_resource.hpp

With RAPIDS, it looks like the 24GB of GPU memory cannot hold this dataset.
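
A rough back-of-the-envelope check of that claim, as a minimal sketch. It assumes the embeddings are stored as float32 with 384 dimensions (the output size of all-MiniLM-L6-v2); neither the dtype nor the array shape is stated in this thread, and the helper name `embedding_bytes` is made up for illustration. It also ignores any working memory the GPU UMAP needs on top of the input copy.

```python
def embedding_bytes(n_rows: int, n_dims: int, bytes_per_value: int = 4) -> int:
    """Size in bytes of a dense n_rows x n_dims matrix (float32 by default)."""
    return n_rows * n_dims * bytes_per_value


N_SENTENCES = 30_000_000      # corpus size reported above
EMBEDDING_DIM = 384           # assumption: all-MiniLM-L6-v2 output dimension
GPU_BYTES = 24 * 1024**3      # 24 GiB on the RTX 3090

needed = embedding_bytes(N_SENTENCES, EMBEDDING_DIM)
fits_on_gpu = needed <= GPU_BYTES

print(f"embeddings: {needed / 1024**3:.1f} GiB, GPU: {GPU_BYTES / 1024**3:.0f} GiB")
print("fits on GPU" if fits_on_gpu else "does not fit on GPU")
```

Under these assumptions the input matrix alone is larger than the card's memory, which would be consistent with the allocation failing at the point where the array is copied to the device.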

I was wondering if there is any alternative way to get through the dimensionality reduction step. Thanks a lot!
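
One commonly used workaround for this kind of problem, not confirmed anywhere in this thread, is to fit the reducer on a random subsample and then transform the full dataset in batches. The sketch below is library-agnostic: `fit_on_sample_then_transform` is a hypothetical helper, and `reducer` is assumed to be any object with scikit-learn style `fit` and `transform` methods. How to wire the reduced output back into BERTopic is not covered here.

```python
import random
from typing import Any, Iterator, Sequence


def fit_on_sample_then_transform(
    reducer: Any,
    rows: Sequence[Any],
    sample_size: int,
    batch_size: int,
    seed: int = 42,
) -> Iterator[Any]:
    """Fit `reducer` on a random subset of `rows`, then yield reduced batches.

    `reducer` only needs `fit(X)` and `transform(X)` methods.
    """
    rng = random.Random(seed)
    n_rows = len(rows)

    # Draw the subsample without replacement; keep indices sorted for locality.
    sample_idx = sorted(rng.sample(range(n_rows), min(sample_size, n_rows)))
    reducer.fit([rows[i] for i in sample_idx])

    # Transform everything in fixed-size batches so only one batch is
    # processed at a time; the last batch may be shorter.
    for start in range(0, n_rows, batch_size):
        yield reducer.transform(rows[start:start + batch_size])
```

Because the function is a generator, each reduced batch can be written to disk or appended to a preallocated array as it arrives instead of holding all intermediate results at once.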

MaartenGr commented 1 year ago

Hmmm, it might have something to do with the n_neighbors parameter. I remember larger values being computationally more expensive. Other than that, it might be worthwhile to also post this on the cuML repo, as they know much more about fine-tuning a cuML model.
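
To give a feel for why n_neighbors matters at this scale, here is a rough estimate of the size of the k-nearest-neighbor graph alone. It is only a sketch: it assumes one 4-byte index and one 4-byte distance per neighbor, which is an assumption about the storage layout rather than something stated in this thread, and `knn_graph_bytes` is a made-up helper name. It compares the n_neighbors=100 used above with 15, the default in umap-learn.

```python
def knn_graph_bytes(
    n_rows: int,
    n_neighbors: int,
    index_bytes: int = 4,      # assumption: 32-bit neighbor indices
    distance_bytes: int = 4,   # assumption: 32-bit float distances
) -> int:
    """Bytes needed to store neighbor indices and distances for every row."""
    return n_rows * n_neighbors * (index_bytes + distance_bytes)


N_ROWS = 30_000_000  # corpus size reported above

for k in (100, 15):
    size_gb = knn_graph_bytes(N_ROWS, k) / 1e9
    print(f"n_neighbors={k}: about {size_gb:.1f} GB for the kNN graph")
```

The graph size grows linearly with n_neighbors, so dropping from 100 to 15 shrinks this structure by a factor of about 6.7, before counting the extra distance computations each NN descent iteration performs.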

MaartenGr commented 1 year ago

Closing this due to inactivity. Let me know if I need to re-open the issue!