No problem, glad I could be of help!
There are several ways to perform computation with large datasets. First, you can set low_memory to True when instantiating BERTopic. This may prevent blowing up the memory in UMAP.
Second, setting calculate_probabilities to False when instantiating BERTopic prevents a huge document-topic probability matrix from being created. Moreover, HDBSCAN is quite slow when it tries to calculate probabilities on large datasets.
Third, you can set the minimum frequency of words in the CountVectorizer class to reduce the size of the resulting sparse c-TF-IDF matrix. You can do this as follows:
from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer
vectorizer_model = CountVectorizer(ngram_range=(1, 2), stop_words="english", min_df=10)
topic_model = BERTopic(vectorizer_model=vectorizer_model)
The min_df parameter is used to indicate the minimum frequency of words. Setting this value larger than 1 can significantly reduce memory.
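For reference, a minimal sketch that combines the three settings above (the parameter values are only illustrative):
from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer

# min_df > 1 shrinks the vocabulary and thus the sparse c-TF-IDF matrix
vectorizer_model = CountVectorizer(ngram_range=(1, 2), stop_words="english", min_df=10)

topic_model = BERTopic(
    low_memory=True,                # more memory-friendly UMAP
    calculate_probabilities=False,  # skip the huge document-topic probability matrix
    vectorizer_model=vectorizer_model,
)
topics, _ = topic_model.fit_transform(docs)  # docs: your list of documents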
Lastly, and this is a bit on the nose: simply use a machine with more RAM available. Some machines are simply not meant to process such large datasets or memory-intensive algorithms, and switching to a larger machine, if one is available, could help.
Also, make sure you do not actually visualize all 1.5 million points in the visualization I shared with you. Simply take a weighted sample across all topics (e.g., 10%) and visualize those. Otherwise, matplotlib might have some issues plotting all those points.
Hopefully, this helps a bit!
Hello, I am planning to train a model on millions of documents, but it takes far too long. I tried splitting the texts into several smaller lists and training the model incrementally over them, but transform does not let me train the model incrementally. Is there another way to do this?
@TigerShuai Hopefully, Google Translate was accurate in translating your issue. It seems that you want to iteratively train a BERTopic model since you have too many documents that take too long to train.
Unfortunately, this is not supported and is unlikely to be supported in the future as the model performs best when you use all documents. Having said that, I would advise several things. First, make sure you use a strong GPU. This will speed up the training procedure quite a bit. Second, set low_memory=True if you are experiencing memory issues. Third, set calculate_probabilities=False as that is a very slow procedure. Finally, I would advise you to use verbose=True and see where the training slows down. If I know what takes so long, perhaps I can propose a solution!
Thank you Maarten for your graciousness! I am just curious about two things:
Also, make sure you do not actually visualize all 1.5 million points in the visualization I shared with you. Simply take a weighted sample across all topics (e.g., 10%) and visualize those. Otherwise, matplotlib might have some issues plotting all those points.
Could you, if possible, give an example of how to save the (huge) model and visualise it like you say?
It would be an awesome thing (and I already feel like sending you a gift now :) You inspire us!
You can save the model with:
from bertopic import BERTopic
topic_model = BERTopic().fit(docs)
topic_model.save("my_model")
Then, you can use the saved model to visualize the topics. You can find a bit more about saving and loading in the documentation here.
The part about the weighted sample is something you will have to code yourself. Adjust this code so that it only selects a subset of documents to visualize (e.g., 100,000 documents instead of 1.3 million). To do so, add df = df.sample(n=100_000) directly after df["topic"] = topics so that only a sample of the documents is visualized.
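A rough sketch of what that could look like, assuming you already have the 2D coordinates and topics from the linked example (the dataframe construction, column names, and plotting parameters here are assumptions, not the exact code from that comment):
import pandas as pd
import matplotlib.pyplot as plt
from bertopic import BERTopic

# Load the previously saved model
topic_model = BERTopic.load("my_model")

# umap_embeddings: 2D coordinates for plotting; topics: the topic per document
df = pd.DataFrame(umap_embeddings, columns=["x", "y"])
df["topic"] = topics

# Take a ~10% sample per topic instead of plotting all 1.5 million points
df = df.groupby("topic").sample(frac=0.1, random_state=42)

plt.scatter(df.x, df.y, c=df.topic, s=0.05, cmap="hsv_r")
plt.show()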
Hey Maarten,
Thanks for your 'ever graciousness', a big thank you for that once again, Sir!
Can you give a pointer on how I can load the documents into the model? I have 1.5 million sentences (documents) which are in a text file (or I can put them in multiple text files).
I see that the model takes one argument when loading the documents, like here:
topics, _ = topic_model.fit_transform(docs)
What I mean is: before, I would simply load the documents (up to, say, 32,000 of them) into a list and then pass that list as the docs argument. I wonder whether the same may be done for all 1.5 million documents, or maybe you can suggest some other way.
still waiting on this btw,
It would be an awesome thing (and I already feel like sending you a gift now :) (https://github.com/MaartenGr/BERTopic/issues/151#issuecomment-870529679)
Personally, I would simply put all those 1.5 million sentences (documents) in a list and then pass that list as the docs argument. If you have enough RAM available, this should be no issue. If, however, you run into memory issues, then I would advise you to look here for a few tips on how to run BERTopic on large data.
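For example, a minimal sketch of reading all sentences from one or more text files into a single list (the file paths here are placeholders):
from pathlib import Path
from bertopic import BERTopic

docs = []
for path in Path("data").glob("*.txt"):  # or a single large text file
    with open(path, encoding="utf-8") as f:
        docs.extend(line.strip() for line in f if line.strip())

topic_model = BERTopic(verbose=True, low_memory=True, calculate_probabilities=False)
topics, _ = topic_model.fit_transform(docs)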
It would be an awesome thing (and I already feel like sending you a gift now :) (#151 (comment))
Don't worry about that! I'm just glad that I can help out.
Is it possible to fit first and then transform the documents in small chunks (i.e., not use fit_transform, but call fit first and then call transform on smaller chunks of data)? @MaartenGr
If I have 1.5 million sentences, for example, can I fit with all 1.5 million sentences and then transform 500k sentences at a time, three times?
@ginward You can definitely fit the model once on a subset of the data and simply transform for all others. Typically, you can get away with a few hundred thousand documents. You really do not need to train on millions of sentences to improve the model as sufficient data is most likely already given.
Thus, you can fit on 200,000 sentences and simply predict the other 1.3 million sentences.
The only thing that you should take into account is selecting those 200,000 sentences. If you are looking for very specific topics that are likely to only appear a few thousand times, then there is a good chance that you will not capture those in the model. Thus, proper sampling here is key.
@MaartenGr I see. So should I call fit on the 200,000 sentences, and then call transform on the 1.3 million sentences?
But if transform takes a lot of memory, can I transform in smaller chunks (such as several chunks of 200,000 summing up to 1.3 million sentences) and then combine the results?
Yes, you can fit on the 200,000 sentences and then call transform on the remaining 1.3 million sentences, as long as you are sure that the 200,000 sentences are a good representation of the remaining 1.3 million.
The fit stage can take a lot of memory, whereas the transform stage should be much lighter. It should be okay to transform them all at once. However, if you are still experiencing memory issues, there should be no issue in separating them into smaller chunks and combining the results.
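A rough sketch of that fit-on-a-subset, transform-in-chunks approach (the sample size and chunk size are only illustrative):
import random
from bertopic import BERTopic

# Fit once on a representative subset of the corpus
sample_docs = random.sample(docs, 200_000)
topic_model = BERTopic(verbose=True, calculate_probabilities=False).fit(sample_docs)

# Predict the topics of the full corpus in chunks and combine the results
chunk_size = 500_000
all_topics = []
for i in range(0, len(docs), chunk_size):
    chunk_topics, _ = topic_model.transform(docs[i:i + chunk_size])
    all_topics.extend(chunk_topics)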
@MaartenGr I see. Is it possible to further reduce the memory usage by tuning the hyperparameters of UMAP (i.e., reducing the dimensionality of the document embeddings further) or HDBSCAN (fewer clusters)? And why did you use HDBSCAN rather than a k-nearest-neighbours-style algorithm?
For example, if K-means consumes less memory, can we use K-means instead of HDBSCAN?
It seems that the memory issues occur not in the sentence embedding stage or the UMAP stage, but in the HDBSCAN stage. I currently have about 10 million short sentences. I think it is at the final stage that the memory usage shoots up.
There are a few tricks you can do with respect to UMAP and HDBSCAN; they are outlined here. In practice, there are a number of places where memory consumption may increase (UMAP, HDBSCAN, c-TF-IDF, etc.) that you can optimize with the tips in the link above.
Swapping out HDBSCAN for k-Means will result in a significantly less accurate model. There are quite a few benefits to HDBSCAN over k-Means, including outlier detection, its hierarchical nature, its handling of varying densities, etc.
With 10 million sentences, I would advise not trying to optimize the algorithms themselves but focusing on how you apply BERTopic. Like you mentioned, fit on a subset and predict for all others.
@MaartenGr What if I reduce the UMAP output dimensionality to 2 (in the source code it is set to five by default)? Would that relieve some of the burden that HDBSCAN bears?
That would likewise reduce the quality of the model and is not something I would recommend. Since you have millions of data points, I would instead advise not training on the entire dataset in order to lower the memory requirements. That seems to be the most efficient way of handling this without the need to optimize/change/adapt the sub-algorithms.
@MaartenGr Thanks. What is the maximum number of sentences that the model can handle from your experience?
This is a difficult question to answer since it highly depends on your hardware specs. A free Google Colab session handles a couple of hundred thousand sentences without problems but runs into issues when you approach a million. However, there are plenty of organizations (including where I work currently) that can handle a couple of million sentences without any problems.
Also, it depends on the length of the sentences, the number of words, vocabulary size, etc.
@MaartenGr Is there also a way to separate the process of sentence embedding, UMAP and HDBSCAN by saving the intermediary models? If the memory blows up at the last stage (HDBSCAN), I would need to re-do the sentence embedding and UMAP parts, and it is going to take another few hours.
I currently have a Colab Pro+ subscription with 55GB of RAM, and the model seems to work through the sentence embedding stage and the UMAP stage quite well, but dies at the very last stage, HDBSCAN.
I have HPC access to a machine with 4 GPUs and 96GB of RAM in total, but I can only use one GPU, and it still blows up at the very last stage.
I am not sure whether a single GPU card can use all of the 96GB of RAM available in the machine, as the other 48GB is in the other three GPU cards. Nevertheless, the model still blows up at the last stage. @MaartenGr
You can try to embed the sentences beforehand by following this piece of documentation. After that, you can simply save the embeddings and load them in when necessary. There currently is not an implementation for UMAP.
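For instance, a minimal sketch of pre-computing and saving the embeddings (the model name and file name are just examples):
import numpy as np
from sentence_transformers import SentenceTransformer
from bertopic import BERTopic

# Compute the embeddings once; this is the only GPU-bound step
sentence_model = SentenceTransformer("paraphrase-MiniLM-L3-v2")
embeddings = sentence_model.encode(docs, show_progress_bar=True)
np.save("embeddings.npy", embeddings)

# Later, possibly on another machine, load them and pass them to BERTopic
embeddings = np.load("embeddings.npy")
topic_model = BERTopic(verbose=True, low_memory=True, calculate_probabilities=False)
topics, _ = topic_model.fit_transform(docs, embeddings)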
I currently have a Colab Pro+ subscription with 55GB of RAM, and the model seems to work through the sentence embedding stage and the UMAP stage quite well, but dies at the very last stage, HDBSCAN. I have HPC access to a machine with 4 GPUs and 96GB of RAM in total, but I can only use one GPU, and it still blows up at the very last stage.
The "last stage" technically is not HDBSCAN but topic extraction with c-TF-IDF and MMR. Having said that, I cannot judge what exactly is happening here without knowing the code you are using. Could you share the code for training BERTopic? Also, if you have set verbose=True
, what has it printed until you get the memory issues?
I have HPC access to a machine with 4 GPUs and 96GB of RAM in total, but I can only use one GPU, and it still blows up at the very last stage.
Do you mean VRAM or RAM? HDBSCAN is not gpu-accelerated.
@MaartenGr If only the sentence transformer part is done on GPU, can I train the embeddings first and then run the other parts on a machine with only CPU access? I have a machine with 128 GB CPU RAM, which might just work OK.
@MaartenGr It is 96GB RAM and 16GB VRAM. Apparently 96GB RAM is not enough to get the 10 million sentences done.
I am using a customised dataset, but the code is here:
vectorizer_model = CountVectorizer(ngram_range=(1, 2), stop_words="english", min_df=20)
umap_model = UMAP(n_neighbors=15, n_components=3, min_dist=0.0, metric='cosine', low_memory = True)
# Setting HDBSCAN model
hdbscan_model = HDBSCAN(min_cluster_size=10, umap_model = umap_model, metric='euclidean', cluster_selection_method='eom', prediction_data=True)
topic_model = BERTopic(verbose=True, seed_topic_list=seed_topic_list, embedding_model="paraphrase-MiniLM-L3-v2", low_memory=True, calculate_probabilities=False, vectorizer_model=vectorizer_model)
I have set min_df=20, which is a very large threshold.
@MaartenGr Would setting ngram_range=(1, 1) help though? It might reduce the TF-IDF matrix size.
Setting ngram_range=(1, 1) would help but reduces the ease of interpretation and the quality of the topic representations, since 2-grams often give interesting insights. For millions of sentences, min_df=20 is not actually a very large threshold; I think it should pose no issue to set it to at least 100. If you have millions of sentences, then the frequencies of words in your vocabulary tend to be quite large.
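For instance, something like the following (the exact threshold depends on your corpus):
from sklearn.feature_extraction.text import CountVectorizer

vectorizer_model = CountVectorizer(ngram_range=(1, 2), stop_words="english", min_df=100)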
If only the sentence transformer part is done on GPU, can I train the embeddings first and then run the other parts on a machine with only CPU access? I have a machine with 128 GB CPU RAM, which might just work OK.
Yes, only the embedding part benefits from having a GPU.
I think it crashed at the UMAP stage @MaartenGr:
---------------------------------------------------------------------------
MemoryError Traceback (most recent call last)
/tmp/ipykernel_159881/2313223528.py in <module>
1 #topics, probs = topic_model.fit_transform(docs)
2
----> 3 topic_model = topic_model.fit(docs, embeddings)
/rds/user/jw983/hpc-work/bertcpu_env/lib/python3.9/site-packages/bertopic/_bertopic.py in fit(self, documents, embeddings, y)
210 ```
211 """
--> 212 self.fit_transform(documents, embeddings, y)
213 return self
214
/rds/user/jw983/hpc-work/bertcpu_env/lib/python3.9/site-packages/bertopic/_bertopic.py in fit_transform(self, documents, embeddings, y)
286 if self.seed_topic_list is not None and self.embedding_model is not None:
287 y, embeddings = self._guided_topic_modeling(embeddings)
--> 288 umap_embeddings = self._reduce_dimensionality(embeddings, y)
289
290 # Cluster UMAP embeddings with HDBSCAN
/rds/user/jw983/hpc-work/bertcpu_env/lib/python3.9/site-packages/bertopic/_bertopic.py in _reduce_dimensionality(self, embeddings, y)
1364 low_memory=self.low_memory).fit(embeddings, y=y)
1365 else:
-> 1366 self.umap_model.fit(embeddings, y=y)
1367 umap_embeddings = self.umap_model.transform(embeddings)
1368 logger.info("Reduced dimensionality with UMAP")
/rds/user/jw983/hpc-work/bertcpu_env/lib/python3.9/site-packages/umap/umap_.py in fit(self, X, y)
2551
2552 if self.transform_mode == "embedding":
-> 2553 self.embedding_, aux_data = self._fit_embed_data(
2554 self._raw_data[index], n_epochs, init, random_state, # JH why raw data?
2555 )
/rds/user/jw983/hpc-work/bertcpu_env/lib/python3.9/site-packages/umap/umap_.py in _fit_embed_data(self, X, n_epochs, init, random_state)
2578 replaced by subclasses.
2579 """
-> 2580 return simplicial_set_embedding(
2581 X,
2582 self.graph_,
/rds/user/jw983/hpc-work/bertcpu_env/lib/python3.9/site-packages/umap/umap_.py in simplicial_set_embedding(data, graph, n_components, initial_alpha, a, b, gamma, negative_sample_rate, n_epochs, init, random_state, metric, metric_kwds, densmap, densmap_kwds, output_dens, output_metric, output_metric_kwds, euclidean_output, parallel, verbose)
1052 elif isinstance(init, str) and init == "spectral":
1053 # We add a little noise to avoid local minima for optimization to come
-> 1054 initialisation = spectral_layout(
1055 data,
1056 graph,
/rds/user/jw983/hpc-work/bertcpu_env/lib/python3.9/site-packages/umap/spectral.py in spectral_layout(data, graph, dim, random_state, metric, metric_kwds)
299
300 if n_components > 1:
--> 301 return multi_component_layout(
302 data,
303 graph,
/rds/user/jw983/hpc-work/bertcpu_env/lib/python3.9/site-packages/umap/spectral.py in multi_component_layout(data, graph, n_components, component_labels, dim, random_state, metric, metric_kwds)
236 num_lanczos_vectors = max(2 * k + 1, int(np.sqrt(component_graph.shape[0])))
237 try:
--> 238 eigenvalues, eigenvectors = scipy.sparse.linalg.eigsh(
239 L,
240 k,
/rds/user/jw983/hpc-work/bertcpu_env/lib/python3.9/site-packages/scipy/sparse/linalg/eigen/arpack/arpack.py in eigsh(A, k, M, sigma, which, v0, ncv, maxiter, tol, return_eigenvectors, Minv, OPinv, mode)
1682 raise ValueError("unrecognized mode '%s'" % mode)
1683
-> 1684 params = _SymmetricArpackParams(n, k, A.dtype.char, matvec, mode,
1685 M_matvec, Minv_matvec, sigma,
1686 ncv, v0, maxiter, which, tol)
/rds/user/jw983/hpc-work/bertcpu_env/lib/python3.9/site-packages/scipy/sparse/linalg/eigen/arpack/arpack.py in __init__(self, n, k, tp, matvec, mode, M_matvec, Minv_matvec, sigma, ncv, v0, maxiter, which, tol)
510 raise ValueError("k must be less than ndim(A), k=%d" % k)
511
--> 512 _ArpackParams.__init__(self, n, k, tp, mode, sigma,
513 ncv, v0, maxiter, which, tol)
514
/rds/user/jw983/hpc-work/bertcpu_env/lib/python3.9/site-packages/scipy/sparse/linalg/eigen/arpack/arpack.py in __init__(self, n, k, tp, mode, sigma, ncv, v0, maxiter, which, tol)
340 ncv = min(ncv, n)
341
--> 342 self.v = np.zeros((n, ncv), tp) # holds Ritz vectors
343 self.iparam = np.zeros(11, arpack_int)
344
MemoryError: Unable to allocate 294. GiB for an array with shape (11580087, 3402) and data type float64
@MaartenGr What I don't understand is why the dimensionality of the UMAP matrix is (11580087, 3402). I understand that 11580087 is the number of documents, but shouldn't the size of the embedding be 382? How come it is 3402?
I also see here that the official website recommends the following procedure:
Consider a typical pipeline: high-dimensional embedding (300+) => PCA to reduce to 50 dimensions => UMAP to reduce to 10-20 dimensions => HDBSCAN for clustering / some plain algorithm for classification.
This means adding a PCA layer before the UMAP layer. Would that sufficiently reduce the memory usage? @MaartenGr
For the specifics regarding UMAP and its intermediate steps, I would refer you to the corresponding issues page of UMAP.
Introducing PCA into the pipeline is actually likely to degrade performance. Although PCA can remove noise which might actually benefit UMAP, there is a high possibility of removing too much information in the reduction process. For example, if you have embeddings of size 368, then what should the reduced dimensionality be before applying UMAP? And what if the initial dimensionality increases? In other words, it becomes difficult to guarantee some stability of BERTopic.
Having said that, you are free to try it out yourself. However, a smaller subset of your dataset, which would still be quite large, should already be more than sufficient in capturing the necessary information/clusters.
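If you do want to experiment with that pipeline yourself, a rough sketch outside of BERTopic could look like this (the dimensionalities follow the quoted recommendation and are not BERTopic defaults):
from sklearn.decomposition import PCA
from umap import UMAP
from hdbscan import HDBSCAN

# embeddings: the precomputed sentence embeddings (n_documents x embedding_dim)
reduced = PCA(n_components=50).fit_transform(embeddings)

umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', low_memory=True)
umap_embeddings = umap_model.fit_transform(reduced)

hdbscan_model = HDBSCAN(min_cluster_size=10, metric='euclidean', cluster_selection_method='eom')
clusters = hdbscan_model.fit_predict(umap_embeddings)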
@MaartenGr I currently found a machine with 1TB of RAM, and it is no longer throwing memory errors on my 10 million sentences. However, it seems that the UMAP step takes a very long time to run, even if I use PCA to reduce the dimensionality to 50 before UMAP. Any idea why?
I also have 128 CPU cores on the machine, but it seems that most CPU resources are just idle.
UMAP can be quite expensive with its approximate nearest neighbor search. It is not surprising that, with 10 million sentences each represented by a high-dimensional vector, UMAP may take some time to complete. With respect to the CPU cores, I cannot give a definitive answer without understanding the machine and setup. Having said that, these questions seem to be rather specific to UMAP. I would advise you to post them on the issues page of UMAP as I am quite sure they could provide you with a much better answer than I can.
@MaartenGr After setting init to 'random' in UMAP, I seem to have passed UMAP's initialisation stage. However, it now throws the following error in HDBSCAN:
Any idea why?
This is related to this issue.
UMAP(angular_rp_forest=True, dens_frac=0.0, dens_lambda=0.0, init='random',
low_memory=False, metric='cosine', min_dist=0.0, n_components=5,
verbose=True)
Construct fuzzy simplicial set
Fri Oct 1 05:16:34 2021 Finding Nearest Neighbors
Fri Oct 1 05:16:36 2021 Building RP forest with 64 trees
Fri Oct 1 05:24:38 2021 NN descent for 23 iterations
1 / 23
2 / 23
3 / 23
4 / 23
5 / 23
6 / 23
7 / 23
8 / 23
9 / 23
10 / 23
Stopping threshold met -- exiting after 10 iterations
Fri Oct 1 05:56:27 2021 Finished Nearest Neighbor Search
Fri Oct 1 05:57:32 2021 Construct embedding
completed 0 / 200 epochs
completed 20 / 200 epochs
completed 40 / 200 epochs
completed 60 / 200 epochs
completed 80 / 200 epochs
completed 100 / 200 epochs
completed 120 / 200 epochs
completed 140 / 200 epochs
completed 160 / 200 epochs
completed 180 / 200 epochs
Fri Oct 1 06:31:02 2021 Finished embedding
2021-10-01 06:31:57,225 - BERTopic - Reduced dimensionality with UMAP
---------------------------------------------------------------------------
_RemoteTraceback Traceback (most recent call last)
_RemoteTraceback:
"""
Traceback (most recent call last):
File "/rds/user/jw983/hpc-work/pytorch-env2/lib/python3.9/site-packages/joblib/externals/loky/process_executor.py", line 404, in _process_worker
call_item = call_queue.get(block=True, timeout=timeout)
File "/usr/local/software/spack/spack-git/opt/spack/linux-rhel7-broadwell/gcc-5.4.0/python-3.9.6-sbr552hsx3zanhgi3ekdjp4rsn6o6ejq/lib/python3.9/multiprocessing/queues.py", line 122, in get
return _ForkingPickler.loads(res)
File "sklearn/neighbors/_binary_tree.pxi", line 1057, in sklearn.neighbors._kd_tree.BinaryTree.__setstate__
File "sklearn/neighbors/_binary_tree.pxi", line 999, in sklearn.neighbors._kd_tree.BinaryTree._update_memviews
File "stringsource", line 658, in View.MemoryView.memoryview_cwrapper
File "stringsource", line 349, in View.MemoryView.memoryview.__cinit__
ValueError: buffer source array is read-only
"""
The above exception was the direct cause of the following exception:
BrokenProcessPool Traceback (most recent call last)
/tmp/ipykernel_778601/2313223528.py in <module>
1 #topics, probs = topic_model.fit_transform(docs)
2
----> 3 topic_model = topic_model.fit(docs, embeddings)
/rds/user/jw983/hpc-work/pytorch-env2/lib/python3.9/site-packages/bertopic/_bertopic.py in fit(self, documents, embeddings, y)
210 ```
211 """
--> 212 self.fit_transform(documents, embeddings, y)
213 return self
214
/rds/user/jw983/hpc-work/pytorch-env2/lib/python3.9/site-packages/bertopic/_bertopic.py in fit_transform(self, documents, embeddings, y)
289
290 # Cluster UMAP embeddings with HDBSCAN
--> 291 documents, probabilities = self._cluster_embeddings(umap_embeddings, documents)
292
293 # Sort and Map Topic IDs by their frequency
/rds/user/jw983/hpc-work/pytorch-env2/lib/python3.9/site-packages/bertopic/_bertopic.py in _cluster_embeddings(self, umap_embeddings, documents)
1384 probabilities: The distribution of probabilities
1385 """
-> 1386 self.hdbscan_model.fit(umap_embeddings)
1387 documents['Topic'] = self.hdbscan_model.labels_
1388 probabilities = self.hdbscan_model.probabilities_
/rds/user/jw983/hpc-work/pytorch-env2/lib/python3.9/site-packages/hdbscan/hdbscan_.py in fit(self, X, y)
917 self._condensed_tree,
918 self._single_linkage_tree,
--> 919 self._min_spanning_tree) = hdbscan(X, **kwargs)
920
921 if self.prediction_data:
/rds/user/jw983/hpc-work/pytorch-env2/lib/python3.9/site-packages/hdbscan/hdbscan_.py in hdbscan(X, min_cluster_size, min_samples, alpha, cluster_selection_epsilon, metric, p, leaf_size, algorithm, memory, approx_min_span_tree, gen_min_span_tree, core_dist_n_jobs, cluster_selection_method, allow_single_cluster, match_reference_implementation, **kwargs)
608 gen_min_span_tree, **kwargs)
609 else:
--> 610 (single_linkage_tree, result_min_span_tree) = memory.cache(
611 _hdbscan_boruvka_kdtree)(X, min_samples, alpha,
612 metric, p, leaf_size,
/rds/user/jw983/hpc-work/pytorch-env2/lib/python3.9/site-packages/joblib/memory.py in __call__(self, *args, **kwargs)
350
351 def __call__(self, *args, **kwargs):
--> 352 return self.func(*args, **kwargs)
353
354 def call_and_shelve(self, *args, **kwargs):
/rds/user/jw983/hpc-work/pytorch-env2/lib/python3.9/site-packages/hdbscan/hdbscan_.py in _hdbscan_boruvka_kdtree(X, min_samples, alpha, metric, p, leaf_size, approx_min_span_tree, gen_min_span_tree, core_dist_n_jobs, **kwargs)
273
274 tree = KDTree(X, metric=metric, leaf_size=leaf_size, **kwargs)
--> 275 alg = KDTreeBoruvkaAlgorithm(tree, min_samples, metric=metric,
276 leaf_size=leaf_size // 3,
277 approx_min_span_tree=approx_min_span_tree,
hdbscan/_hdbscan_boruvka.pyx in hdbscan._hdbscan_boruvka.KDTreeBoruvkaAlgorithm.__init__()
hdbscan/_hdbscan_boruvka.pyx in hdbscan._hdbscan_boruvka.KDTreeBoruvkaAlgorithm._compute_bounds()
/rds/user/jw983/hpc-work/pytorch-env2/lib/python3.9/site-packages/joblib/parallel.py in __call__(self, iterable)
1052
1053 with self._backend.retrieval_context():
-> 1054 self.retrieve()
1055 # Make sure that we get a last message telling us we are done
1056 elapsed_time = time.time() - self._start_time
/rds/user/jw983/hpc-work/pytorch-env2/lib/python3.9/site-packages/joblib/parallel.py in retrieve(self)
931 try:
932 if getattr(self._backend, 'supports_timeout', False):
--> 933 self._output.extend(job.get(timeout=self.timeout))
934 else:
935 self._output.extend(job.get())
/rds/user/jw983/hpc-work/pytorch-env2/lib/python3.9/site-packages/joblib/_parallel_backends.py in wrap_future_result(future, timeout)
540 AsyncResults.get from multiprocessing."""
541 try:
--> 542 return future.result(timeout=timeout)
543 except CfTimeoutError as e:
544 raise TimeoutError from e
/usr/local/software/spack/spack-git/opt/spack/linux-rhel7-broadwell/gcc-5.4.0/python-3.9.6-sbr552hsx3zanhgi3ekdjp4rsn6o6ejq/lib/python3.9/concurrent/futures/_base.py in result(self, timeout)
443 raise CancelledError()
444 elif self._state == FINISHED:
--> 445 return self.__get_result()
446 else:
447 raise TimeoutError()
/usr/local/software/spack/spack-git/opt/spack/linux-rhel7-broadwell/gcc-5.4.0/python-3.9.6-sbr552hsx3zanhgi3ekdjp4rsn6o6ejq/lib/python3.9/concurrent/futures/_base.py in __get_result(self)
388 if self._exception:
389 try:
--> 390 raise self._exception
391 finally:
392 # Break a reference cycle with the exception in self._exception
BrokenProcessPool: A task has failed to un-serialize. Please ensure that the arguments of the function are all picklable.
@MaartenGr I think the issue happened here:
self.hdbscan_model.fit(umap_embeddings)
I am not familiar with this issue. There are some related issues here, here, and here that might be relevant to your problem. Just to be sure, did you change any other parameters in UMAP or HDBSCAN, such as core_dist_n_jobs or n_jobs? Also, have you tried training it with a subset of the data?
@MaartenGr Regarding a subset of the data: it seems that when the embedding size is less than 100,000, it runs fine. But if it is bigger than that, the above error arises.
UMAP can be quite expensive with its approximate nearest neighbor search. It is not surprising that, with 10 million sentences each represented by a high-dimensional vector, UMAP may take some time to complete. With respect to the CPU cores, I cannot give a definitive answer without understanding the machine and setup. Having said that, these questions seem to be rather specific to UMAP. I would advise you to post them on the issues page of UMAP as I am quite sure they could provide you with a much better answer than I can.
@MaartenGr I have found out the reason for the hang described in this issue.
It turns out it is stuck at UMAP's spectral initialisation stage, and it is very common for UMAP to get stuck in the spectral initialisation stage indefinitely if the graph is not connected.
The solution is to either set init='random', or to use a custom initialisation matrix and supply it to the init argument. A recommended initialisation matrix is a PCA-reduced matrix with its columns standardised to have a standard deviation of 1e-4.
It might be beneficial to have a section in the FAQ about this, as it is not a rare occurrence.
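A rough sketch of building such an initialisation matrix, assuming the document embeddings are already available as a NumPy array:
from sklearn.decomposition import PCA
from umap import UMAP

# embeddings: the precomputed document embeddings (n_documents x embedding_dim)
# PCA-based initialisation, rescaled so each column has a standard deviation of 1e-4
init = PCA(n_components=5).fit_transform(embeddings)
init = init / init.std(axis=0) * 1e-4

umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0,
                  metric='cosine', init=init, verbose=True)
umap_embeddings = umap_model.fit_transform(embeddings)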
@MaartenGr I also have fixed the issue in HDBSCAN. Please see the pull request: https://github.com/scikit-learn-contrib/hdbscan/pull/495
Originally, there was a bottleneck when processing large data due to the use of memoryviews in HDBSCAN's joblib calls. Now that I have changed the max_nbytes parameter, the bottleneck should be removed.
Another quick fix is to not use any multiprocessing at all. But this might slow things down a lot.
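Until that fix lands in a release, a hedged workaround in that spirit is to disable the parallel core-distance computation in HDBSCAN (this assumes your BERTopic version accepts a custom hdbscan_model):
from hdbscan import HDBSCAN
from bertopic import BERTopic

# core_dist_n_jobs=1 avoids the joblib worker processes that can trigger the
# "buffer source array is read-only" error, at the cost of slower clustering
hdbscan_model = HDBSCAN(min_cluster_size=10, metric='euclidean',
                        cluster_selection_method='eom', prediction_data=True,
                        core_dist_n_jobs=1)
topic_model = BERTopic(hdbscan_model=hdbscan_model, verbose=True)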
@ginward Great work on figuring out where the issue stems from. I am actually rather surprised that these issues are not mentioned more often. It seems that these packages are not often used on such large datasets, otherwise these errors would appear more frequently. Hopefully, the HDBSCAN pull request gets merged soon.
@ginward Glad to see the HDBSCAN bug was identified. The bug is still breaking my code; I can't get a workaround. Hopefully they accept the commit soon.
@MaartenGr @ginward For what it's worth, I do not get the BrokenProcessPool error when running BERTopic on my large dataset in Google Colab. I do get the error on my local Windows machine, which has 32GB of RAM and an RTX 2070 Super that I use to encode my documents.
@MaartenGr @sean-doody The pull request https://github.com/scikit-learn-contrib/hdbscan/pull/495 has been merged.
Glad to hear that the PR has been merged! Most likely, I will wait for an official PyPI release before pointing the requirements to an updated HDBSCAN version. However, I think I will put this fix in the FAQ for those that are getting this issue.
Due to inactivity, I'll be closing this issue for now. If, however, you want to continue the discussion or re-open the issue, feel free to reach out!
Hey Maarten, Firstly, thank you for all the help you have given up to this point! 👍 👍 👍 I want to visualise the top topics using the same logic you so nicely showed here https://github.com/MaartenGr/BERTopic/issues/126#issuecomment-855606679 - thank you for that. ❤️
However, I am a bit curious how one could feed a big dataset of sentences to the model without blowing up the memory. Can you suggest something? Like when we do this here:
topics, _ = topic_model.fit_transform(docs)
Like, how could one feed the sentences to the model? The intention in the end is to visualise the top topics, something you already showed here: https://github.com/MaartenGr/BERTopic/issues/126#issuecomment-855606679, to get a nice visualisation.
Thanks Maarten for everything 🙏