MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

help sought to train a big data sentence model (up to 1.5 million sentences) #151

Closed schetudiante closed 1 year ago

schetudiante commented 3 years ago

Hey Maarten, firstly thank you for all the help you have given up till this point! 👍 👍 👍 I want to visualise the top topics using the same logic you so nicely showed here: https://github.com/MaartenGr/BERTopic/issues/126#issuecomment-855606679 - thank you for that. ❤️
However, I am a bit curious how one could feed a big dataset of sentences to the model without blowing up the memory. Can you suggest something? For example, when we call topics, _ = topic_model.fit_transform(docs), how could one feed the sentences to the model?

The intention in the end is to visualise the top topics, something you already showed in https://github.com/MaartenGr/BERTopic/issues/126#issuecomment-855606679, to get a nice visualisation.

Thanks Maarten for everything 🙏

MaartenGr commented 3 years ago

No problem, glad I could be of help!

There are several ways to perform computation with large datasets. First, you can set low_memory to True when instantiating BERTopic. This may prevent blowing up the memory in UMAP.

Second, setting calculate_probabilities to False when instantiating BERTopic prevents a huge document-topic probability matrix from being created. Moreover, HDBSCAN is quite slow when it tries to calculate probabilities on large datasets.
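
For reference, a minimal sketch of instantiating BERTopic with both of these settings (docs stands for your list of documents):

from bertopic import BERTopic

# low_memory reduces UMAP's memory footprint; calculate_probabilities=False skips the document-topic probability matrix
topic_model = BERTopic(low_memory=True, calculate_probabilities=False)
topics, _ = topic_model.fit_transform(docs)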

Third, you can set the minimum frequency of words in the CountVectorizer class to reduce the size of the resulting sparse c-TF-IDF matrix. You can do this as follows:

from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer

vectorizer_model = CountVectorizer(ngram_range=(1, 2), stop_words="english", min_df=10)
topic_model = BERTopic(vectorizer_model=vectorizer_model)

The min_df parameter is used to indicate the minimum frequency of words. Setting this value larger than 1 can significantly reduce memory.

Lastly, and this is a bit obvious, simply use a machine with more RAM available. Some machines are simply not meant to process such large datasets or memory-intensive algorithms, and using a bigger machine, if available, could help.

Also, make sure you do not actually visualize all 1.5 million points in the visualization I shared with you. Simply take a weighted sample across all topics (e.g., 10%) and visualize those. Otherwise, matplotlib might have some issues plotting all those points.

Hopefully, this helps a bit!

TigerShuai commented 3 years ago

Hello! I plan to train a model on millions of documents, but it takes too long. I tried splitting the texts into several smaller lists and training on them incrementally, but transform does not let me train the model incrementally. Is there any other way to do this?

MaartenGr commented 3 years ago

@TigerShuai Hopefully, Google Translate was accurate in translating your issue. It seems that you want to iteratively train a BERTopic model since you have too many documents that take too long to train.

Unfortunately, this is not supported and is unlikely to be supported in the future as the model performs best when you use all documents. Having said that, I would advise several things. First, make sure you use a strong GPU. This will speed up the training procedure quite a bit. Second, set low_memory=True if you are experiencing memory issues. Third, set calculate_probabilities=False as that is a very slow procedure. Finally, I would advise you to use verbose=True and see where the training slows down. If I know what takes so long, perhaps I can propose a solution!

schetudiante commented 3 years ago

Thank you Maarten for your graciousness! I am just unsure about two things:

  1. Could I save the trained model after training completes, and then 2. use the saved model (embeddings) to visualise the topics like you say here:

    Also, make sure you do not actually visualize all 1.5 million points in the visualization I shared with you. Simply take a weighted sample across all topics (e.g., 10%) and visualize those. Otherwise, matplotlib might have some issues plotting all those points.

Can you, if possible, give an example of how to save the (huge) model and visualise it like you say?

It would be an awesome thing (and I already feel like sending you a gift now :) You inspire us!

MaartenGr commented 3 years ago

You can save the model with:

from bertopic import BERTopic
topic_model = BERTopic().fit(docs)
topic_model.save("my_model")

Then, you can use the saved model to visualize the topics. You can find a bit more about saving and loading in the documentation here.

The part about the weighted sample is something you will have to code yourself. Adjust this code to only select a number of documents to visualize (e.g., 100,000 documents instead of 1.3 million). There, you need to add this df = df.sample(n=100_000) directly after df["topic"] = topics to visualize only a sample of the documents.
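
As a rough illustration of that sampling step, assuming embeddings_2d is an (n_docs, 2) array of reduced coordinates and topics is the list returned by fit_transform (the variable names and the plotting call are assumptions, not taken from the linked snippet):

import pandas as pd
import matplotlib.pyplot as plt

# Build the plotting DataFrame, then keep only a sample of the documents
df = pd.DataFrame(embeddings_2d, columns=["x", "y"])
df["topic"] = topics
df = df.sample(n=100_000)  # visualize a sample instead of all 1.5 million points
# alternatively, a weighted sample of roughly 10% per topic:
# df = df.groupby("topic").sample(frac=0.1)

plt.scatter(df.x, df.y, c=df.topic, s=0.05, cmap="hsv_r")
plt.show()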

schetudiante commented 3 years ago

Hey Maarten, thanks for your ever-graciousness, a big thank you for that once again! Can you give a pointer on how I can load the documents into the model? I have 1.5 million sentences (documents) which are in a text file (or I can put them in multiple text files), and I see that the model takes a single argument for the documents, like here: topics, _ = topic_model.fit_transform(docs)

What I mean is: before, I would simply load the documents (up to, say, 32,000 of them) into a list and pass that list as the docs argument. I wonder if the same may be done for all 1.5 million documents, or maybe you can suggest some other way.

still waiting on this btw,

It would be an awesome thing (and I already feel like sending you a gift now :) (https://github.com/MaartenGr/BERTopic/issues/151#issuecomment-870529679)

MaartenGr commented 3 years ago

Personally, I would simply put all those 1.5 million sentences (documents) in a list and then pass that list as the docs argument. If you have enough RAM available, this should be no issue. If, however, you run into memory issues, then I would advise you to look here for a few tips on how to run BERTopic on large data.
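
For example, a minimal sketch of reading all sentences from a text file into one list and fitting on them (the file name is just a placeholder):

from bertopic import BERTopic

# One sentence (document) per line
with open("sentences.txt", encoding="utf-8") as f:
    docs = [line.strip() for line in f if line.strip()]

topic_model = BERTopic(low_memory=True, calculate_probabilities=False)
topics, _ = topic_model.fit_transform(docs)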

It would be an awesome thing (and I already feel like sending you a gift now :) (#151 (comment))

Don't worry about that! I'm just glad that I can help out.

ginward commented 2 years ago

Is it possible to fit first and then transform the documents in small chunks (i.e. not use fit_transform, but call fit first and then call transform on smaller chunks of data)? @MaartenGr

If I have 1.5 million of sentences, for example, can I fit with all 1.5 million sentences and then transform 500k sentences at a time for 3 times?

MaartenGr commented 2 years ago

@ginward You can definitely fit the model once on a subset of the data and simply transform for all others. Typically, you can get away with a few hundred thousand documents. You really do not need to train on millions of sentences to improve the model as sufficient data is most likely already given.

Thus, you can fit on 200,000 sentences and simply predict the other 1.3 million sentences.

The only thing that you should take into account is selecting those 200,000 sentences. If you are looking for very specific topics that are likely to only appear a few thousand times, then there is a good chance that you will not capture those in the model. Thus, proper sampling here is key.
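
As a rough sketch of that fit/predict split (the plain random sample here is only illustrative; as noted above, proper, representative sampling is key):

import random
from bertopic import BERTopic

train_docs = random.sample(docs, 200_000)  # subset used for fitting; sample carefully in practice

topic_model = BERTopic(low_memory=True, calculate_probabilities=False)
topic_model.fit(train_docs)

# Predict topics for all (or the remaining) documents with the fitted model
topics, _ = topic_model.transform(docs)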

ginward commented 2 years ago

@MaartenGr I see. So should I call fit on the 200,000 sentences, and then call transform on the 1.3 million sentences?

ginward commented 2 years ago

But if transform takes a lot of memory, can I transform in smaller chunks (such as several 200,000-sentence chunks summing up to 1.3 million sentences), and then combine the results together?

MaartenGr commented 2 years ago

Yes, you can fit on the 200,000 sentences and then call transform on the remaining 1.3 million sentences, as long as you are sure that the 200,000 sentences are a good representation of the remaining 1.3 million.

The fit stage can take a lot of memory, whereas the transform stage requires much less. It should be okay to transform them all at once. However, if you are still experiencing memory issues, there should be no issue in separating them into smaller chunks and combining the results.
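
If you do need to chunk the transform step, a minimal sketch could look as follows (remaining_docs stands for whichever documents you still need to predict with the fitted topic_model):

# Transform in chunks of 200k documents and concatenate the topic predictions
chunk_size = 200_000
all_topics = []
for i in range(0, len(remaining_docs), chunk_size):
    chunk_topics, _ = topic_model.transform(remaining_docs[i:i + chunk_size])
    all_topics.extend(chunk_topics)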

ginward commented 2 years ago

@MaartenGr I see. Is it possible to further reduce the memory usage by tuning the hyperparameters of UMAP (i.e. reducing the dimensionality of the document embeddings further) or HDBSCAN (fewer clusters)? And why did you use HDBSCAN for nearest neighbours rather than a k-nearest-neighbours algorithm?

ginward commented 2 years ago

For example, if k-means consumes less memory, can we use k-means instead of HDBSCAN?

ginward commented 2 years ago

It seems that the memory issues occur not in the sentence embedding stage or the UMAP stage, but in the HDBSCAN stage. I currently have about 10 million short sentences. I think it is in the final stage that the memory usage shoots up.

MaartenGr commented 2 years ago

There are a few tricks you can do with respect to UMAP and HDBSCAN, which are outlined here. In practice, there are a number of places where memory consumption may increase (UMAP, HDBSCAN, c-TF-IDF, etc.) that you can optimize with the tips in the link above.

Swapping out HDBSCAN for k-Means will result in a significantly less accurate model. There are quite a few benefits to HDBSCAN over k-Means, including outlier detection, its hierarchical nature, density efficiency, etc.

With 10 million sentences, I would advise not to try to optimize the algorithms but focus on the implementation of BERTopic. Like you mentioned, fit on a subset and predict for all others.

ginward commented 2 years ago

@MaartenGr What if I reduce the UMAP reduced dimension to 2 (in the source code it was set to five originally)? Would that relieve some of the burden that HDBSCAN bears?

MaartenGr commented 2 years ago

That would likewise reduce the quality of the model and is not something I would recommend. Since you have millions of data points, I would instead advise not training on the entire dataset, in order to lower the memory requirements. That seems to be the most efficient way of handling this without the need to optimize/change/adapt the sub-algorithms.

ginward commented 2 years ago

@MaartenGr Thanks. What is the maximum number of sentences that the model can handle from your experience?

MaartenGr commented 2 years ago

This is a difficult question to answer since it highly depends on your hardware specs. A free Google Colab session handles a couple of hundred thousand sentences without issues but runs into trouble when you approach a million. However, there are plenty of organizations (including where I work currently) that can handle a couple of million sentences without any problems.

Also, it depends on the length of the sentences, the number of words, vocabulary size, etc.

ginward commented 2 years ago

@MaartenGr Is there also a way to separate the process of sentence embedding, UMAP and HDBSCAN by saving the intermediary models? If the memory blows up at the last stage (HDBSCAN), I would need to re-do the sentence embedding and UMAP parts, and it is going to take another few hours.

ginward commented 2 years ago

I currently have a Colab Pro+ subscription with 55GB RAM, and the model seems to work through the sentence embedding stage and the UMAP stage quite well, but dies at the very last stage, HDBSCAN.

I also have HPC access to a machine with 4 GPUs and 96GB RAM in total, but I can only use one GPU and it still blows up at the very last stage.

ginward commented 2 years ago

I am not sure if a single GPU card can use all the 96GB RAM available in the machine, as the other 48GB is in the other three GPU cards. But nevertheless, the model still blows up at the last stage. @MaartenGr

MaartenGr commented 2 years ago

You can try to embed the sentences beforehand by following this piece of documentation. After that, you can simply save the embeddings and load them in when necessary. There currently is not an implementation for UMAP.
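
A rough sketch of that workflow, assuming you pass the pre-computed embeddings to fit_transform as described in that documentation (the file name and model choice here are just examples):

from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
import numpy as np

# Embed once (ideally on the GPU machine) and save the result to disk
embedding_model = SentenceTransformer("paraphrase-MiniLM-L3-v2")
embeddings = embedding_model.encode(docs, show_progress_bar=True)
np.save("embeddings.npy", embeddings)

# Later, for example on a CPU-only machine, reuse the saved embeddings
embeddings = np.load("embeddings.npy")
topic_model = BERTopic(low_memory=True, calculate_probabilities=False)
topics, _ = topic_model.fit_transform(docs, embeddings)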

I currently have a Colab Pro+ subscription with 55GB memory, and the model seems to work through the sentence embedding stage and the UMAP stage quite well, but dies at the very last stage, HDBSCAN. I also have HPC access to a machine with 4 GPUs and 96GB memory in total, but I can only use one GPU and it still blows up at the very last stage.

The "last stage" technically is not HDBSCAN but topic extraction with c-TF-IDF and MMR. Having said that, I cannot judge what exactly is happening here without knowing the code you are using. Could you share the code for training BERTopic? Also, if you have set verbose=True, what has it printed until you get the memory issues?

I also have HPC access to a machine with 4 GPUs and 96GB RAM in total, but I can only use one GPU and it still blows up at the very last stage.

Do you mean VRAM or RAM? HDBSCAN is not gpu-accelerated.

ginward commented 2 years ago

@MaartenGr If only the sentence transformer part is done on GPU, can I train the embeddings first and then run the other parts on a machine with only CPU access? I have a machine with 128 GB CPU RAM, which might just work OK.

ginward commented 2 years ago

@MaartenGr It is 96GB RAM and 16GB VRAM. Apparently 96GB RAM is not enough to get the 10 million sentences done.

I am using a customised dataset, but the code is here:

from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer
from umap import UMAP
from hdbscan import HDBSCAN

# Vectorizer with a high minimum document frequency to shrink the c-TF-IDF matrix
vectorizer_model = CountVectorizer(ngram_range=(1, 2), stop_words="english", min_df=20)

# Setting UMAP model
umap_model = UMAP(n_neighbors=15, n_components=3, min_dist=0.0, metric='cosine', low_memory=True)

# Setting HDBSCAN model
hdbscan_model = HDBSCAN(min_cluster_size=10, umap_model=umap_model, metric='euclidean', cluster_selection_method='eom', prediction_data=True)

topic_model = BERTopic(verbose=True, seed_topic_list=seed_topic_list, embedding_model="paraphrase-MiniLM-L3-v2", low_memory=True, calculate_probabilities=False, vectorizer_model=vectorizer_model)

I have set the min_df=20, which is a very large threshold.

ginward commented 2 years ago

@MaartenGr Would setting ngram_range=(1, 1) help though? It might reduce the TF-IDF matrix size.

MaartenGr commented 2 years ago

Setting ngram_range=(1,1) would help but reduces the ease of interpretation and topic representation quality, since 2-grams often give interesting insights. For millions of sentences, min_df=20 isn't actually a very large threshold. I think it should pose no issue to set it to at least 100. If you have millions of sentences, then the frequencies of words in your vocab tend to be quite large.

If only the sentence transformer part is done on GPU, can I train the embeddings first and then run the other parts on a machine with only CPU access? I have a machine with 128 GB CPU RAM, which might just work OK.

Yes, only the embedding part benefits from having a GPU.

ginward commented 2 years ago

I think it crashed at the UMAP stage @MaartenGr :


---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)
/tmp/ipykernel_159881/2313223528.py in <module>
      1 #topics, probs = topic_model.fit_transform(docs)
      2 
----> 3 topic_model = topic_model.fit(docs, embeddings)

/rds/user/jw983/hpc-work/bertcpu_env/lib/python3.9/site-packages/bertopic/_bertopic.py in fit(self, documents, embeddings, y)
    210         ```
    211         """
--> 212         self.fit_transform(documents, embeddings, y)
    213         return self
    214 

/rds/user/jw983/hpc-work/bertcpu_env/lib/python3.9/site-packages/bertopic/_bertopic.py in fit_transform(self, documents, embeddings, y)
    286         if self.seed_topic_list is not None and self.embedding_model is not None:
    287             y, embeddings = self._guided_topic_modeling(embeddings)
--> 288         umap_embeddings = self._reduce_dimensionality(embeddings, y)
    289 
    290         # Cluster UMAP embeddings with HDBSCAN

/rds/user/jw983/hpc-work/bertcpu_env/lib/python3.9/site-packages/bertopic/_bertopic.py in _reduce_dimensionality(self, embeddings, y)
   1364                                    low_memory=self.low_memory).fit(embeddings, y=y)
   1365         else:
-> 1366             self.umap_model.fit(embeddings, y=y)
   1367         umap_embeddings = self.umap_model.transform(embeddings)
   1368         logger.info("Reduced dimensionality with UMAP")

/rds/user/jw983/hpc-work/bertcpu_env/lib/python3.9/site-packages/umap/umap_.py in fit(self, X, y)
   2551 
   2552         if self.transform_mode == "embedding":
-> 2553             self.embedding_, aux_data = self._fit_embed_data(
   2554                 self._raw_data[index], n_epochs, init, random_state,  # JH why raw data?
   2555             )

/rds/user/jw983/hpc-work/bertcpu_env/lib/python3.9/site-packages/umap/umap_.py in _fit_embed_data(self, X, n_epochs, init, random_state)
   2578         replaced by subclasses.
   2579         """
-> 2580         return simplicial_set_embedding(
   2581             X,
   2582             self.graph_,

/rds/user/jw983/hpc-work/bertcpu_env/lib/python3.9/site-packages/umap/umap_.py in simplicial_set_embedding(data, graph, n_components, initial_alpha, a, b, gamma, negative_sample_rate, n_epochs, init, random_state, metric, metric_kwds, densmap, densmap_kwds, output_dens, output_metric, output_metric_kwds, euclidean_output, parallel, verbose)
   1052     elif isinstance(init, str) and init == "spectral":
   1053         # We add a little noise to avoid local minima for optimization to come
-> 1054         initialisation = spectral_layout(
   1055             data,
   1056             graph,

/rds/user/jw983/hpc-work/bertcpu_env/lib/python3.9/site-packages/umap/spectral.py in spectral_layout(data, graph, dim, random_state, metric, metric_kwds)
    299 
    300     if n_components > 1:
--> 301         return multi_component_layout(
    302             data,
    303             graph,

/rds/user/jw983/hpc-work/bertcpu_env/lib/python3.9/site-packages/umap/spectral.py in multi_component_layout(data, graph, n_components, component_labels, dim, random_state, metric, metric_kwds)
    236         num_lanczos_vectors = max(2 * k + 1, int(np.sqrt(component_graph.shape[0])))
    237         try:
--> 238             eigenvalues, eigenvectors = scipy.sparse.linalg.eigsh(
    239                 L,
    240                 k,

/rds/user/jw983/hpc-work/bertcpu_env/lib/python3.9/site-packages/scipy/sparse/linalg/eigen/arpack/arpack.py in eigsh(A, k, M, sigma, which, v0, ncv, maxiter, tol, return_eigenvectors, Minv, OPinv, mode)
   1682             raise ValueError("unrecognized mode '%s'" % mode)
   1683 
-> 1684     params = _SymmetricArpackParams(n, k, A.dtype.char, matvec, mode,
   1685                                     M_matvec, Minv_matvec, sigma,
   1686                                     ncv, v0, maxiter, which, tol)

/rds/user/jw983/hpc-work/bertcpu_env/lib/python3.9/site-packages/scipy/sparse/linalg/eigen/arpack/arpack.py in __init__(self, n, k, tp, matvec, mode, M_matvec, Minv_matvec, sigma, ncv, v0, maxiter, which, tol)
    510             raise ValueError("k must be less than ndim(A), k=%d" % k)
    511 
--> 512         _ArpackParams.__init__(self, n, k, tp, mode, sigma,
    513                                ncv, v0, maxiter, which, tol)
    514 

/rds/user/jw983/hpc-work/bertcpu_env/lib/python3.9/site-packages/scipy/sparse/linalg/eigen/arpack/arpack.py in __init__(self, n, k, tp, mode, sigma, ncv, v0, maxiter, which, tol)
    340         ncv = min(ncv, n)
    341 
--> 342         self.v = np.zeros((n, ncv), tp)  # holds Ritz vectors
    343         self.iparam = np.zeros(11, arpack_int)
    344 

MemoryError: Unable to allocate 294. GiB for an array with shape (11580087, 3402) and data type float64

ginward commented 2 years ago

@MaartenGr What I don't understand is why the dimensionality of the UMAP matrix is (11580087, 3402). I understand that 11580087 is the document size, but shouldn't the size of the embedding be 382? How come it is 3402?

ginward commented 2 years ago

I also see here that the official website recommends the following procedure:

Consider a typical pipeline: high-dimensional embedding (300+) => PCA to reduce to 50 dimensions => UMAP to reduce to 10-20 dimensions => HDBSCAN for clustering / some plain algorithm for classification.

This means adding a PCA layer before the UMAP layer. Would that sufficiently reduce memory usage? @MaartenGr

MaartenGr commented 2 years ago

For the specifics regarding UMAP and its intermediate steps, I would refer you to the corresponding issues page of UMAP.

Introducing PCA into the pipeline is actually likely to degrade performance. Although PCA can remove noise which might actually benefit UMAP, there is a high possibility of removing too much information in the reduction process. For example, if you have embeddings of size 368, then what should the reduced dimensionality be before applying UMAP? And what if the initial dimensionality increases? In other words, it becomes difficult to guarantee some stability of BERTopic.

Having said that, you are free to try it out yourself. However, a smaller dataset, which is already quite large, should already be more than sufficient in capturing the necessary information/clusters.

ginward commented 2 years ago

@MaartenGr I have now found a machine with 1TB of RAM, and it is no longer throwing memory errors on my 10 million sentences. However, it seems that it takes a very long time to run the UMAP step, even if I use PCA to reduce the dimensionality to 50 before UMAP. Any idea why?

ginward commented 2 years ago

I also have 128 CPU cores on the machine, but it seems that most CPU resources are just idle.

MaartenGr commented 2 years ago

UMAP can be quite expensive with its approximate nearest neighbor search. It is not surprising that, with 10 million sentences each represented by a high-dimensional vector, UMAP may take some time to complete. With respect to the CPU cores, I cannot give a definitive answer without understanding the machine and setup. Having said that, these questions seem to be rather specific to UMAP. I would advise you to post them on the issues page of UMAP as I am quite sure they could provide you with a much better answer than I can.

ginward commented 2 years ago

@MaartenGr After setting init to random in UMAP, I seem to have passed UMAP's initialisation stage. However, it throws the following error in HDBSCAN now:

Any idea why?

This is related to this issue.

UMAP(angular_rp_forest=True, dens_frac=0.0, dens_lambda=0.0, init='random',
     low_memory=False, metric='cosine', min_dist=0.0, n_components=5,
     verbose=True)
Construct fuzzy simplicial set
Fri Oct  1 05:16:34 2021 Finding Nearest Neighbors
Fri Oct  1 05:16:36 2021 Building RP forest with 64 trees
Fri Oct  1 05:24:38 2021 NN descent for 23 iterations
     1  /  23
     2  /  23
     3  /  23
     4  /  23
     5  /  23
     6  /  23
     7  /  23
     8  /  23
     9  /  23
     10  /  23
    Stopping threshold met -- exiting after 10 iterations
Fri Oct  1 05:56:27 2021 Finished Nearest Neighbor Search
Fri Oct  1 05:57:32 2021 Construct embedding
    completed  0  /  200 epochs
    completed  20  /  200 epochs
    completed  40  /  200 epochs
    completed  60  /  200 epochs
    completed  80  /  200 epochs
    completed  100  /  200 epochs
    completed  120  /  200 epochs
    completed  140  /  200 epochs
    completed  160  /  200 epochs
    completed  180  /  200 epochs
Fri Oct  1 06:31:02 2021 Finished embedding
2021-10-01 06:31:57,225 - BERTopic - Reduced dimensionality with UMAP
---------------------------------------------------------------------------
_RemoteTraceback                          Traceback (most recent call last)
_RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/rds/user/jw983/hpc-work/pytorch-env2/lib/python3.9/site-packages/joblib/externals/loky/process_executor.py", line 404, in _process_worker
    call_item = call_queue.get(block=True, timeout=timeout)
  File "/usr/local/software/spack/spack-git/opt/spack/linux-rhel7-broadwell/gcc-5.4.0/python-3.9.6-sbr552hsx3zanhgi3ekdjp4rsn6o6ejq/lib/python3.9/multiprocessing/queues.py", line 122, in get
    return _ForkingPickler.loads(res)
  File "sklearn/neighbors/_binary_tree.pxi", line 1057, in sklearn.neighbors._kd_tree.BinaryTree.__setstate__
  File "sklearn/neighbors/_binary_tree.pxi", line 999, in sklearn.neighbors._kd_tree.BinaryTree._update_memviews
  File "stringsource", line 658, in View.MemoryView.memoryview_cwrapper
  File "stringsource", line 349, in View.MemoryView.memoryview.__cinit__
ValueError: buffer source array is read-only
"""

The above exception was the direct cause of the following exception:

BrokenProcessPool                         Traceback (most recent call last)
/tmp/ipykernel_778601/2313223528.py in <module>
      1 #topics, probs = topic_model.fit_transform(docs)
      2 
----> 3 topic_model = topic_model.fit(docs, embeddings)

/rds/user/jw983/hpc-work/pytorch-env2/lib/python3.9/site-packages/bertopic/_bertopic.py in fit(self, documents, embeddings, y)
    210         ```
    211         """
--> 212         self.fit_transform(documents, embeddings, y)
    213         return self
    214 

/rds/user/jw983/hpc-work/pytorch-env2/lib/python3.9/site-packages/bertopic/_bertopic.py in fit_transform(self, documents, embeddings, y)
    289 
    290         # Cluster UMAP embeddings with HDBSCAN
--> 291         documents, probabilities = self._cluster_embeddings(umap_embeddings, documents)
    292 
    293         # Sort and Map Topic IDs by their frequency

/rds/user/jw983/hpc-work/pytorch-env2/lib/python3.9/site-packages/bertopic/_bertopic.py in _cluster_embeddings(self, umap_embeddings, documents)
   1384             probabilities: The distribution of probabilities
   1385         """
-> 1386         self.hdbscan_model.fit(umap_embeddings)
   1387         documents['Topic'] = self.hdbscan_model.labels_
   1388         probabilities = self.hdbscan_model.probabilities_

/rds/user/jw983/hpc-work/pytorch-env2/lib/python3.9/site-packages/hdbscan/hdbscan_.py in fit(self, X, y)
    917          self._condensed_tree,
    918          self._single_linkage_tree,
--> 919          self._min_spanning_tree) = hdbscan(X, **kwargs)
    920 
    921         if self.prediction_data:

/rds/user/jw983/hpc-work/pytorch-env2/lib/python3.9/site-packages/hdbscan/hdbscan_.py in hdbscan(X, min_cluster_size, min_samples, alpha, cluster_selection_epsilon, metric, p, leaf_size, algorithm, memory, approx_min_span_tree, gen_min_span_tree, core_dist_n_jobs, cluster_selection_method, allow_single_cluster, match_reference_implementation, **kwargs)
    608                                            gen_min_span_tree, **kwargs)
    609             else:
--> 610                 (single_linkage_tree, result_min_span_tree) = memory.cache(
    611                     _hdbscan_boruvka_kdtree)(X, min_samples, alpha,
    612                                              metric, p, leaf_size,

/rds/user/jw983/hpc-work/pytorch-env2/lib/python3.9/site-packages/joblib/memory.py in __call__(self, *args, **kwargs)
    350 
    351     def __call__(self, *args, **kwargs):
--> 352         return self.func(*args, **kwargs)
    353 
    354     def call_and_shelve(self, *args, **kwargs):

/rds/user/jw983/hpc-work/pytorch-env2/lib/python3.9/site-packages/hdbscan/hdbscan_.py in _hdbscan_boruvka_kdtree(X, min_samples, alpha, metric, p, leaf_size, approx_min_span_tree, gen_min_span_tree, core_dist_n_jobs, **kwargs)
    273 
    274     tree = KDTree(X, metric=metric, leaf_size=leaf_size, **kwargs)
--> 275     alg = KDTreeBoruvkaAlgorithm(tree, min_samples, metric=metric,
    276                                  leaf_size=leaf_size // 3,
    277                                  approx_min_span_tree=approx_min_span_tree,

hdbscan/_hdbscan_boruvka.pyx in hdbscan._hdbscan_boruvka.KDTreeBoruvkaAlgorithm.__init__()

hdbscan/_hdbscan_boruvka.pyx in hdbscan._hdbscan_boruvka.KDTreeBoruvkaAlgorithm._compute_bounds()

/rds/user/jw983/hpc-work/pytorch-env2/lib/python3.9/site-packages/joblib/parallel.py in __call__(self, iterable)
   1052 
   1053             with self._backend.retrieval_context():
-> 1054                 self.retrieve()
   1055             # Make sure that we get a last message telling us we are done
   1056             elapsed_time = time.time() - self._start_time

/rds/user/jw983/hpc-work/pytorch-env2/lib/python3.9/site-packages/joblib/parallel.py in retrieve(self)
    931             try:
    932                 if getattr(self._backend, 'supports_timeout', False):
--> 933                     self._output.extend(job.get(timeout=self.timeout))
    934                 else:
    935                     self._output.extend(job.get())

/rds/user/jw983/hpc-work/pytorch-env2/lib/python3.9/site-packages/joblib/_parallel_backends.py in wrap_future_result(future, timeout)
    540         AsyncResults.get from multiprocessing."""
    541         try:
--> 542             return future.result(timeout=timeout)
    543         except CfTimeoutError as e:
    544             raise TimeoutError from e

/usr/local/software/spack/spack-git/opt/spack/linux-rhel7-broadwell/gcc-5.4.0/python-3.9.6-sbr552hsx3zanhgi3ekdjp4rsn6o6ejq/lib/python3.9/concurrent/futures/_base.py in result(self, timeout)
    443                     raise CancelledError()
    444                 elif self._state == FINISHED:
--> 445                     return self.__get_result()
    446                 else:
    447                     raise TimeoutError()

/usr/local/software/spack/spack-git/opt/spack/linux-rhel7-broadwell/gcc-5.4.0/python-3.9.6-sbr552hsx3zanhgi3ekdjp4rsn6o6ejq/lib/python3.9/concurrent/futures/_base.py in __get_result(self)
    388         if self._exception:
    389             try:
--> 390                 raise self._exception
    391             finally:
    392                 # Break a reference cycle with the exception in self._exception

BrokenProcessPool: A task has failed to un-serialize. Please ensure that the arguments of the function are all picklable.

MaartenGr commented 2 years ago

I am not familiar with this issue. There are some related issues here, here, and here that might be relevant to your problem. Just to be sure, did you change any other parameters in UMAP or HDBSCAN, such as core_dist_n_jobs or n_jobs? Also, have you tried training it with a subset of the data?

ginward commented 2 years ago

@MaartenGr I think the issue happened here:

self.hdbscan_model.fit(umap_embeddings)

ginward commented 2 years ago

@MaartenGr Regarding a subset of the data: it seems that when the embedding size is less than 100000, it runs fine. But if it is bigger than that, the above error arises.

ginward commented 2 years ago

@MaartenGr I have found out the reason for the hang in this issue.

It turns out it gets stuck at UMAP's spectral initialisation stage, and it is very common for UMAP to get stuck in the spectral initialisation stage indefinitely if the graph is not connected.

The solution is to either set init='random', or use a custom initialisation matrix and supply it to the init argument. A recommended initialisation matrix is a PCA-reduced matrix with columns standardised to have a 1e-4 standard deviation.
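
A minimal sketch of both workarounds, assuming embeddings holds the pre-computed document embeddings (n_components, min_dist, and metric mirror the verbose UMAP output above):

from umap import UMAP
from sklearn.decomposition import PCA

# Option 1: skip the spectral layout entirely
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', init='random')

# Option 2: initialise from a PCA projection, rescaled so each column has a 1e-4 standard deviation
pca_init = PCA(n_components=5).fit_transform(embeddings)
pca_init = pca_init / pca_init.std(axis=0) * 1e-4
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', init=pca_init)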

It might be beneficial to have a section in the FAQ about this, as it is not a rare occurrence.

ginward commented 2 years ago

@MaartenGr I also have fixed the issue in HDBSCAN. Please see the pull request: https://github.com/scikit-learn-contrib/hdbscan/pull/495

Originally, there was a bottleneck in processing large data due to the usage of memoryviews in HDBSCAN's joblib calls. Now that I have changed the max_nbytes parameter, the bottleneck should be removed.

Another quick fix is to not use any multiprocessing at all. But this might slow things down a lot.

MaartenGr commented 2 years ago

@ginward Great work on figuring out where the issue stems from. I am actually rather surprised that these issues are not mentioned more often. It seems that these packages are not often used on such large datasets, otherwise these errors would appear more frequently. Hopefully, the HDBSCAN pull request gets merged soon.

sean-doody commented 2 years ago

@ginward Glad to see the HDBSCAN bug was identified. The bug is still breaking my code; I can't get a workaround. Hopefully they accept the commit soon.

sean-doody commented 2 years ago

@MaartenGr @ginward For what it's worth, I do not get the BrokenProcessPool error when running BERTopic on my large dataset in Google Colab. I do get the error on my local Windows machine, which has 32GB of RAM and an RTX 2070 Super that I use to encode my documents.

ginward commented 2 years ago

@MaartenGr @sean-doody The pull request https://github.com/scikit-learn-contrib/hdbscan/pull/495 has been merged.

MaartenGr commented 2 years ago

Glad to hear that the PR has been merged! Most likely, I will wait for an official PyPI release before pointing the requirements to an updated HDBSCAN version. However, I think I will put this fix in the FAQ for those that are getting this issue.

MaartenGr commented 1 year ago

Due to inactivity, I'll be closing this issue for now. If, however, you want to continue the discussion or re-open the issue, feel free to reach out!