MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

Embedding Space Dimension and HDBSCAN #183

Closed FraAmato closed 3 years ago

FraAmato commented 3 years ago

Hi Maarten, first and foremost, thank you very much for this amazing package!

I'm using it for my master’s thesis, and I would like to ask you some questions regarding some aspects of the package.

The first thing I would like to ask concerns the algorithm chosen for topic clustering: why did you choose the combination of UMAP and HDBSCAN, and this clustering algorithm specifically?

I would like to try applying a different clustering algorithm to the embedded documents before the dimensionality reduction: to do that, would it be sufficient to work on the embeddings provided by .extract_embeddings(), or has that space already been reduced by UMAP?

Moreover, could you give me a hint about how the dimension of the embedding space is decided by the algorithm (pointing me to a paper would be perfect)?

Thank you very much in advance!

MaartenGr commented 3 years ago

Thank you for the kind words!

There are several reasons for this combination. To start with, embeddings are typically high-dimensional (e.g., 300 dimensions), which makes clustering quite difficult. Even methods that support high dimensionality often work best at lower values. For instance, any clustering method with a cosine similarity measure can handle it to a certain extent, but it is by no means bullet-proof.

Thus, you can either use a clustering method that works well at high dimensionality, of which there aren't many, or you can approach the problem a bit differently. Often, you can reduce the dimensionality with methods like PCA or even t-SNE (although that is not recommended) and then apply a clustering algorithm. Here, we are only interested in clustering the documents, so it's okay if some information is lost when reducing dimensions.
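To make the idea concrete, here is a minimal sketch of the "reduce first, then cluster" approach. PCA and KMeans are used purely as illustrative stand-ins; this is not what BERTopic does internally, and the parameter values are arbitrary:

```python
# Minimal sketch: reduce dimensionality first, then cluster in the reduced space.
# PCA and KMeans are only examples; the cluster count and component count are arbitrary.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

embeddings = np.random.rand(1000, 384)  # placeholder for real document embeddings

# Reduce the high-dimensional embeddings to a handful of components
reduced = PCA(n_components=5).fit_transform(embeddings)

# Cluster in the reduced space; some information is lost, which is acceptable
# since we only need cluster assignments, not a faithful reconstruction
labels = KMeans(n_clusters=20, random_state=42).fit_predict(reduced)
```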

Then, there is the choice of algorithm. In my experience, HDBSCAN typically outperforms most algorithms (not always, of course) as it is rather flexible, also when it comes to the data it ingests. Moreover, a hierarchical clustering structure could help develop BERTopic further, as it could theoretically create sub-topics.

If you want to use a different algorithm, I would advise looking at the code below:

https://github.com/MaartenGr/BERTopic/blob/687d84612a62c5eeddcc089a4e29cbb4710ba23e/bertopic/_bertopic.py#L282

_cluster_embeddings is where you will find most of the HDBSCAN-related clustering.

Lastly, the dimension of the embedding space, if you are referring to UMAP, is decided by the user. The default is set at 5, which gives a nice balance between retaining some information and reducing to a lower dimensionality.
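For reference, a rough sketch of the UMAP-then-HDBSCAN step described above could look like the following. This is not BERTopic's exact internal code; apart from n_components=5 (the default mentioned above), the parameter values are illustrative:

```python
# Rough sketch of the UMAP -> HDBSCAN pipeline, not BERTopic's actual implementation.
import numpy as np
from umap import UMAP
import hdbscan

embeddings = np.random.rand(1000, 384)  # placeholder for real document embeddings

# Reduce to 5 dimensions before clustering (5 is the default mentioned in this thread)
umap_embeddings = UMAP(n_components=5, n_neighbors=15, metric="cosine").fit_transform(embeddings)

# Cluster with HDBSCAN; documents labeled -1 are treated as outliers
clusterer = hdbscan.HDBSCAN(min_cluster_size=15, metric="euclidean")
labels = clusterer.fit_predict(umap_embeddings)
```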

Hopefully, this answers your questions. If not, please let me know!

FraAmato commented 3 years ago

Thank you for the thorough and quick reply!

I understand the choice of reducing dimensionality and then operating on the reduced space for clustering. I was (and still am) doubtful about UMAP because of its stochastic nature, and I would like to try out some other dimensionality reduction techniques and compare them with it. Might I ask why PCA or t-SNE are not recommended?

Indeed, my idea was to compare different clustering techniques on both the reduced-dimensional and high-dimensional space.

Regarding the dimension of the embedding space, I am referring to the dimension of the embeddings output by topic_model._extract_embeddings, that is, the dimension that is subsequently reduced by UMAP. For example, my set of documents has embeddings of dimension 384, and I would like to make sense of that number (e.g., how and why the embeddings end up with roughly this dimension). Could you help me out?

MaartenGr commented 3 years ago

Being doubtful is good! A great start for solid research.

Perhaps I was unclear: PCA is generally fine, whereas t-SNE should definitely not be used for clustering purposes. If you read the original paper, the authors specifically state that the method is meant for visualization. They chose the Student's t-distribution, in part, because it visualizes the clusters nicely. However, this also means that the distances between points, and especially between clusters, are not accurate enough for clustering.

The dimension of the embeddings extracted by self._extract_embeddings depends on the embedding model that is used. Some models output higher dimensions than others, but they are typically of size 384 or 768. If you want to know why the embeddings have that size, I would advise looking at their point of origin, i.e., how the models were trained. The default models are sentence-transformers models, which can be found here.
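As a quick sanity check, you can inspect the output dimension of a sentence-transformers model directly. The model name below, all-MiniLM-L6-v2, is used only as an example of a 384-dimensional model and may not be the exact model that produced the embeddings in this thread:

```python
# Small sketch: the embedding dimension is fixed by the chosen sentence-transformers model.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # example of a 384-dimensional model
print(model.get_sentence_embedding_dimension())  # 384

embedding = model.encode("An example document.")
print(embedding.shape)  # (384,)
```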

MaartenGr commented 3 years ago

Since there is no follow-up, I'll close this for now. However, if you run into any issues, please let me know!