MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

All documents in Topic 0 #1361

Closed eelbeyi closed 1 year ago

eelbeyi commented 1 year ago

Hi,

I'm using cuML since I have a large dataset, around 1 million Reddit posts.

When I use standard methods and parameters as below, I get reasonably good results, but with too many outliers (around 50% of documents) and too many topics (300 was the best result I got after playing with the parameters).

However, when I try adding an extra method, function, or parameter, I get output with less noise and fewer topics (the noise drops to a couple of thousand documents, and the topic count falls below 30), but then almost all the documents are placed in Topic 0, so the topic model basically doesn't work.

I have experienced this same behavior with exactly the same dataset and the same parameters, but only in particular cases, for example when I add a random state or a cluster selection epsilon.

What might be the reason for that?

MaartenGr commented 1 year ago

> When I use standard methods and parameters as below

Could you share your full code? It is not entirely clear to me what parameters you are exactly using.

> but with too many outliers (around 50% of documents), and too many topics (300 was the best result I got after playing with parameters).

Note that HDBSCAN tends to generate quite a number of outliers. You can reduce them with `.reduce_outliers()`, which should be quite helpful in your case.
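To make the idea concrete, here is a minimal numpy-only sketch of what outlier reduction does: each document HDBSCAN labeled `-1` is reassigned to the topic whose centroid is most similar. This is illustrative only, not BERTopic's actual implementation, which offers several strategies (c-TF-IDF, embeddings, distributions):

```python
import numpy as np

def reassign_outliers(embeddings, labels):
    """Assign outlier documents (label -1) to the nearest topic centroid
    by cosine similarity. Illustrative sketch only -- BERTopic's
    .reduce_outliers supports several smarter strategies."""
    labels = np.asarray(labels)
    topics = sorted(t for t in set(labels) if t != -1)
    # One centroid per topic, averaged over its member embeddings
    centroids = np.vstack([embeddings[labels == t].mean(axis=0) for t in topics])

    def normalize(x):
        return x / np.linalg.norm(x, axis=1, keepdims=True)

    # Cosine similarity between every document and every topic centroid
    sims = normalize(embeddings) @ normalize(centroids).T
    new_labels = labels.copy()
    outliers = labels == -1
    # Each outlier gets the topic whose centroid is most similar
    new_labels[outliers] = np.array(topics)[sims[outliers].argmax(axis=1)]
    return new_labels
```

With BERTopic itself, the equivalent is `new_topics = topic_model.reduce_outliers(docs, topics)` followed by `topic_model.update_topics(docs, topics=new_topics)` to refresh the topic representations.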

With respect to the topics, this is, to a certain extent, expected behavior. With 1 million posts, I would definitely expect to have 300 topics at the very least. Topics can be viewed from different levels of abstraction and the fewer topics you want extracted from those 1 million posts, the more abstract they will be. Moreover, the fewer topics you want from those million posts, the "dirtier" these topics become if you minimize the number of outliers.

If you are looking for something without outliers and with few topics, then a clustering algorithm like k-Means would be better suited. However, I do think that HDBSCAN generally generates much more accurate representations due to its handling of outliers. Especially since you can use `.reduce_outliers` to remove all the outliers without impacting the topic representations, this seems to me like the ideal situation.
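To illustrate why k-Means produces no outliers: every document is always assigned to its nearest centroid, so no `-1` label can ever occur. A minimal Lloyd's-iteration sketch of this (in BERTopic itself you would instead pass e.g. scikit-learn's `KMeans` as the `hdbscan_model` argument):

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Plain Lloyd's algorithm. Unlike HDBSCAN, every point is always
    assigned to its nearest centroid, so there is no outlier label (-1)."""
    rng = np.random.default_rng(seed)
    # Initialize centroids from k distinct data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assign each point to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute centroids; keep the old one if a cluster emptied
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids
```

The trade-off Maarten describes follows directly: k-Means forces genuinely noisy documents into some cluster, which can dilute the topic representations, whereas HDBSCAN keeps them out.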

eelbeyi commented 1 year ago

Thank you very much for your kind and prompt reply, Maarten.

Here is the code I normally use:

```python
from sentence_transformers import SentenceTransformer
from cuml.manifold import UMAP
from cuml.cluster import HDBSCAN
from sklearn.feature_extraction.text import CountVectorizer
from bertopic import BERTopic
from bertopic.vectorizers import ClassTfidfTransformer

# `documents` is the list of ~1M Reddit posts; `vocab` is a precomputed vocabulary
embedding_model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
embeddings = embedding_model.encode(documents, show_progress_bar=True)

umap_model = UMAP(n_components=5, n_neighbors=25, metric="cosine", verbose=True)
reduced_embeddings = umap_model.fit_transform(embeddings)

hdbscan_model = HDBSCAN(min_samples=120, gen_min_span_tree=True,
                        prediction_data=True, min_cluster_size=120, verbose=True)
clusters = hdbscan_model.fit(reduced_embeddings).labels_

vectorizer_model = CountVectorizer(vocabulary=vocab, ngram_range=(1, 3),
                                   stop_words="english", max_features=10_000, min_df=20)
ctfidf_model = ClassTfidfTransformer(bm25_weighting=True, reduce_frequent_words=True)

topic_model = BERTopic(
    embedding_model=embedding_model,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    vectorizer_model=vectorizer_model,
    ctfidf_model=ctfidf_model,
    calculate_probabilities=True,
    verbose=True,
).fit(documents, embeddings=embeddings)
```

With this, the result I have is like this:

[screenshot of the resulting topic output]

Quite coincidentally, while playing with parameters and trying to make sense of how they correlate with the outputs, I realized that when I apply certain methods, functions, or parameters, I get totally different results (for example, a random state, cluster selection epsilon, or document representations). The most striking one was indeed random_state, which I never thought would affect the outputs of the model.

Here is an example where you can see what happens when I apply a representation model without changing anything else, using the same documents and the same UMAP and HDBSCAN models:

```python
from bertopic.representation import KeyBERTInspired
from bertopic import BERTopic

representation_model = KeyBERTInspired()

topic_model_rep = BERTopic(
    embedding_model=embedding_model,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    vectorizer_model=vectorizer_model,
    ctfidf_model=ctfidf_model,
    representation_model=representation_model,
    calculate_probabilities=True,
    verbose=True,
).fit(documents, embeddings=embeddings)
```

[screenshot of the topic output with the representation model applied]

It shows exactly the same behavior when I apply a random state or cluster selection epsilon; at least those are the ones I have noticed so far.

MaartenGr commented 1 year ago

> The most striking one was indeed random_state, which I never thought would affect the outputs of the model.

This relates to the first question in the FAQ.

Setting a random_state in UMAP is important, as it prevents stochastic behavior that can change the results dramatically when the parameters are chosen such that they sit on the "edge" between different clusterings.
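The effect can be reproduced with any stochastic procedure: without a fixed seed, two runs can diverge, while fixing the seed makes them identical. A toy illustration of why `UMAP(random_state=42)` makes BERTopic runs reproducible (a random projection stands in here for UMAP's internal randomness; it is not UMAP itself):

```python
import numpy as np

def random_projection(X, n_components, random_state=None):
    """Project X with a random matrix -- a stand-in for a stochastic
    dimensionality reduction like UMAP. The output depends on the RNG
    state unless random_state is fixed."""
    rng = np.random.default_rng(random_state)
    W = rng.normal(size=(X.shape[1], n_components))
    return X @ W

X = np.ones((4, 8))
a = random_projection(X, 2, random_state=42)  # seeded run
b = random_projection(X, 2, random_state=42)  # same seed: identical output
c = random_projection(X, 2)                   # unseeded: differs run to run
```

In BERTopic the same principle applies downstream: if UMAP's output shifts between runs, HDBSCAN can snap to a very different clustering, which is why a seemingly unrelated change can coincide with drastically different topics.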

MaartenGr commented 1 year ago

Closing this due to inactivity. Let me know if you want to re-open the issue!