MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

Question on How to Predict Label for Single New Documents #696

Closed ynusinovich closed 1 year ago

ynusinovich commented 2 years ago

What is the correct way to predict a label for new documents if I have a fit topic_model? If I use transform on new documents, it always returns a label of -1. Is it more correct to use find_topics to predict a label for new documents?

My code:

from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
import pandas as pd

example_df = pd.read_csv("./example_data/example_df.csv", index_col = 0)

sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = sentence_model.encode(list(example_df["Notes"]), show_progress_bar=True)
topic_model = BERTopic(n_gram_range=(1, 2), min_topic_size = 3).fit(documents = list(example_df["Notes"]), embeddings = embeddings)

Afterwards, if I do topic_model.transform on the original list of documents, I get reasonable labels for each of them. However, if I do topic_model.transform on any other document(s), I always get -1 (even if I use just part of the documents from the example_df["Notes"] list, which I had trained on). Shouldn't topic_model.transform work for any new documents once it's already fit?

I tried topic_model.find_topics (without a custom embedding model) for comparison, and this does get reasonable labels for new documents. But is that the only way to label new documents, rather than a topic_model.transform?
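For reference, the find_topics workaround looks roughly like this, continuing from the snippet above (the new note text is just a placeholder):

# Treat a new document as a search query against the fitted topics;
# top_n=1 keeps only the closest topic.
new_note = "Example note text that was not in the training data"
similar_topics, similarities = topic_model.find_topics(new_note, top_n=1)
predicted_label = similar_topics[0]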

ynusinovich commented 2 years ago

To be more clear:

topic_model.transform(documents=list(example_df["Notes"]), embeddings=sentence_model.encode(list(example_df["Notes"]), show_progress_bar=True))
returns [3, 4, 3, 3, 3, 0, -1, 1, -1, 2, 2, 2, 2, 2, ...]

topic_model.transform(documents=list(example_df["Notes"])[:-2], embeddings=sentence_model.encode(list(example_df["Notes"])[:-2], show_progress_bar=True))
returns [-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, ...]

ynusinovich commented 2 years ago

One thing that worked for me was to replace HDBSCAN with KMeans, though I lost the ability for the model to automatically figure out the number of clusters. This may be related to #695, though my situation is slightly different:

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer

cluster_model = KMeans(n_clusters=10)
vectorizer_model = CountVectorizer(stop_words="english")
sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = sentence_model.encode(list(example_df["Notes"]), show_progress_bar=True)
topic_model = BERTopic(n_gram_range=(1, 2), min_topic_size=3, hdbscan_model=cluster_model,
                       vectorizer_model=vectorizer_model).fit(documents=list(example_df["Notes"]), embeddings=embeddings)

topic_model.transform(documents = list(example_df["Notes"])[:-2], embeddings = sentence_model.encode(list(example_df["Notes"])[:-2], show_progress_bar=True)) now returns [4, 3, 4, 4, 9, 5, 7, 0, 7, ...].

Please let me know if there is a way to topic_model.transform without removing HDBSCAN though. If there isn't, I can just close this one.

MaartenGr commented 2 years ago

At times, using HDBSCAN for unseen documents can be a bit tricky, as it does have a tendency to predict outliers. This can happen when only a limited amount of data is passed to the transform step, since the prediction is typically more accurate with more information. In practice, there are a number of things you can try. First, you can try to reduce the number of outliers when using HDBSCAN, as suggested here. Second, there might be an issue with your installation of HDBSCAN, as getting only -1 topics is rather strange; perhaps you have an older version that needs to be updated. Third, you could still use HDBSCAN to figure out a good number of clusters and then use k-Means with that number of clusters. Although the results would be different, there might still be significant overlap.
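As a rough sketch of the first suggestion (an illustration of the general idea, not necessarily the exact approach described in the linked page), you could fall back to the most probable topic whenever a prediction comes back as -1:

import numpy as np
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']

# Fit with calculate_probabilities=True so each document also gets a full
# topic-probability distribution next to its hard assignment.
topic_model = BERTopic(calculate_probabilities=True)
topics, probs = topic_model.fit_transform(docs)

# For unseen documents, replace the outlier label -1 with the most
# probable non-outlier topic.
new_topics, new_probs = topic_model.transform(docs[:100])
new_topics = [topic if topic != -1 else int(np.argmax(dist))
              for topic, dist in zip(new_topics, new_probs)]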

Also, do note that find_topics works differently from fit and the results that you get from the latter will not necessarily match the former.

wouterk1MSS commented 2 years ago

Hi - I came across this Q as I had the same problem, all -1s.

At the risk of missing something fundamental, why does new data have to be clustered? I naively assumed that my embeddings would be transformed the same way and compared to the clusters of the trained data. Shouldn't the default behaviour allow predicting the nearest cluster for a single input? The current assumption seems to be that every new dataset you use .transform on has exactly the same properties as the training data (distribution, relative closeness, etc.), so that the clustering results magically end up the same. That sounds unlikely? Is this behaviour the same for HDBSCAN vs. k-Means? Thanks for clarifying.

MaartenGr commented 2 years ago

@wouterk1MSS

At the risk of missing something fundamental, why does new data have to be clustered? I naively assumed that my embeddings would be transformed the same way and compared to the clusters of the trained data. Shouldn't the default behaviour allow predicting the nearest cluster for a single input?

In general, this highly depends on the clustering algorithm that you are using. Some algorithms treat clustering as something quite different from a classification task and therefore do not have .transform or .predict functions that allow unseen data to be assigned to existing clusters. Other algorithms assume that whenever new data comes in, it may change the structure of the clusters and might therefore adjust all existing clusters. You can find more about how HDBSCAN approaches this here.
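To illustrate how HDBSCAN itself handles unseen data, here is a standalone toy sketch (separate from BERTopic, using made-up points):

import numpy as np
import hdbscan

# Toy data: two well-separated blobs.
rng = np.random.RandomState(42)
train = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(10, 1, (100, 2))])

# prediction_data=True is required to call approximate_predict later on.
clusterer = hdbscan.HDBSCAN(min_cluster_size=10, prediction_data=True).fit(train)

# Unseen points are soft-matched against the existing cluster structure;
# points that do not fit any cluster well come back as -1 (outliers).
new_points = np.array([[0.5, 0.5], [50.0, 50.0]])
labels, strengths = hdbscan.approximate_predict(clusterer, new_points)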

The current assumption seems to be that every new dataset you use .transform on has exactly the same properties as the training data (distribution, relative closeness, etc.), so that the clustering results magically end up the same. That sounds unlikely?

This is not the case here. The underlying philosophy of BERTopic is that developers know best what kind of clustering approach works for their use case. HDBSCAN generates great out-of-the-box results but can be difficult to use with respect to predictions. If a cluster model with a more robust .transform or .predict function is needed for a specific use case, it can be swapped in without too many issues.

Hi - I came across this Q as I had the same problem, all -1s.

Although I am not entirely sure, this might be an issue with the environment that you are working in. Starting from a completely fresh environment might resolve the issue you are having.

More concretely, when I run the following, I get non-outlier topics:

from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

docs = fetch_20newsgroups(subset='all',  remove=('headers', 'footers', 'quotes'))['data']

topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)
pred_topics, pred_probs = topic_model.transform(docs[:-2])

So there might also be something else happening here since it seems to be working for me.

wouterk1MSS commented 2 years ago

Hi @MaartenGr - Thanks a lot for your response and clarification.

Regarding the unexpected -1 output: I've narrowed this down a bit. It seems to be due to instantiating the BERTopic class multiple times, though I don't see how the two models could affect each other. Basically:

topic_model1 = BERTopic()
topics1, probs1 = topic_model1.fit_transform(docs)
--> some result (looks ok)

topic_model2 = BERTopic()
topics2, probs2 = topic_model2.fit_transform(docs_new)
--> some result (looks ok)

topics3, probs3 = topic_model1.transform(docs_new)
--> only -1's

and if we then reinstantiate and refit the first:

topic_model1 = BERTopic()
topics3, probs3 = topic_model1.fit_transform(docs_new)
--> some result (looks ok)

This shouldn't happen, right?

MaartenGr commented 2 years ago

@wouterk1MSS From your code, it does not seem to be due to how you instantiate BERTopic but rather the difference between .fit_transform and .fit + .transform. As mentioned in the post above, this might be HDBSCAN-specific, as its predictions can differ from how it was trained. Moreover, there could be something wrong with your environment or installation of HDBSCAN. Could you try starting from a completely fresh environment? Also, could you try using a different model than HDBSCAN and see if that works?

ynusinovich commented 2 years ago

@MaartenGr thank you very much for the help. I checked, and only a couple of entries weren't -1 when I did .transform on a subset of the original data (as in the example above); the vast majority still were. The code you linked to about reducing the number of outliers didn't work either. However, two things did work: (1) using HDBSCAN to get the number of clusters and then using KMeans in my topic_model worked well, and (2) using .transform on completely new data (rather than a subset of the original data) with the same topic_model as in the example above worked and didn't return mostly -1. So I think the issue is solved. I can close it unless you're still working through things with the other user.

Code I ended up sticking with for now, as per (1):

from hdbscan import HDBSCAN
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer

# First fit: use HDBSCAN to find a reasonable number of clusters.
# (seed_topic_list is defined elsewhere and not shown here.)
cluster_model = HDBSCAN(min_cluster_size=10, metric='euclidean',
                        cluster_selection_method='eom', prediction_data=False, min_samples=5)
vectorizer_model = CountVectorizer(stop_words="english")
sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = sentence_model.encode(list(example_df["Notes"]), show_progress_bar=True)
topic_model = BERTopic(n_gram_range=(1, 2), min_topic_size=3, seed_topic_list=seed_topic_list,
                       hdbscan_model=cluster_model, vectorizer_model=vectorizer_model).fit(documents=list(example_df["Notes"]), embeddings=embeddings)

# Second fit: reuse that cluster count with KMeans, which has a robust .predict.
cluster_model = KMeans(n_clusters=topic_model.hdbscan_model.labels_.max() + 1)
vectorizer_model = CountVectorizer(stop_words="english")
sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = sentence_model.encode(list(example_df["Notes"]), show_progress_bar=True)
topic_model = BERTopic(n_gram_range=(1, 2), min_topic_size=3, seed_topic_list=seed_topic_list,
                       hdbscan_model=cluster_model, vectorizer_model=vectorizer_model).fit(documents=list(example_df["Notes"]), embeddings=embeddings)

topic_model.save("./models/topic_model_kmeans")
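For labeling new documents later, loading the saved model and calling .transform would look roughly like this (the path matches the snippet above; the new notes are placeholders):

from bertopic import BERTopic
from sentence_transformers import SentenceTransformer

# Load the saved model and label unseen notes.
loaded_model = BERTopic.load("./models/topic_model_kmeans")
sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
new_notes = ["A brand new note that was not in the training set",
             "Another unseen note"]
new_embeddings = sentence_model.encode(new_notes, show_progress_bar=True)
new_topics, new_probs = loaded_model.transform(documents=new_notes, embeddings=new_embeddings)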