Closed ynusinovich closed 1 year ago
To be more clear:
topic_model.transform(documents = list(example_df["Notes"]), embeddings = sentence_model.encode(list(example_df["Notes"]), show_progress_bar=True))
returns [3, 4, 3, 3, 3, 0, -1, 1, -1, 2, 2, 2, 2, 2, ...]
topic_model.transform(documents = list(example_df["Notes"])[:-2], embeddings = sentence_model.encode(list(example_df["Notes"])[:-2], show_progress_bar=True))
returns [-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, ...]
One thing that worked for me was to replace HDBSCAN with KMeans, though I lost the ability for the model to automatically figure out the number of clusters. This solution may be related to #695 though mine is a slightly different situation:
from bertopic import BERTopic
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer
from sentence_transformers import SentenceTransformer

cluster_model = KMeans(n_clusters=10)
vectorizer_model = CountVectorizer(stop_words="english")
sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = sentence_model.encode(list(example_df["Notes"]), show_progress_bar=True)
topic_model = BERTopic(n_gram_range=(1, 2), min_topic_size=3, hdbscan_model=cluster_model,
                       vectorizer_model=vectorizer_model).fit(documents=list(example_df["Notes"]), embeddings=embeddings)
topic_model.transform(documents=list(example_df["Notes"])[:-2], embeddings=sentence_model.encode(list(example_df["Notes"])[:-2], show_progress_bar=True))
now returns [4, 3, 4, 4, 9, 5, 7, 0, 7, ...]
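The stable KMeans behaviour above (dropping documents does not change the remaining predictions) follows from `.predict` assigning each point independently to its nearest centroid. A minimal sklearn-only sketch with synthetic data (the data and settings here are illustrative, not the thread's actual setup):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Synthetic stand-in for document embeddings
X, _ = make_blobs(n_samples=100, centers=5, random_state=0)
km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)

full = km.predict(X)
subset = km.predict(X[:-2])
# Dropping the last two points cannot change the other assignments,
# because each point is mapped to its nearest centroid independently.
print(np.array_equal(full[:-2], subset))  # → True
```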
Please let me know if there is a way to use topic_model.transform without removing HDBSCAN, though. If there isn't, I can just close this one.
At times, using HDBSCAN for unseen documents can be a bit tricky, as it does have a tendency to predict outliers. This can happen when limited data is given in the transform stage, since it is typically more accurate with more information. In practice, there are a number of things you can try. First, you can try to reduce the number of outliers when using HDBSCAN, as suggested here. Second, there might be an issue with your installation of HDBSCAN, as getting only -1 topics is rather strange; perhaps you have an older version that needs to be updated. Third, you could still use HDBSCAN to figure out a good number of clusters and then use k-Means with that number of clusters. Although the results would differ, there might still be significant overlap.
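The third suggestion can be sketched in a library-agnostic way. In this sketch, sklearn's DBSCAN stands in for HDBSCAN as the density-based step (an assumption for illustration; the real workflow would read the cluster count off the fitted HDBSCAN model):

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import DBSCAN, KMeans

# Synthetic stand-in for document embeddings
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.5, random_state=1)

# Density-based pass: discover how many clusters exist (label -1 = outlier)
density_labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(X)
n_clusters = density_labels.max() + 1

# Centroid-based pass: refit with that cluster count; KMeans.predict then
# assigns any unseen point to its nearest centroid instead of to -1
km = KMeans(n_clusters=n_clusters, n_init=10, random_state=1).fit(X)
print(n_clusters, km.predict(X[:5]))
```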
Also, do note that find_topics works differently from fit, and the results that you get from the latter will not necessarily match the former.
Hi - I came across this Q as I had the same problem, all -1s.
At the risk of missing something fundamental, why does new data have to be clustered? I naively assumed that my embeddings would be transformed the same way and compared to the clusters of the trained data. Shouldn't the default behaviour allow predicting the nearest cluster for a single input? The current assumption seems to be that every new dataset you use .transform on has exactly the same properties as the training data (distribution, relative closeness, etc.), so that the results of clustering magically end up the same. That sounds unlikely? Is this behaviour the same for HDBSCAN vs. k-means? Thanks for clarifying.
@wouterk1MSS
At the risk of missing something fundamental, why does new data have to be clustered? I naively assumed that my embeddings would be transformed the same way and compared to the clusters of the trained data. Shouldn't the default behaviour allow predicting the nearest cluster for a single input?
In general, this highly depends on the clustering algorithm that you are using. Some algorithms approach the clustering task as something that is a bit far from a classification task and end up not having .transform or .predict functions that allow unseen data to be grouped into existing clusters. Some algorithms assume that whenever you see new data, it will impact the structure of the clusters, and therefore might adjust all existing clusters. You can find more about how HDBSCAN approaches that here.
The current assumption seems to be that every new dataset you use .transform on has exactly the same properties as the training data (distribution, relative closeness, etc.), so that the results of clustering magically end up the same. That sounds unlikely?
This is not the case here. The underlying philosophy of BERTopic is that developers know best what kind of clustering approach works best for their use case. HDBSCAN generates great out-of-the-box results but can be difficult to use with respect to predictions. If a cluster model with a more robust .transform or .predict function is needed for a specific use case, it can be used without too many issues.
Hi - I came across this Q as I had the same problem, all -1s.
Although I am not entirely sure, this might be an issue with the environment that you are working in. Starting from a completely fresh environment might resolve the issue you are having.
More concretely, when I run the following, I get non-outlier topics:
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups
docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)
pred_topics, pred_probs = topic_model.transform(docs[:-2])
So there might also be something else happening here since it seems to be working for me.
Hi @MaartenGr - Thanks a lot for your response and clarification.
Regarding the unexpected all--1 output: I've narrowed this down a bit. It seems to be due to instantiating the BERTopic class multiple times, though I don't know how the instances could affect each other. Basically:
topic_model1 = BERTopic()
topics1, probs1 = topic_model1.fit_transform(docs)
--> some result (looks ok)
topic_model2 = BERTopic()
topics2, probs2 = topic_model2.fit_transform(docs_new)
--> some result (looks ok)
topics3, probs3 = topic_model1.transform(docs_new)
--> only -1's
and if we then reinstantiate the first:
topic_model1 = BERTopic()
topics3, probs3 = topic_model1.fit_transform(docs_new)
--> some result (looks ok)
This shouldn't happen right?
@wouterk1MSS From your code, it does not seem to be due to how you instantiate BERTopic, but rather the difference between .fit_transform and .fit + .transform. As mentioned in the post above, this might be HDBSCAN-specific, as its prediction can differ from how it was trained. Moreover, there could be something wrong with your environment or installation of HDBSCAN. Could you try starting from a completely fresh environment? Also, could you try using a different model than HDBSCAN and see if that works?
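The .fit_transform vs. .fit + .transform point can be made concrete with a centroid model, where the two paths agree essentially by construction; density-based models like HDBSCAN instead rely on a separate approximate-prediction step, so the same identity need not hold for them. A small sklearn-only sketch (synthetic data, illustrative settings):

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=200, centers=4, cluster_std=0.6, random_state=0)
km = KMeans(n_clusters=4, n_init=10, random_state=0)

labels_fit = km.fit_predict(X)   # labels produced while fitting
labels_pred = km.predict(X)      # labels from re-assigning to the final centroids

# For KMeans at convergence these (essentially) coincide, which is why
# fitting and then transforming the same data gives consistent results.
agreement = (labels_fit == labels_pred).mean()
print(agreement)
```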
@MaartenGr thank you very much for the help.
I checked, and a couple of entries weren't -1 when I did .transform for a subset of the original data (as in the example above); it was only the vast majority of them that were.
The code you sent a link to about reducing the number of outliers didn't work either.
However, two things did work:
(1) Using HDBSCAN to get the number of clusters and then using KMeans in my topic_model worked well.
(2) Using .transform on totally new data (rather than a subset of the original set of data) with the same topic_model as in the example above worked, and didn't give mostly -1.
So I think the issue is solved. I can close it unless you're still sorting things out with the other user.
Code I ended up sticking with for now, as per (1):
from bertopic import BERTopic
from hdbscan import HDBSCAN
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer
from sentence_transformers import SentenceTransformer

# First pass: let HDBSCAN discover the number of clusters
cluster_model = HDBSCAN(min_cluster_size=10, metric='euclidean',
                        cluster_selection_method='eom', prediction_data=False, min_samples=5)
vectorizer_model = CountVectorizer(stop_words="english")
sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = sentence_model.encode(list(example_df["Notes"]), show_progress_bar=True)
topic_model = BERTopic(n_gram_range=(1, 2), min_topic_size=3, seed_topic_list=seed_topic_list,
                       hdbscan_model=cluster_model,
                       vectorizer_model=vectorizer_model).fit(documents=list(example_df["Notes"]), embeddings=embeddings)

# Second pass: refit with KMeans using the cluster count HDBSCAN found
cluster_model = KMeans(n_clusters=topic_model.hdbscan_model.labels_.max() + 1)
topic_model = BERTopic(n_gram_range=(1, 2), min_topic_size=3, seed_topic_list=seed_topic_list,
                       hdbscan_model=cluster_model,
                       vectorizer_model=vectorizer_model).fit(documents=list(example_df["Notes"]), embeddings=embeddings)
topic_model.save("./models/topic_model_kmeans")
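A further alternative for tagging new documents, without re-clustering at all, is to assign each new embedding to its closest topic centroid directly. The sketch below is generic numpy; the function name and the idea of having per-topic centroid vectors at hand are assumptions for illustration, not BERTopic API:

```python
import numpy as np

def assign_to_nearest_centroid(new_embeddings, centroids):
    """Map each embedding to the index of the most cosine-similar centroid."""
    a = new_embeddings / np.linalg.norm(new_embeddings, axis=1, keepdims=True)
    b = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    return (a @ b.T).argmax(axis=1)

# Tiny demonstration with hand-made 2-D "embeddings"
centroids = np.array([[1.0, 0.0], [0.0, 1.0]])
new_docs = np.array([[0.9, 0.1], [0.2, 0.8]])
print(assign_to_nearest_centroid(new_docs, centroids))  # → [0 1]
```

Because every document is mapped independently, this never produces -1, at the cost of forcing genuine outliers into their nearest topic.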
What is the correct way to predict a label for new documents if I have a fit topic_model? If I use transform on new documents, it always returns a label of -1. Is it more correct to use find_topics to predict a label for new documents?

My code:

Afterwards, if I do topic_model.transform on the original list of documents, I get reasonable labels for each of them. However, if I do topic_model.transform on any other document(s), I always get -1 (even if I use just part of the documents from the example_df["Notes"] list, which I had trained on). Shouldn't topic_model.transform work for any new documents once it's already fit?

I tried topic_model.find_topics (without a custom embedding model) for comparison, and this does get reasonable labels for new documents. But is that the only way to label new documents, rather than topic_model.transform?