Closed by stevetracvc 1 year ago
The outlier reduction method does not update the topics or the internal topic assignments. You would indeed have to apply outlier reduction after `transform` if necessary. The reason for this comes down to modularity: you might want to test different strategies, or even use different strategies between `fit` and `transform`.
With respect to the probabilities, this is actually to be expected. They are generated by HDBSCAN, which calculates probabilities after generating the clusters; in other words, they are an approximation of its internal structure. To have the assignments match the probabilities exactly, you would have to use a different clustering algorithm that relies on the probabilities directly.
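For illustration, the core idea of a probabilities-based outlier reduction pass can be sketched in plain Python. This is a hedged sketch of the general technique, not BERTopic's implementation; the function name and `threshold` parameter are hypothetical:

```python
def reduce_outliers_by_probability(topics, probabilities, threshold=0.0):
    """Re-assign outlier documents (topic -1) to their most probable topic."""
    new_topics = []
    for topic, probs in zip(topics, probabilities):
        if topic != -1:
            new_topics.append(topic)  # keep non-outliers untouched
            continue
        # index of the highest topic probability for this document
        best = max(range(len(probs)), key=probs.__getitem__)
        # only re-assign when the best probability clears the threshold
        new_topics.append(best if probs[best] > threshold else -1)
    return new_topics
```

Because this runs on plain topic/probability lists, the same pass can be applied to the output of `fit_transform` or of a later `transform` call independently.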
Maarten, thanks for the response. I hadn't thought of applying outlier reduction to newly transformed results.
I started working on the second part of your answer but then got distracted with work. I'm still unclear on something, so I'll post a follow-up when I have suitable code.
Is there a reason why the default behavior of `update_topics` is to overwrite the existing models? I.e.,
```python
self.vectorizer_model = vectorizer_model or CountVectorizer(ngram_range=n_gram_range)
self.ctfidf_model = ctfidf_model or ClassTfidfTransformer()
self.representation_model = representation_model
```
rather than only overwriting when new models are provided?
```python
self.vectorizer_model = vectorizer_model or self.vectorizer_model
self.ctfidf_model = ctfidf_model or self.ctfidf_model
self.representation_model = representation_model or self.representation_model
```
The reason for this is that `update_topics` defaults back to the original representation when you call it without tuning any parameters. Essentially, it is a way to return your topic model to its original representation without any tuning. That would not be easily possible if it did not overwrite the given representation models.
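The difference between the two assignment patterns comes down to Python's `or` fallback. The standalone sketch below is illustrative only (the `Model` class and string-valued models are not BERTopic code):

```python
class Model:
    def __init__(self, vectorizer=None):
        self.vectorizer = vectorizer or "default-vectorizer"

    def update(self, vectorizer=None):
        # current behavior: no argument resets to the default model
        self.vectorizer = vectorizer or "default-vectorizer"

    def update_keep(self, vectorizer=None):
        # alternative: no argument keeps the existing model
        self.vectorizer = vectorizer or self.vectorizer

m = Model("custom")
m.update()       # m.vectorizer is reset to "default-vectorizer"

m2 = Model("custom")
m2.update_keep() # m2.vectorizer stays "custom"
```

With the first pattern, calling the update with no arguments is a deliberate "reset to original representation"; with the second, it is a no-op.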
I don't know if this is an error in the code, or an error in my expectation of what should happen after outlier reduction.
I wrote a test_reduce_outliers.py as shown below
My expectation is that, after outlier reduction, the model will be more lenient when classifying a document. However, if you use outlier reduction and then call `transform` with one of the documents that was originally an outlier, it is still classified as an outlier (asserts 1 and 2). So what is the most appropriate way to handle transforming new documents after using outlier reduction? If I'm using the probabilities method, it seems I can just take the argmax of the probabilities. But what if I'm using the embeddings method instead? How would I correctly place a new document? It appears I'd have to recreate what's done in the `reduce_outliers` function...?
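One way to mirror the embeddings strategy by hand is to compare a new document's embedding against each topic embedding with cosine similarity and take the closest topic. This is a hedged sketch of that idea under stated assumptions, not BERTopic's actual implementation; `assign_by_embeddings` and its `threshold` parameter are hypothetical names:

```python
import numpy as np

def assign_by_embeddings(doc_embedding, topic_embeddings, threshold=0.0):
    """Place one document by cosine similarity to topic embeddings.

    doc_embedding:    1-D array for the new document
    topic_embeddings: 2-D array, one row per topic
    """
    # cosine similarity between the document and every topic embedding
    sims = topic_embeddings @ doc_embedding / (
        np.linalg.norm(topic_embeddings, axis=1) * np.linalg.norm(doc_embedding)
    )
    best = int(np.argmax(sims))
    # fall back to the outlier label when no topic is similar enough
    return best if sims[best] >= threshold else -1
```

In practice you would feed this the document embedding from your embedding model and the model's topic embeddings; how closely it reproduces `reduce_outliers` depends on how those topic embeddings are computed.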
And even more baffling: when using the "probabilities" method for outlier reduction, the topic chosen after outlier reduction is not the same as the topic with the highest probability (third assert), though I think this is related to embedding calculations. E.g.,
is False, which means that when transforming a list of documents, the surrounding documents seem to influence the embeddings (which looks like an issue with SentenceTransformers rather than BERTopic).