citteriomatteo opened 1 year ago
The underlying cluster model, HDBSCAN, has a tendency to assign unseen documents to the outlier class when its internal `.predict`-like function is used. You can either use `.reduce_outliers`, or save the model with safetensors and load it back in. The latter removes the underlying cluster model and changes the way predictions are made.
Thank you for the response. I am trying to use `.reduce_outliers` after the `.transform` call:
```python
def get_message_topic(self, message, preprocess=True):
    """
    Returns the topic prediction related to the input message (used when a new (message, feedback) arrives).
    :param preprocess: whether to preprocess the input or not (bool)
    :param message: message (str)
    :return: prediction for the input message (DataFrame)
    """
    if preprocess:
        message = preprocess_text(message)
    # predict the new text's topic with BERTopic
    prediction = self.model.transform([message])
    prediction = self.model.reduce_outliers([message], [prediction], strategy="distributions")
    if self.verbose:
        logger.info(f"Topic for the message {message} is: {prediction[0][0]}")
    return prediction[0][0]
```
but, when this method is called, the following error is raised:
```
File "C:\Users\mcitterio\PycharmProjects\generative-ai-model-control\codega\drift\topic_modeling\topic_modeling.py", line 176, in get_message_topic
    prediction = self.model.reduce_outliers([message], [prediction], strategy="distributions")
File "C:\Users\mcitterio\PycharmProjects\generative-ai-model-control\venv\Lib\site-packages\bertopic\_bertopic.py", line 2109, in reduce_outliers
    topic_distr, _ = self.approximate_distribution(outlier_docs, min_similarity=threshold, **distributions_params)
File "C:\Users\mcitterio\PycharmProjects\generative-ai-model-control\venv\Lib\site-packages\bertopic\_bertopic.py", line 1241, in approximate_distribution
    topic_distributions = np.vstack(topic_distributions)
File "<__array_function__ internals>", line 180, in vstack
File "C:\Users\mcitterio\PycharmProjects\generative-ai-model-control\venv\Lib\site-packages\numpy\core\shape_base.py", line 282, in vstack
    return _nx.concatenate(arrs, 0)
File "<__array_function__ internals>", line 180, in concatenate
ValueError: need at least one array to concatenate
```
You should check the output of `.transform` and what is contained inside `prediction`. `.transform` returns a tuple of predictions and probabilities, which you cannot pass to `.reduce_outliers` as a single variable.
Hello, I am trying to extract topics from a list of texts. Since my data is probably somewhat low in quality and half of the texts were classified as outliers, I added an outlier-reduction step after topic extraction. This is the code:
And it works fine. However, when I extract the topic of a single text (one that was already part of `self.history_tickets`):
```python
prediction = self.model.transform([text])[0][0]
```
the prediction is in most cases -1. What is the problem? Should I also apply `reduce_outliers` after a single prediction?
Thanks in advance,