citteriomatteo opened 1 year ago
The underlying cluster model, HDBSCAN, has a tendency to assign unseen documents to the outlier class when its internal `.predict`-like function is used. You can either use `.reduce_outliers`, or save the model with safetensors and load it back in. The latter removes the underlying cluster model and changes the way predictions are made.
Thank you for the response. I am trying to use `.reduce_outliers` after the `.transform` call:
```python
def get_message_topic(self, message, preprocess=True):
    """
    Returns the topic prediction related to the input message (used when a new (message, feedback) arrives).
    :param preprocess: whether to preprocess the input or not (bool)
    :param message: message (str)
    :return: prediction for the input message (DataFrame)
    """
    if preprocess:
        message = preprocess_text(message)
    # predict the new text's topic with BERTopic
    prediction = self.model.transform([message])
    prediction = self.model.reduce_outliers([message], [prediction], strategy="distributions")
    if self.verbose:
        logger.info(f"Topic for the message {message} is: {prediction[0][0]}")
    return prediction[0][0]
```
but, when this method is called, the following error is raised:
```
File "C:\Users\mcitterio\PycharmProjects\generative-ai-model-control\codega\drift\topic_modeling\topic_modeling.py", line 176, in get_message_topic
    prediction = self.model.reduce_outliers([message], [prediction], strategy="distributions")
File "C:\Users\mcitterio\PycharmProjects\generative-ai-model-control\venv\Lib\site-packages\bertopic\_bertopic.py", line 2109, in reduce_outliers
    topic_distr, _ = self.approximate_distribution(outlier_docs, min_similarity=threshold, **distributions_params)
File "C:\Users\mcitterio\PycharmProjects\generative-ai-model-control\venv\Lib\site-packages\bertopic\_bertopic.py", line 1241, in approximate_distribution
    topic_distributions = np.vstack(topic_distributions)
File "<__array_function__ internals>", line 180, in vstack
File "C:\Users\mcitterio\PycharmProjects\generative-ai-model-control\venv\Lib\site-packages\numpy\core\shape_base.py", line 282, in vstack
    return _nx.concatenate(arrs, 0)
File "<__array_function__ internals>", line 180, in concatenate
ValueError: need at least one array to concatenate
```
You should check the output of `.transform` and what is contained inside `prediction`. `.transform` returns a tuple of predictions and probabilities, which you cannot pass to `.reduce_outliers` as a single variable.
Hello, I am trying to extract topics from a list of texts. Since my data is probably somewhat low in quality and half of the texts were classified as outliers, I added an outlier-reduction step after topic extraction. This is the code:
And it works fine. However, when I extract the topic of a single text (one that was already part of `self.history_tickets`):
```python
prediction = self.model.transform([text])[0][0]
```
the prediction is in most cases -1. What is the problem? Should I also apply `reduce_outliers` after a single prediction?
Thanks in advance,