How to get top N topics for a single document.

VioletRaven commented 1 year ago

I want to ask some simple questions. After fitting, transforming, reducing the outliers, and updating the topic model this way:

embedding = ...
topics, probs = model.fit_transform(docs, embeddings = emb)
new_topics = model.reduce_outliers(docs, topics)
model.update_topics(new_docs, topics=new_topics)

I want to predict one single new document this way: prediction= model.transform(documents=new_document, embeddings=embedding) The result is the following: ([-1], array([[1.16350449e-02, 2.01218509e-01, 1.11319454e-02, 7.07114037e-04, 3.97837834e-03, 2.08542765e-04, 3.01921767e-03, 7.32074922e-02, 4.58492993e-01, 9.87979027e-03, 9.41436941e-03, 1.09079184e-02, 1.02977729e-02, 9.64706333e-03, 1.12269956e-02, 1.04989969e-02, 3.24051104e-03, 1.23429396e-02, 1.00992529e-02, 9.99152714e-03, 1.06927620e-02, 1.20518505e-02, 1.06281160e-02, 4.32279154e-03]]))

The tuple contains:

The topic prediction [-1]
An array of probabilities for the other remaining topics

Am I right?

Why is the topic prediction [-1] when I expressly removed the outliers and updated the model? If I simply call model.topic_labels_ I don't get the [-1] topic anymore and it starts from [0] as expected. Why is my model still predicting the topic [-1]?

Moreover, I guess the array of probabilities is following the normal ascending order (in my case from 0 to 23). Right? If that's the case I can imagine that the 8th value in this list [4.58492993e-01] (9th if counting non-pythonically) is the one corresponding to the most corresponding topic. Is this true?

I am asking you this so that I can easily predict the top N topics for a single new document.

I have reviewed the BERTopic documentation and explored the available methods, but I couldn't find a straightforward way to obtain the probabilities for all topics beyond the top 1 prediction for a single new document.

Could you please help me? I would greatly appreciate any assistance or suggested workarounds for achieving this.

Thank you in advance!

MaartenGr commented 1 year ago

Why is the topic prediction [-1] when I expressly removed the outliers and updated the model? If I simply call model.topiclabels I don't get the [-1] topic anymore and it starts from [0] as expected. Why is my model still predicting the topic [-1]?

This indeed should not be the case assuming there is not a single -1 to be found in new_topics. Could you check that? Also, could you share your full code? There might be some nuances here and there that could explain things. Lastly, what version of BERTopic are you using?

Moreover, I guess the array of probabilities is following the normal ascending order (in my case from 0 to 23). Right? If that's the case I can imagine that the 8th value in this list [4.58492993e-01] (9th if counting non-pythonically) is the one corresponding to the most corresponding topic. Is this true?

The probabilities returned (assuming you did not load in a previously trained model and you used calculate_probabilities=True) gives back the probabilities for all non-outlier topics in ascending order.

I have reviewed the BERTopic documentation and explored the available methods, but I couldn't find a straightforward way to obtain the probabilities for all topics beyond the top 1 prediction for a single new document.

There are several ways of doing this. First, is by setting calculate_probabilities=True. Then you can access the document-topic distributions with topic_model.probabilities_. Second, is by applying .approximate_distribution to generate the probabilities after training your initial model.

VioletRaven commented 1 year ago

Hello there,

BERTopic version --- 0.15.0

Code generating these results:

umap_args = {
'n_neighbors' : int(value),
'n_components' : int(value),
'min_dist' : value,
'metric' : cosine,
'random_state' : 42
}
hdbscan_args = {
'min_cluster_size' : int(value),
'min_samples' :int(value),
'cluster_selection_epsilon' : float(value), # needs to be casted as float
'prediction_data' : True  # needed when making inference
}
ctfidf_model = ClassTfidfTransformer()
reduce_frequent_words = value 
umap_model = UMAP(**umap_args)
hdbscan_model = HDBSCAN(**hdbscan_args)

model = BERTopic(
        calculate_probabilities=True,
        umap_model=umap_model,             
        hdbscan_model=hdbscan_model,        
        vectorizer_model=TfidfVectorizer(stop_words=nltk.corpus.stopwords.words('italian')),  
        ctfidf_model=ctfidf_model,          
        nr_topics=None,
        language=None
)

topics, probs = model.fit_transform(docs, embeddings = emb)
model_topics_info = model.get_topic_info()
print('Old Topics')
print(model_topics_info)
new_topics = model.reduce_outliers(docs, topics)
model_topics_info = model.get_topic_info()
print('Topics After Reduction')
print(model_topics_info)
print("New Topics:", new_topics)
#update topics
model.update_topics(docs, topics=new_topics)
model_topics_info = model.get_topic_info()
print('Topics After Updating')
print(model_topics_info)

Old topics for a total of 25 topics:

New topics after .reduce_outliers and update_topics for a total of 25 topics:

For the embedding of the single new document (I am using a HF 16k tokens model and embedding dimension of (1, 768) ) :

from . import Embedder
new_doc = """string containing many words in italian"""
embedder = Embedder(model)
embedder.generate_embeddings(document = new_doc)
# need to expand dims and list the single document
emb = np.expand_dims(np.array(emb), axis = 0)
new_doc = [new_doc]

prediction = model.transform(documents=new_doc, embeddings=emb)

prediction = ([-1],
 array([[1.16350449e-02, 2.01218509e-01, 1.11319454e-02, 7.07114037e-04,
         3.97837834e-03, 2.08542765e-04, 3.01921767e-03, 7.32074922e-02,
         4.58492993e-01, 9.87979027e-03, 9.41436941e-03, 1.09079184e-02,
         1.02977729e-02, 9.64706333e-03, 1.12269956e-02, 1.04989969e-02,
         3.24051104e-03, 1.23429396e-02, 1.00992529e-02, 9.99152714e-03,
         1.06927620e-02, 1.20518505e-02, 1.06281160e-02, 4.32279154e-03]]))

Unfortunately, the probabilities for all non-outliers topics are not in ascending order :(

values = prediction[1][0]  # Extract the values from the array
indices = np.argsort(-values)  # Sort indices in decreasing order
sorted_values = sorted_values = values[indices]
print(sorted_values)
print(indices)

[4.58492993e-01 2.01218509e-01 7.32074922e-02 1.23429396e-02
 1.20518505e-02 1.16350449e-02 1.12269956e-02 1.11319454e-02
 1.09079184e-02 1.06927620e-02 1.06281160e-02 1.04989969e-02
 1.02977729e-02 1.00992529e-02 9.99152714e-03 9.87979027e-03
 9.64706333e-03 9.41436941e-03 4.32279154e-03 3.97837834e-03
 3.24051104e-03 3.01921767e-03 7.07114037e-04 2.08542765e-04]
[ 8  1  7 17 21  0 14  2 11 20 22 15 12 18 19  9 13 10 23  4 16  6  3  5]

Thank you for your kind support! :)

MaartenGr commented 1 year ago

Unfortunately, the probabilities for all non-outliers topics are not in ascending order :(

They are indeed not in ascending order with respect to their values but they are in ascending order with respect to their topic id. In other words, topic 0 should be at index 0, topic 1 at index 1, etc.

VioletRaven commented 1 year ago

Ok, that's great news. However, the problem still remains when making the prediction. How can I get rid of the [-1] prediction?

MaartenGr commented 1 year ago

Whenever you face a [-1] prediction, simply find the index of the highest value in the corresponding probability vector. That will then be your non-outlier topic.

VioletRaven commented 1 year ago

Yeah, I was actually thinking the same thing, so it's good to know we're on the same page! By the way, I noticed that when I add up the probability values, they don't always equal 1.

[4.58492993e-01 2.01218509e-01 7.32074922e-02 1.23429396e-02 1.20518505e-02 1.16350449e-02 1.12269956e-02 1.11319454e-02 1.09079184e-02 1.06927620e-02 1.06281160e-02 1.04989969e-02 1.02977729e-02 1.00992529e-02 9.99152714e-03 9.87979027e-03 9.64706333e-03 9.41436941e-03 4.32279154e-03 3.97837834e-03 3.24051104e-03 3.01921767e-03 7.07114037e-04 2.08542765e-04]

In this case, their sum is 0.9088418947404864. Maybe that's why we still see a bit of the [-1] topic hanging around, since it could account for the remaining 0.09115810525951362. Although, if this was the case it still has a smaller probability compared to some other topic.

MaartenGr commented 1 year ago

The -1 probability is indeed 1 - sum(probabilities). These probabilities, however, are generally an approximation of the assignment and not actually core part of it. This is HDBSCAN-specific behavior and you can find more about how it handles soft clustering here and here.

MaartenGr / BERTopic

How to get top N topics for a single document. #1398