MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

Expected behavior of transform after outlier reduction? #1349

Closed stevetracvc closed 1 year ago

stevetracvc commented 1 year ago

I don't know if this is an error in the code, or an error in my expectation of what should happen after outlier reduction.

I wrote a test_reduce_outliers.py, shown below:

import copy
import pytest

@pytest.mark.parametrize('model', [('base_topic_model'),
                                   ('custom_topic_model'),
                                   ('merged_topic_model'),
                                   ('reduced_topic_model'),
                                   ('online_topic_model')])
def test_reduce_outliers(model, documents, request):
    topic_model = copy.deepcopy(request.getfixturevalue(model))
    topics = topic_model.topics_
    probs = topic_model.probabilities_

    # find the outliers
    orig_document_info = topic_model.get_document_info(documents)
    outliers = orig_document_info[orig_document_info['Topic'] == -1]

    new_topics = topic_model.reduce_outliers(
        documents, topics,
        probabilities=probs, strategy="probabilities",
        threshold=0,
        )
    topic_model.update_topics(
        documents,
        topics=new_topics,
        top_n_words=topic_model.top_n_words,
        vectorizer_model=topic_model.vectorizer_model,
        ctfidf_model=topic_model.ctfidf_model,
        representation_model=topic_model.representation_model,
        )

    # pick the ID of the first outlier
    idx = outliers.iloc[0].name
    reduced_topic = topic_model.get_document_info(documents).loc[idx, 'Topic']
    # transform this document (yes, again)
    transform_results = topic_model.transform(documents[idx])
    if isinstance(transform_results, tuple):
        # returned topic and probabilities
        new_topic = transform_results[0][0]
    else:
        new_topic = transform_results[0]

    # this assert fails, comparing new transform to the outlier-reduced doc
    assert new_topic == reduced_topic
    # this assert will also fail
    assert new_topic != -1
    # and this one too!
    if isinstance(transform_results, tuple):
        assert reduced_topic == transform_results[1].argmax()

My expectation was that, after outlier reduction, the model would be more lenient when classifying a document. However, if you use outlier reduction and then call transform on one of the documents that was originally an outlier, it is still classified as an outlier (asserts 1 & 2). So what is the most appropriate way to handle transforming new documents after using outlier reduction? If I'm using the probabilities strategy, it seems like I can just take the argmax of the probabilities. But what if I'm using the embeddings strategy instead? How would I correctly place a new document? It appears that I'd have to recreate what's done in the reduce_outliers function...?
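The argmax fallback described above can be sketched with plain numpy. The helper name and the mock probability row are hypothetical; `probs_row` stands for one row of the probability matrix that transform returns:

```python
import numpy as np

def assign_with_fallback(topic, probs_row):
    """Hypothetical helper: if transform assigned the outlier topic (-1),
    fall back to the most probable real topic instead."""
    if topic != -1:
        return topic
    return int(np.argmax(probs_row))

# transform said -1, but topic 2 has the highest probability
print(assign_with_fallback(-1, np.array([0.1, 0.2, 0.6, 0.1])))  # 2
```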

Even more baffling: when using the "probabilities" strategy for outlier reduction, the topic chosen after outlier reduction is not the topic with the highest probability (third assert), though I think this is related to embedding calculations. E.g.,

np.isclose(
    topic_model._extract_embeddings(documents[0:1], method="document")[0],
    topic_model._extract_embeddings(documents[0:200], method="document")[0],
).all()

is False, which means that when transforming a list of documents, the surrounding documents seem to influence the embeddings (this looks like an issue with SentenceTransformers rather than BERTopic).

MaartenGr commented 1 year ago

The outlier reduction method does not update the topics or the internal assignment of topics. You would indeed have to run outlier reduction after transform if necessary. The reason boils down to modularity: you might want to test out different strategies, or even use different strategies between fit and transform.
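What that post-transform reduction does for the "probabilities" strategy can be sketched in plain numpy. This is an illustrative re-implementation, not BERTopic's actual code, and the function name is hypothetical; it mirrors the documented behavior of reduce_outliers(..., strategy="probabilities"):

```python
import numpy as np

def reduce_outliers_by_probability(topics, probs, threshold=0.0):
    """Sketch of the "probabilities" strategy: every outlier document
    (topic -1) whose best topic probability clears the threshold is
    reassigned to that best topic; all other documents are unchanged."""
    topics = np.asarray(topics).copy()
    probs = np.asarray(probs)
    best = probs.argmax(axis=1)           # most probable topic per document
    confident = probs.max(axis=1) >= threshold
    mask = (topics == -1) & confident     # only confident outliers move
    topics[mask] = best[mask]
    return topics

# two outliers: the first is confident enough to be reassigned, the last is not
new_topics = reduce_outliers_by_probability(
    [-1, 0, -1],
    [[0.1, 0.9], [0.8, 0.2], [0.05, 0.04]],
    threshold=0.5,
)
print(new_topics)  # outlier 0 becomes topic 1; document 2 stays -1
```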

With respect to the probabilities, this is actually to be expected. They are generated by HDBSCAN, which calculates probabilities after generating the clusters; in other words, they are an approximation of its internal structure. You would have to use a different clustering algorithm, one that derives labels from the probabilities directly, for the two to match.
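To illustrate the distinction, here is a toy soft-assignment clusterer (pure numpy, entirely hypothetical) in which the labels are defined as the argmax of the probabilities, so label and argmax agree by construction. HDBSCAN computes its probabilities after the clusters exist, so it gives no such guarantee:

```python
import numpy as np

def soft_assign(X, centroids):
    """Toy clusterer where probabilities define the labels: softmax over
    negative distances to each centroid, label = argmax of each row."""
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    logits = -d
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    return probs.argmax(axis=1), probs

X = np.array([[0.0, 0.1], [5.0, 5.2], [0.2, -0.1]])
centroids = np.array([[0.0, 0.0], [5.0, 5.0]])
labels, probs = soft_assign(X, centroids)
# labels and argmax agree for every document, unlike with HDBSCAN
assert (labels == probs.argmax(axis=1)).all()
```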

stevetracvc commented 1 year ago

Maarten, thanks for the response. I hadn't thought of using the outlier reduction on newly-transformed results.

I started working on the second part of your answer, but then got distracted with work. I'm still unclear on something, but I'll post a follow-up when I have suitable code.

stevetracvc commented 1 year ago

Is there a reason why the default action of update_topics is to overwrite the existing models? I.e.,

        self.vectorizer_model = vectorizer_model or CountVectorizer(ngram_range=n_gram_range)
        self.ctfidf_model = ctfidf_model or ClassTfidfTransformer()
        self.representation_model = representation_model

rather than only overwriting if new models are provided?

        self.vectorizer_model = vectorizer_model or self.vectorizer_model
        self.ctfidf_model = ctfidf_model or self.ctfidf_model
        self.representation_model = representation_model or self.representation_model

MaartenGr commented 1 year ago

The reason for this is that .update_topics defaults back to the original representation when you call it without tuning any parameters. Essentially, it is a way to return your topic model to its original representation without any tuning. This would not be easily possible if the given representation models were not overwritten.