MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

Error in transform probabilities #1807

Open anirban-mu opened 8 months ago

anirban-mu commented 8 months ago

I periodically seem to encounter the following error:

Traceback (most recent call last):
  File "<string>", line 4, in <module>
  File "/opt/homebrew/lib/python3.11/site-packages/bertopic/_bertopic.py", line 550, in transform
    probabilities = self._map_probabilities(probabilities, original_topics=True)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/bertopic/_bertopic.py", line 4124, in _map_probabilities
    mapped_probabilities[:, to_topic] += probabilities[:, from_topic]
                                         ~~~~~~~~~~~~~^^^^^^^^^^^^^^^
IndexError: index 14 is out of bounds for axis 1 with size 14

I am unsure how to help debug this because it appears only in some runs. In each case there is a BERTopic model of the form BERTopic(embedding_model=embedding_model, umap_model=umap_model, hdbscan_model=hdbscan_model, representation_model=representation_model, calculate_probabilities=True); I have fitted the model successfully using fit_transform and then called transform to compute topics and probabilities on a new sample. In each case I provide both the documents and the embeddings. The code operates over a collection of sets of documents, so it is run as follows:

for key in topic_models:
    topics[key], _ = topic_models[key].fit_transform(datasets[key], embeddings[key])

I know the models fit successfully because I can obtain topics from them without any apparent error. It is only when calling transform that the error periodically manifests. Its stochastic appearance suggests it has something to do with the fitted topics, but I am entirely unclear how to debug it.
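
As a first diagnostic, here is a sketch of how the fitted mappings could be inspected before calling transform. It pokes at the internal topic_mapper_ attribute, which, going by the traceback, is what drives the probability mapping, so treat it as internal API:

# Sketch: inspect each fitted model's internal topic mappings.
# topic_mapper_ and its get_mappings method are BERTopic internals,
# so this is for debugging only.
for key, model in topic_models.items():
    mappings = model.topic_mapper_.get_mappings(original_topics=True)
    to_topics = set(mappings.values())
    print(
        key,
        "original topics:", len(mappings),
        "distinct mapped topics:", len(to_topics),
        "max mapped topic:", max(to_topics),
    )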

Looking at the relevant code in _map_probabilities:

# Map array of probabilities (probability for assigned topic per document)
if probabilities is not None:
    if len(probabilities.shape) == 2:
        mapped_probabilities = np.zeros((probabilities.shape[0],
                                         len(set(mappings.values())) - self._outliers))
        for from_topic, to_topic in mappings.items():
            if to_topic != -1 and from_topic != -1:
                mapped_probabilities[:, to_topic] += probabilities[:, from_topic]

        return mapped_probabilities

return probabilities

Is to_topic guaranteed to be sequential, or could there be a gap in the indices? I don't know the code base well enough to say, but len(set(mappings.values())) may be the issue. Maybe something like:

if probabilities is not None:
    if len(probabilities.shape) == 2:
        # Find the maximum 'to_topic' index, ensuring the array is large enough
        max_to_topic = max(mappings.values())

        # Initialize 'mapped_probabilities' with a size based on the maximum index found
        mapped_probabilities = np.zeros((probabilities.shape[0], max_to_topic + 1 - self._outliers))

        for from_topic, to_topic in mappings.items():
            if to_topic != -1 and from_topic != -1:
                # Safely add probabilities, knowing 'mapped_probabilities' has enough columns
                mapped_probabilities[:, to_topic] += probabilities[:, from_topic]

        # If necessary, additional steps to handle outliers or resize the array can be added here

        return mapped_probabilities

In this version, non-sequential indices are handled naturally. I do not, however, know whether non-sequential indices are symptomatic of a deeper issue. HTH.
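
To make the suspected failure mode concrete, here is a toy example with made-up mappings in which the mapped topic ids have a gap; sizing the array by len(set(mappings.values())) then raises the same kind of IndexError:

import numpy as np

# Toy illustration with made-up numbers: if the mapped topic ids skip an
# index, an array sized by len(set(mappings.values())) is too small for the
# largest to_topic.
probabilities = np.random.rand(5, 3)      # 5 documents, 3 original topics
mappings = {-1: -1, 0: 0, 1: 2, 2: 2}     # to_topic == 1 never occurs
outliers = 1                              # an outlier topic (-1) exists

n_cols = len(set(mappings.values())) - outliers   # 3 distinct values -> 2 cols
mapped = np.zeros((probabilities.shape[0], n_cols))
for from_topic, to_topic in mappings.items():
    if to_topic != -1 and from_topic != -1:
        mapped[:, to_topic] += probabilities[:, from_topic]  # IndexError here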

I should note that I am unclear about exactly what self._outliers does, so I left it in. Without it I would simply have used max_to_topic + 1, but I kept self._outliers because I have not had the time to look carefully at what it represents.

MaartenGr commented 8 months ago

Hmmm, it is difficult to say without seeing how you instantiated your models. Could you share your full code for that? There might be something going on with the variant of BERTopic that you are using or with any other changes you might have made to the model.

anirban-mu commented 8 months ago

I can't share the original code because the variable names etc. are all linked to things I cannot share. I have tried to create a MWE by removing those elements and replacing them (particularly the data), but I am unable to reproduce the error. Sorry, I know this is crucial to debugging, but whenever I try to create a MWE I end up with a fairly generic version that works.

MaartenGr commented 8 months ago

Hmmm, this is quite difficult. Without a way for me to reproduce the issue, I am not sure I can uncover what the exact problem is. It is like looking for a needle in a haystack without knowing what the needle actually is.

Let's approach it a bit differently then. Could you share what is inside BERTopic(embedding_model=embedding_model, umap_model=umap_model, hdbscan_model=hdbscan_model, representation_model=representation_model, calculate_probabilities=True)? These variables might give some clues.

anirban-mu commented 8 months ago

This is the code after stripping out variable names etc., so there is a non-zero probability that errors were introduced in the changes. In short, two models are run on each df: one on column1 and the other on column2. The object of interest is the topic probabilities these models produce when given both columns concatenated; this way the model estimated on column1 is used to assign probabilities for both column1 and column2, and likewise for the model on column2. Thus, for two columns and two datasets, I end up with four matrices of probabilities. The embeddings are precomputed and saved, and are stacked vertically to mirror the concatenation of the inputs.

# Import necessary libraries
import ast
import openai
import numba
import numpy as np
from openai import OpenAI
import pandas as pd
import umap.umap_ as umap
import sys
import os

from hdbscan import HDBSCAN
from bertopic import BERTopic
from bertopic.backend import OpenAIBackend
from bertopic.representation import OpenAI as OAI

openai.api_key = ""
OAI_client = OpenAI(
    api_key="",
)

# Step 1 - Extract embeddings using an OpenAI embedding model
# Changing embedding_model does not make a difference AFAIK
embedding_model = OpenAIBackend(
    embedding_model="text-embedding-3-large", delay_in_seconds=1, batch_size=1024
)

# Step 2 - Reduce dimensionality using UMAP
# UMAP parameters are chosen based on dataset characteristics and desired dimensionality reduction
umap_model = umap.UMAP(
    n_neighbors=2500, n_components=72, min_dist=0.01, metric="cosine"
)

# Step 3 - Cluster reduced embeddings using HDBSCAN
# The 'leaf' method is used for cluster selection for potentially better-defined clusters
hdbscan_model = HDBSCAN(
    cluster_selection_method="leaf", min_cluster_size=125, prediction_data=True
)

prompt_text = "Identify the primary topic in the reviews represented by the following documents and keywords: [DOCUMENTS] [KEYWORDS]. Provide only the topic label."

# Step 4 - Determine Topic representations using GPT-4 from OpenAI
# Changing model does not make a difference AFAIK
representation_model = OAI(
    client=OAI_client,
    model="gpt-4-turbo-preview",
    chat=True,
    exponential_backoff=True,
    nr_docs=12,
    prompt=prompt_text,
)

# Dictionary to hold BERTopic models
topic_models = {
    "a1": BERTopic(
        embedding_model=embedding_model,
        umap_model=umap_model,
        hdbscan_model=hdbscan_model,
        representation_model=representation_model,
        calculate_probabilities=True,
    ),
    "a2": BERTopic(
        embedding_model=embedding_model,
        umap_model=umap_model,
        hdbscan_model=hdbscan_model,
        representation_model=representation_model,
        calculate_probabilities=True,
    ),
    "b1": BERTopic(
        embedding_model=embedding_model,
        umap_model=umap_model,
        hdbscan_model=hdbscan_model,
        representation_model=representation_model,
        calculate_probabilities=True,
    ),
    "b2": BERTopic(
        embedding_model=embedding_model,
        umap_model=umap_model,
        hdbscan_model=hdbscan_model,
        representation_model=representation_model,
        calculate_probabilities=True,
    ),
    ### Dictionary has more models
}

# Dictionary to hold datasets
datasets = {
    "a1": some_documents,
    "a2": some_documents,
    "b1": some_documents,
    "b2": some_documents,
    ### More data
}

def load_embeddings(base_path, file_name):
    # Load the DataFrame from a pickle file
    df = pd.read_pickle(f"{base_path}/embedding_{file_name}.pkl")
    # Assuming the embeddings are already lists in the first column, directly convert to a NumPy array
    numpy_array = np.array([row for row in df.iloc[:, 0]])
    return numpy_array

# Load embeddings and process with UMAP
base_path = ""
embedding_names = [
    "a1",
    "a2",
    "b1",
    "b2",
]  # More names
embeddings = {name: load_embeddings(base_path, name) for name in embedding_names}

# Fit and transform the BERTopic models
topics = {}
original_probabilities = {}
for key in topic_models:
    topics[key], original_probabilities[key] = topic_models[key].fit_transform(
        datasets[key], embeddings[key]
    )

# Model-key pairs and the DataFrame holding both text columns
combined_datasets = {"big_a": ("a1", "a2", df_a), "big_b": ("b1", "b2", df_b)}

# Process each dataset
for name, (key1, key2, df) in combined_datasets.items():
    # Concatenate 'column1' and 'column2' columns
    combined_df = pd.concat(
        [df["column1"].to_frame(name="data"), df["column2"].to_frame(name="data")],
        axis=0,
    )
    setattr(sys.modules[__name__], f"combined_{name}", combined_df)

    # Concatenate embeddings
    combined_embedding = np.vstack([embeddings[key1], embeddings[key2]])
    setattr(sys.modules[__name__], f"combined_{name}_embedding", combined_embedding)

# Initialize dictionaries to store probabilities
probabilities_dict = {}

# Perform predictions using the models from the dictionary
for dataset in ["big_a", "big_b"]:
    for model_key in ["1", "2"]:
        # Build keys like "a1", "b2" from the dataset suffix and model number
        key = f"{dataset[-1]}{model_key}"
        _, probabilities_dict[key] = topic_models[key].transform(
            documents=getattr(sys.modules[__name__], f"combined_{dataset}"),
            embeddings=getattr(sys.modules[__name__], f"combined_{dataset}_embedding"),
        )

MaartenGr commented 8 months ago

In all honesty, I do not see anything in your code that might explain this issue. It should work, and I am quite surprised that it does not. There might be a workaround, though.

If you save the model using safetensors or pytorch and then load the model back in, the method for performing the prediction changes, which might prevent the issue from arising.
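
Roughly along these lines (a sketch; the paths are placeholders, and with an OpenAI backend you may need to re-attach the embedding model after loading):

from bertopic import BERTopic

# Sketch: save each fitted model with safetensors and reload it. A model
# loaded this way predicts via similarity to the topic embeddings rather
# than through hdbscan's prediction functions.
for key, model in topic_models.items():
    model.save(f"model_{key}", serialization="safetensors", save_ctfidf=True)

loaded_models = {key: BERTopic.load(f"model_{key}") for key in topic_models}
# Then call .transform(documents, embeddings) on the loaded models as before.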

anirban-mu commented 8 months ago

Ok, let me try that. I think it has something to do with the object passed back by UMAP/HDBSCAN, because when I change parameters it seems to fail more or less often. Thanks for looking into it.

MaartenGr commented 8 months ago

No problem, let me know if it works out!

anirban-mu commented 8 months ago

I am unable to resolve it. It has something to do with what HDBSCAN returns, as I am fairly certain everything is fine up to the clustering step. From there, failure happens unpredictably when the corresponding transform method is called in HDBSCAN and the probabilities are then mapped.

I don't think the issue is in HDBSCAN itself, but my conjecture is that HDBSCAN may assign no new documents to some topic, so that the returned probabilities array is smaller (maybe zeroed-out columns are dropped?). I say this because the single most predictive parameter is min_cluster_size: when it is larger (corresponding to clusters that apply to many documents, and hence likely to many new documents), I am less likely to see an error; when it is smaller, errors become more frequent.
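
A check along these lines might confirm this (a sketch with placeholder names; it redoes the UMAP step by hand and calls hdbscan's prediction function directly):

import hdbscan

# Sketch: compare the number of columns hdbscan returns at prediction time
# with the number of non-outlier topics BERTopic fitted. new_embeddings is a
# placeholder for the precomputed embeddings of the new documents.
model = topic_models["a1"]
umap_embeddings = model.umap_model.transform(new_embeddings)
membership = hdbscan.membership_vector(model.hdbscan_model, umap_embeddings)

topics_fitted = set(model.get_topics().keys())
n_topics = len(topics_fitted - {-1})
print("membership columns:", membership.shape[1], "fitted topics:", n_topics)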

As I cannot find a way to chase this down, I am going to leave the bug open in the hope that someone more familiar with the code base can track it down.

MaartenGr commented 8 months ago

Thank you for sharing this! Hopefully, someone else can help out by creating a reproducible example to track down the issue. Indeed, let's keep this open and see if others can provide some help.