Open anirban-mu opened 8 months ago
Hmmm, it is difficult to say without seeing how you instantiated your models. Could you share your full code for that? There might be something going on with the variant of BERTopic that you are using, or with other changes that you might have made to the model.
I can't share the original code as the variable names etc. are all linked to things I cannot share. I have tried to create an MWE by taking away those elements and replacing them (particularly the data), but I am unable to reproduce the error. Sorry, I know this is crucial to debugging, but when I try to create an MWE, I end up with a pretty generic version that works.
Hmmm, this is quite difficult. Without a way for me to reproduce the issue, I am not sure I can uncover what exactly is going wrong. It is like looking for a needle in a haystack without knowing what the needle actually is.
Let's approach it a bit differently then. Could you share what is inside BERTopic(embedding_model=embedding_model, umap_model=umap_model, hdbscan_model=hdbscan_model, representation_model=representation_model, calculate_probabilities=True)? These variables might give some clues.
This is the code after stripping out variable names etc., so there is a nonzero probability that errors were introduced in the changes. In short, two models are run on each df: one on column1 and the other on column2. The object of interest is the topic probabilities from these models when given both columns concatenated. This way the model estimated on column1 is used to assign probabilities for both column1 and column2, and likewise for the model estimated on column2. Thus, for two columns and two datasets, I end up with 4 matrices of probabilities. The embeddings are precomputed and saved. These are likewise stacked vertically, mirroring the concatenation of inputs.
# Import necessary libraries
import ast
import openai
import numba
import numpy as np
from openai import OpenAI
import pandas as pd
import umap.umap_ as umap
import sys
import os
from hdbscan import HDBSCAN
from bertopic import BERTopic
from bertopic.backend import OpenAIBackend
from bertopic.representation import OpenAI as OAI
openai.api_key = ""
OAI_client = OpenAI(
    api_key="",
)
# Step 1 - Extract embeddings using an OpenAI embedding model
# Changing embedding_model does not make a difference AFAIK
embedding_model = OpenAIBackend(
    client=OAI_client,  # the backend needs a client to call the API
    embedding_model="text-embedding-3-large",
    delay_in_seconds=1,
    batch_size=1024,
)
# Step 2 - Reduce dimensionality using UMAP
# UMAP parameters are chosen based on dataset characteristics and desired dimensionality reduction
umap_model = umap.UMAP(
    n_neighbors=2500, n_components=72, min_dist=0.01, metric="cosine"
)
# Step 3 - Cluster reduced embeddings using HDBSCAN
# The 'leaf' method is used for cluster selection for potentially better-defined clusters
hdbscan_model = HDBSCAN(
    cluster_selection_method="leaf", min_cluster_size=125, prediction_data=True
)
prompt_text = "Identify the primary topic in the reviews represented by the following documents and keywords: [DOCUMENTS] [KEYWORDS]. Provide only the topic label."
# Step 4 - Determine Topic representations using GPT-4 from OpenAI
# Changing model does not make a difference AFAIK
representation_model = OAI(
    client=OAI_client,
    model="gpt-4-turbo-preview",
    chat=True,
    exponential_backoff=True,
    nr_docs=12,
    prompt=prompt_text,
)
# Dictionary to hold BERTopic models
topic_models = {
    "a1": BERTopic(
        embedding_model=embedding_model,
        umap_model=umap_model,
        hdbscan_model=hdbscan_model,
        representation_model=representation_model,
        calculate_probabilities=True,
    ),
    "a2": BERTopic(
        embedding_model=embedding_model,
        umap_model=umap_model,
        hdbscan_model=hdbscan_model,
        representation_model=representation_model,
        calculate_probabilities=True,
    ),
    "b1": BERTopic(
        embedding_model=embedding_model,
        umap_model=umap_model,
        hdbscan_model=hdbscan_model,
        representation_model=representation_model,
        calculate_probabilities=True,
    ),
    "b2": BERTopic(
        embedding_model=embedding_model,
        umap_model=umap_model,
        hdbscan_model=hdbscan_model,
        representation_model=representation_model,
        calculate_probabilities=True,
    ),
    ### Dictionary has more models
}
# Dictionary to hold datasets
datasets = {
    "a1": some_documents,
    "a2": some_documents,
    "b1": some_documents,
    "b2": some_documents,
    ### More data
}
def load_embeddings(base_path, file_name):
    # Load the DataFrame from a pickle file
    df = pd.read_pickle(f"{base_path}/embedding_{file_name}.pkl")
    # Assuming the embeddings are already lists in the first column, directly convert to a NumPy array
    numpy_array = np.array([row for row in df.iloc[:, 0]])
    return numpy_array
# Load the precomputed embeddings
base_path = ""
embedding_names = [
    "a1",
    "a2",
    "b1",
    "b2",
]  # More names
embeddings = {name: load_embeddings(base_path, name) for name in embedding_names}
# Fit and transform the BERTopic models
topics = {}
original_probabilities = {}
for key in topic_models:
    topics[key], original_probabilities[key] = topic_models[key].fit_transform(
        datasets[key], embeddings[key]
    )
# Datasets and embeddings
datasets = {"big_a": ("a1", "a2", df_a), "big_b": ("b1", "b2", df_b)}
# Process each dataset
for name, (key1, key2, df) in datasets.items():
    # Concatenate 'column1' and 'column2' columns
    combined_df = pd.concat(
        [df["column1"].to_frame(name="data"), df["column2"].to_frame(name="data")],
        axis=0,
    )
    setattr(sys.modules[__name__], f"combined_{name}", combined_df)
    # Concatenate embeddings
    combined_embedding = np.vstack([embeddings[key1], embeddings[key2]])
    setattr(sys.modules[__name__], f"combined_{name}_embedding", combined_embedding)
# Initialize dictionaries to store probabilities
probabilities_dict = {}
# Performing predictions using the models from the dictionary
for dataset in ["big_a", "big_b"]:
    for model_key in ["1", "2"]:
        # "big_a" and "1" -> "a1", matching the keys of topic_models
        key = f"{dataset[-1]}{model_key}"
        _, probabilities_dict[key] = topic_models[key].transform(
            documents=getattr(sys.modules[__name__], f"combined_{dataset}"),
            embeddings=getattr(sys.modules[__name__], f"combined_{dataset}_embedding"),
        )
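Since the stacked embeddings have to mirror the concatenated documents row for row, a quick guard before each transform call can rule out a simple length mismatch. This is only a sketch: check_alignment is a hypothetical helper, not part of BERTopic, and it relies on nothing more than len() working on both inputs (lists, DataFrames and NumPy arrays all support it).

```python
def check_alignment(documents, embeddings):
    """Return the shared row count, or raise if the two inputs disagree.

    Hypothetical helper for illustration; it only compares lengths and does
    not verify that row i of `embeddings` really belongs to document i.
    """
    n_docs = len(documents)
    n_rows = len(embeddings)
    if n_docs != n_rows:
        raise ValueError(
            f"{n_docs} documents but {n_rows} embedding rows; "
            "the stacked inputs are not aligned"
        )
    return n_docs


# Toy usage with plain lists standing in for the real data
docs = ["first review", "second review"]
vectors = [[0.1, 0.2], [0.3, 0.4]]
print(check_alignment(docs, vectors))  # 2
```

In the code above it could be called with the combined DataFrame and the combined embedding array for each dataset just before transform.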
In all honesty, I do not see anything in your code that might explain this issue. It should work and I am quite surprised that it does not. There might be a solution though.
If you save the model using safetensors or pytorch and then load the model back in, the method for performing the prediction changes, which might prevent the issue from arising.
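A rough sketch of that round trip is below. The helper name save_and_reload and the directory name are made up for illustration; save(..., serialization=...) and BERTopic.load are the library's save and load entry points. The sketch assumes precomputed embeddings are passed to transform afterwards, which is why the embedding model is not saved.

```python
def save_and_reload(topic_model, model_dir, serialization="safetensors"):
    """Save a fitted BERTopic model to `model_dir` and load it back.

    Hypothetical helper. `serialization` may be "safetensors" or "pytorch".
    """
    # Imported lazily so that defining the helper does not require bertopic.
    from bertopic import BERTopic

    topic_model.save(
        model_dir,
        serialization=serialization,
        save_ctfidf=True,            # keep the c-TF-IDF representations
        save_embedding_model=False,  # assumption: embeddings are precomputed
    )
    return BERTopic.load(model_dir)


# Usage (assumes topic_models["a1"] has already been fitted):
# reloaded = save_and_reload(topic_models["a1"], "model_a1_dir")
# topics, probs = reloaded.transform(documents, embeddings=precomputed_embeddings)
```

Here documents and precomputed_embeddings are placeholders for a list of strings and the matching embedding array.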
OK, let me try that. I think it has something to do with the object passed back by UMAP/HDBSCAN, because when I change parameters it seems to fail, or at least fail more or less often. Thanks for looking into it.
No problem, let me know if it works out!
I am unable to resolve this. It has something to do with what HDBSCAN returns, as I am fairly certain everything is OK up to the clustering step. From there, failure happens unpredictably when the corresponding transform method is called in HDBSCAN and the probabilities are then mapped.
I don't think the issue is in HDBSCAN itself, but my conjecture is that HDBSCAN may not assign any of the new documents to some topic, such that the returned probabilities array is smaller (maybe zeroed-out columns are dropped)? I say this because the single most predictive parameter of an issue is min_cluster_size: when this is larger (corresponding to clusters that apply to many documents, and hence likely to many new documents) I am less likely to see an error. When this is smaller, errors become more frequent.
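One way to probe this conjecture directly is to compare the width of the membership vectors HDBSCAN produces for the new points with the number of clusters found at fit time. The sketch below is a diagnostic aid only: membership_width_report and n_columns are hypothetical helpers, and it assumes the clusterer was fitted with prediction_data=True (as in the code above) so that hdbscan.membership_vector can be called on it.

```python
def n_columns(matrix):
    """Number of columns of a 2-D array-like (NumPy array or list of lists)."""
    shape = getattr(matrix, "shape", None)
    if shape is not None:
        return int(shape[1])
    return len(matrix[0])


def membership_width_report(clusterer, reduced_embeddings, membership_fn=None):
    """Compare the membership-vector width with the number of fitted clusters.

    `clusterer` is a fitted HDBSCAN object and `reduced_embeddings` are the
    new points after dimensionality reduction. `membership_fn` exists only so
    the helper can be tried out without hdbscan installed.
    """
    if membership_fn is None:
        import hdbscan  # imported lazily
        membership_fn = hdbscan.membership_vector

    vectors = membership_fn(clusterer, reduced_embeddings)
    # Cluster labels seen at fit time, ignoring the outlier label -1
    fitted = {int(label) for label in clusterer.labels_ if int(label) != -1}
    width = n_columns(vectors)
    return {
        "n_fitted_clusters": len(fitted),
        "membership_columns": width,
        "widths_match": len(fitted) == width,
    }


# Usage (assumes the fitted models and stacked embeddings from the code above):
# model = topic_models["a1"]
# reduced = model.umap_model.transform(combined_big_a_embedding)
# print(membership_width_report(model.hdbscan_model, reduced))
```

If widths_match ever comes back False on a failing run, that would support the idea that the array handed to the probability mapping is narrower than expected; if it is always True, the cause is more likely elsewhere.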
As I cannot find a way to chase this down, I am going to leave the bug open with the hope that better programmers can chase it down.
Thank you for sharing this! Hopefully, someone else can help out by creating a reproducible example to track down the issue. Indeed, let's keep this open and see if others can provide some help.
I periodically seem to encounter the following error:
I am unsure of how to help debug it because it only appears in some runs and not others. In each case there is a BERTopic model of the form BERTopic(embedding_model=embedding_model, umap_model=umap_model, hdbscan_model=hdbscan_model, representation_model=representation_model, calculate_probabilities=True), I have fitted the model successfully using fit_transform, and I have then called transform to compute topics and probabilities in a new sample. In addition, in each case, I provide both the documents and the embeddings. The code operates over a collection of sets of documents, so it is run as follows:
I know the models fit successfully because I can obtain topics from them and there does not seem to be an error. It is only when calling transform that an error periodically manifests. Its stochastic appearance suggests it has something to do with the fitted topics, but I am entirely unclear as to how to debug it.
In this code:
Is to_topic guaranteed to be sequential? Could there be a gap in the indices? I don't know the code base well enough, but len(set(mappings.values())) may be the issue? Maybe something like:
In this code, the case of non-sequential indices is handled naturally. I do not, however, know if non-sequential indices are symptomatic of a deeper issue. HTH.
I should note that I am unclear on exactly what is going on with self._outliers, so I left it in. Maybe this should be max_to_topic + 1? That is what I would have done without the self._outliers, but I left self._outliers in because I don't understand (have not had the time to look that carefully at) what it is.
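To make the concern about gaps concrete, here is a toy illustration in plain Python. It is not BERTopic's implementation: the function names, the single-row input and the example mappings are all made up, and outlier handling (the self._outliers part) is deliberately left out. It only shows that sizing the output by the number of distinct targets breaks as soon as the target indices are not sequential, while sizing by the largest target index does not.

```python
def map_by_count(row, mappings):
    """Size the output by the number of distinct target topics."""
    out = [0.0] * len(set(mappings.values()))
    for from_topic, to_topic in mappings.items():
        out[to_topic] += row[from_topic]
    return out


def map_by_max(row, mappings):
    """Size the output by the largest target index plus one."""
    out = [0.0] * (max(mappings.values()) + 1)
    for from_topic, to_topic in mappings.items():
        out[to_topic] += row[from_topic]
    return out


row = [0.25, 0.5, 0.25]          # probabilities for three original topics
sequential = {0: 0, 1: 1, 2: 1}  # targets 0 and 1: no gap
gapped = {0: 0, 1: 1, 2: 3}      # targets 0, 1 and 3: index 2 is missing

print(map_by_count(row, sequential))  # [0.25, 0.75]

try:
    map_by_count(row, gapped)
except IndexError as exc:
    # Three distinct targets give three columns, but index 3 needs four
    print("count-based sizing failed:", exc)

print(map_by_max(row, gapped))  # [0.25, 0.5, 0.0, 0.25]
```

Whether gapped targets can actually occur in a fitted model, and whether they would point to a deeper problem, is exactly the open question above.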