MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

Saving a trained model using pytorch and safetensor and then redownloading causes topics to be off #2198

Open SkylarOconnell opened 3 weeks ago

SkylarOconnell commented 3 weeks ago


Describe the bug

After training, I tried saving the model using both pytorch and safetensors serialization. When I re-download the model, load the files into BERTopic using BERTopic.load(), and run inference using transform(), all the topics come out differently than the original fit results. Below are some examples; in each pair, the first topic and probability are from the original training/fit of the model and the second from running transform():

Topic: 2 Probability: 0.9999999985560923 vs. Topic: 3 Probability: 0.9999477863311768

Topic: 1 Probability: 0.9993163446248252 vs. Topic: 2 Probability: 0.04614641437377926

Topic: 2 Probability: 1.0 vs. Topic: 3 Probability: 0.9591490626335144

One thing to note is that running transform over and over produces the same results, which differ from the original training output. Also, when I run transform on the original model, without saving it anywhere else, I get the same results as the original run. I was wondering if I am missing something with saving the model correctly. Below is the code I use to train, save, and run transform on the model. We also run reduce_outliers() before saving the model.

Reproduction

from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
from umap import UMAP
import numpy

self.model_params = {
    'min_topic_size': int((len(rows) / 160) - 1),
    'calculate_probabilities': True,
    'verbose': True,
    'umap_model': UMAP(
        n_neighbors=50,
        n_components=20,
        metric='cosine',
        low_memory=False,
        random_state=42,
    ),
}

self.model = BERTopic(**self.model_params)

# Fit on precomputed embeddings with (semi-)supervised labels
self.topics, self.probabilities = self.model.fit_transform(
    documents=self.docs,
    embeddings=numpy.array(self.embeddings),
    y=self.labels,
)

# Reassign outlier documents based on topic-document probabilities
new_topics = self.model.reduce_outliers(
    self.docs,
    self.topics,
    probabilities=self.probabilities,
    strategy='probabilities',
)
self.model.update_topics(self.docs, topics=new_topics)

# Save with safetensors serialization, then reload from the downloaded artifact
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
self.model.save(torch_file_path, serialization="safetensors", save_ctfidf=True, save_embedding_model=embedding_model)

new_model = BERTopic.load(artifact_path)
new_model_temp_topics, new_model_temp_probabilities = new_model.transform(documents=self.docs, embeddings=numpy.array(self.embeddings))

BERTopic Version

0.16.0

MaartenGr commented 3 weeks ago

You are using an older version of BERTopic, and I remember that there were some fixes since then. Could you try the latest version, 0.16.4, instead?

SkylarOconnell commented 3 weeks ago

Got it, trying that now!

SkylarOconnell commented 3 weeks ago

Just tried upgrading BERTopic to 0.16.4 and I still see the same issue.

Initial training:
Item 1: Topic: 3, Probability: 0.9999968824322634
Item 2: Topic: 2, Probability: 0.883032787750728
Item 3: Topic: 4, Probability: 0.9902231709346468

Inference/transform without saving:
Item 1: Topic: 3, Probability: 0.9999968824322634
Item 2: Topic: 2, Probability: 0.883032787750728
Item 3: Topic: 4, Probability: 0.9902231709346468

Inference/transform after saving and redownloading using safetensors:
Item 1: Topic: 4, Probability: 0.9999788403511047
Item 2: Topic: 3, Probability: 0.9999911785125732
Item 3: Topic: 5, Probability: 0.999993085861206

All topics (except outliers) come out exactly one higher than in the original run or from the original model without saving.
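A quick way to quantify the shift (a sketch, reusing the variables from my reproduction code above):

from collections import Counter

# Hypothetical check: by how much does each document's topic id shift?
shifts = Counter(
    new - old
    for old, new in zip(self.topics, new_model_temp_topics)
    if old != -1  # ignore outlier assignments
)
print(shifts)  # e.g. Counter({1: ...}) if every topic moved up by exactly one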

SkylarOconnell commented 3 weeks ago

@MaartenGr I just tried saving with pytorch as well and got the same issue

MaartenGr commented 2 weeks ago

Hmmm, this is quite unexpected. I'm a bit baffled here, considering these probabilities are extremely high.

My guess would be that there is something going wrong with reducing outliers before updating and then saving the model. What would happen if you didn't reduce outliers?
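For example, you could rerun your reproduction without the outlier-reduction step and compare the assignments (a sketch that reuses your variable names):

# Same pipeline as the reproduction, but skip reduce_outliers()/update_topics()
self.model.save(torch_file_path, serialization="safetensors", save_ctfidf=True, save_embedding_model=embedding_model)
loaded = BERTopic.load(artifact_path)
loaded_topics, _ = loaded.transform(documents=self.docs, embeddings=numpy.array(self.embeddings))

# Count how many assignments differ from the original fit
mismatches = sum(t1 != t2 for t1, t2 in zip(self.topics, loaded_topics))
print(f"{mismatches} of {len(self.topics)} assignments changed")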

SkylarOconnell commented 2 weeks ago

@MaartenGr Removing reduce_outliers fixes the issue, and now I am getting the same results between the initial training and the inference run after downloading. Is there a way to keep reduce_outliers, or is this a bug that would need to be fixed first?

MaartenGr commented 2 weeks ago

@SkylarOconnell I'm not actually sure why this is happening. It could be that by reducing outliers so much, it distorts the newly created topic embeddings (topic_model.topic_embeddings_). You could choose to save the topic embeddings before outlier reduction, and then re-assign them after reducing outliers.

SkylarOconnell commented 2 weeks ago

@MaartenGr Could you provide an example for this? I'm not really sure how to do that.

MaartenGr commented 2 weeks ago

@SkylarOconnell Sure!

# Track topic embeddings before reducing outliers
topic_embeddings = topic_model.topic_embeddings_

# Reduce outliers and update topics
new_topics = self.model.reduce_outliers(
    self.docs,
    self.topics,
    probabilities=self.probabilities,
    strategy='probabilities',
)
self.model.update_topics(self.docs, topics=new_topics)

# Reassign old topic embeddings
topic_model.topic_embeddings_ = topic_embeddings

When doing this, double-check that the old topic embeddings are correctly reassigned, as I'm not sure whether this creates a shallow or deep copy.
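For instance, something along these lines (a sketch, assuming topic_embeddings_ is a numpy array) sidesteps the copy question and lets you verify the result:

import numpy as np

# Take an explicit copy so later updates cannot mutate the saved array
topic_embeddings = topic_model.topic_embeddings_.copy()

# ... reduce_outliers() and update_topics() as above ...

# Check whether updating the topics actually changed the embeddings
print(np.array_equal(topic_model.topic_embeddings_, topic_embeddings))

# Restore the pre-reduction embeddings
topic_model.topic_embeddings_ = topic_embeddings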

SkylarOconnell commented 1 week ago

@MaartenGr Sorry for the delayed response.

When I add in the code above (changing topic_model to self.model, since we are using class variables), it goes back to the original issue. Could it be an issue/bug between reduce_outliers and pytorch/safetensors? reduce_outliers works and transform works until I save with those and redownload.

# Track topic embeddings before reducing outliers
topic_embeddings = self.model.topic_embeddings_

new_topics = self.model.reduce_outliers(
    self.docs,
    self.topics,
    probabilities=self.probabilities,
    strategy='probabilities',
)
self.model.update_topics(self.docs, topics=new_topics)

# Reassign old topic embeddings
self.model.topic_embeddings_ = topic_embeddings
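To narrow down where the drift happens, I can also compare the in-memory topic embeddings against the reloaded ones (a sketch, reusing artifact_path from above):

import numpy as np

# Hypothetical check: do the topic embeddings survive the save/load round trip?
loaded = BERTopic.load(artifact_path)
a, b = self.model.topic_embeddings_, loaded.topic_embeddings_
print(a.shape == b.shape and np.allclose(a, b))
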
MaartenGr commented 1 week ago

I'm not sure if I understand correctly. Just to make sure:

- After reassigning them, are the topic embeddings actually the same as the ones you saved before reducing outliers?
- How many of the topic assignments are off after saving and reloading?

SkylarOconnell commented 1 week ago

I will double check the top bullet and let you know. If the topic_embeddings are the same as the old embeddings, I will run a quick count to see how many are off. I'll respond here once I am able to do so.