Open SkylarOconnell opened 3 weeks ago
You are using an older version of BERTopic and I remember that there have been some fixes since then. Could you try it with the latest version (0.16.4) instead?
Got it, trying that now!
Just tried upgrading BERTopic to 0.16.4 and I still get the same issue.
Initial training:
Item 1: Topic: 3, Probability: 0.9999968824322634
Item 2: Topic: 2, Probability: 0.883032787750728
Item 3: Topic: 4, Probability: 0.9902231709346468

Inference/transform without saving:
Item 1: Topic: 3, Probability: 0.9999968824322634
Item 2: Topic: 2, Probability: 0.883032787750728
Item 3: Topic: 4, Probability: 0.9902231709346468

Inference/transform after saving and re-downloading using safetensors:
Item 1: Topic: 4, Probability: 0.9999788403511047
Item 2: Topic: 3, Probability: 0.9999911785125732
Item 3: Topic: 5, Probability: 0.999993085861206
All topics (except outliers) come out one higher than in the original run, or than with the original model before saving it.
@MaartenGr I also just tried saving with pytorch and got the same issue.
Hmmm, this is quite unexpected. I'm a bit baffled here considering these probabilities are extremely high.
My guess would be that there is something going wrong with reducing outliers before updating and then saving the model. What would happen if you didn't reduce outliers?
@MaartenGr Removing reduce_outliers fixes the issue, and now I am getting the same results between the initial training run and the inference run after downloading. Is there a way to keep reduce_outliers, or is this a bug that would need to be fixed first?
@SkylarOconnell I'm not actually sure why this is happening. It could be that reducing outliers so much distorts the newly created topic embeddings (topic_model.topic_embeddings_). You could save the topic embeddings before outlier reduction, and then re-assign them after reducing outliers.
@MaartenGr Could you provide an example for this? I'm not really sure how to do that.
@SkylarOconnell Sure!
```python
# Track topic embeddings before reducing outliers
topic_embeddings = topic_model.topic_embeddings_

# Reduce outliers and update topics
new_topics = topic_model.reduce_outliers(
    docs,
    topics,
    probabilities=probabilities,
    strategy="probabilities"
)
topic_model.update_topics(docs, topics=new_topics)

# Reassign the old topic embeddings
topic_model.topic_embeddings_ = topic_embeddings
```
When doing this, check whether the old topic embeddings were correctly re-assigned, as I'm not sure whether this creates a shallow or deep copy.
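If in doubt, you could force an explicit copy instead. A small sketch, assuming topic_embeddings_ is a NumPy array (which it is in recent BERTopic versions):

```python
import numpy as np

# Explicit copy: later in-place changes to the model's embeddings
# can then no longer mutate the array we saved
topic_embeddings = np.array(topic_model.topic_embeddings_, copy=True)
```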
@MaartenGr Sorry for the delayed response.
When I add in the code above (changing topic_model to self.model since we are using class variables), it goes back to the original issue. Could it be a bug in the interaction between reduce_outliers and the pytorch/safetensors serialization? Reducing outliers works, and transform works, until I save in either of those formats and re-download.
```python
topic_embeddings = self.model.topic_embeddings_

new_topics = self.model.reduce_outliers(
    self.docs,
    self.topics,
    probabilities=self.probabilities,
    strategy="probabilities"
)
self.model.update_topics(self.docs, topics=new_topics)

# Reassign old topic embeddings
self.model.topic_embeddings_ = topic_embeddings
```
I'm not sure if I understand correctly. Just to make sure: self.model.topic_embeddings_ now has the old topic embeddings, right? So we can be sure that the old topic embeddings are kept.

I will double-check the top bullet and let you know. If the topic_embeddings are the same as the old embeddings, I will run a quick count to see how many are off. I'll respond here once I am able to do so.
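Something like this could do that quick count. A sketch only; it assumes self.model.topic_embeddings_ and the saved topic_embeddings are NumPy arrays of the same shape (one row per topic):

```python
import numpy as np

# Count how many topic embedding rows changed after the
# reduce_outliers / save / load round trip
mismatched = int((~np.isclose(self.model.topic_embeddings_, topic_embeddings).all(axis=1)).sum())
print(f"{mismatched} of {len(topic_embeddings)} topic embeddings differ")
```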
Have you searched existing issues? Yes
Describe the bug
After training, I tried saving the model using both pytorch and safetensors serialization. When I re-download the model, load the files into BERTopic using BERTopic.load(), and run inference using transform(), all the topics come out differently than in the original fit results. Below are some examples; the first topic and probability are from the original training/fit of the model, and the second is from running transform():
Topic: 2 Probability: 0.9999999985560923 vs. Topic: 3 Probability: 0.9999477863311768
Topic: 1 Probability: 0.9993163446248252 vs. Topic: 2 Probability: 0.04614641437377926
Topic: 2 Probability: 1.0 vs. Topic: 3 Probability: 0.9591490626335144
One thing to note is that running transform repeatedly produces the same results, which differ from the original training output. Also, when I run transform on the original model without saving it anywhere, I get the same results as the original run. I am wondering whether I am missing something about saving the model correctly. Below is the code I use to train, save, and run transform on the model. We also run reduce_outliers() before saving the model.
Reproduction
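The reproduction code itself was not captured in this thread. Below is a minimal sketch of the train → reduce outliers → save → load → transform flow described above; the dataset, embedding model, and paths are illustrative assumptions, not the reporter's original setup:

```python
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
from sklearn.datasets import fetch_20newsgroups

docs = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes")).data[:1000]

# calculate_probabilities=True is needed for the "probabilities" outlier strategy
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
topic_model = BERTopic(embedding_model=embedding_model, calculate_probabilities=True)
topics, probs = topic_model.fit_transform(docs)

# Reduce outliers and update the topic representations before saving
new_topics = topic_model.reduce_outliers(docs, topics, probabilities=probs, strategy="probabilities")
topic_model.update_topics(docs, topics=new_topics)

# Save with safetensors serialization (serialization="pytorch" shows the same behavior)
topic_model.save(
    "my_model_dir",
    serialization="safetensors",
    save_ctfidf=True,
    save_embedding_model="sentence-transformers/all-MiniLM-L6-v2",
)

# Load the model back and run inference; topics come out shifted by one
loaded_model = BERTopic.load("my_model_dir")
loaded_topics, loaded_probs = loaded_model.transform(docs)
```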
BERTopic Version
0.16.0