MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.

https://maartengr.github.io/BERTopic/

MIT License

6.09k stars 759 forks

Loaded Model won't .fit or fit and transform #1584

Open ZhenDeDSML opened 1 year ago

ZhenDeDSML commented 1 year ago

Hello Maarten, I loved getting to learn BERTopic for a recent project. I got my model to work properly and saved it using this code:

embedding_model = "sentence-transformers/all-MiniLM-L6-v2"
stored_models_path = "./Stored_Models/"
model.save(stored_models_path, serialization="pytorch", save_ctfidf=True, save_embedding_model=embedding_model)

However, when I run

topics, probability = loaded_model.fit(doc2)

or just

loaded_model.fit(doc2)

I get this error:

ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (2,) + inhomogeneous part.

Do you know what I could be doing wrong? When I run the loaded_model cell, I get a value of

<bertopic._bertopic.BERTopic at 0x1ae00af4710>

so I think it saved properly. (I have also tried the pickling method, with no luck.)

MaartenGr commented 1 year ago

You generally do not want to fit a saved model again, because a saved model fits no better or worse than a new one. I would advise using a completely fresh model instead, since fitting a model, new or saved, will just create an entirely new one from scratch.

ZhenDeDSML commented 1 year ago

Hi Maarten, maybe I am being a bit naive, but how do we use the same model to predict on a different set of data? Say we want to train on 80% of the data and check the accuracy on the remaining 20%; would we be able to do something like that with BERTopic? I ask only because I spent a good amount of time merging the topics as well, and I was wondering whether those merged topics are preserved.

I am also assuming that a traditional train/test split doesn't work for this type of modeling.

Thanks again

MaartenGr commented 1 year ago

You can fit a model with either fit or fit_transform. After having created your model, you can use it to predict on a different set of data with .transform.
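A minimal sketch of that workflow, not taken from this thread: fit once on a training portion, then call .transform on the held-out portion. The helper names split_docs and train_then_predict are hypothetical, and the sketch assumes only that the model exposes fit_transform and transform, each returning a (topics, probabilities) pair, as BERTopic does.

```python
from typing import List, Sequence, Tuple


def split_docs(docs: Sequence[str], train_fraction: float = 0.8) -> Tuple[List[str], List[str]]:
    """Split documents into a leading training part and a trailing held-out part."""
    cut = int(len(docs) * train_fraction)
    return list(docs[:cut]), list(docs[cut:])


def train_then_predict(topic_model, train_docs: Sequence[str], test_docs: Sequence[str]):
    """Fit once on train_docs, then assign topics to test_docs without refitting.

    topic_model is assumed to be a BERTopic-like object (hypothetical helper).
    """
    # fit_transform learns the topics and returns assignments for the training docs.
    train_topics, _ = topic_model.fit_transform(list(train_docs))
    # transform only assigns the already-learned topics to unseen documents.
    test_topics, _ = topic_model.transform(list(test_docs))
    return list(train_topics), list(test_topics)
```

With BERTopic installed, this would be called as train_then_predict(BERTopic(), *split_docs(docs)). Calling .fit or .fit_transform a second time on the same model would start over, which is why the held-out documents go through .transform only.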

ZhenDeDSML commented 1 year ago

Thanks Maarten. I was just wondering whether there is any way for the merged topics to be saved when retraining, but from my understanding this is not possible given the nature of topic modeling in general?

MaartenGr commented 1 year ago

This is actually not possible for most models in general. Calling fit on most models retrains them completely from scratch without keeping their previous state. What you are describing is online topic modeling, akin to partial_fit. Instead, it might be worthwhile to train an entirely new model and merge it with a previous model.

This merging of models is new and implemented in the main branch of this repo. You can find some documentation for it here.
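A rough sketch of that approach, assuming the BERTopic.merge_models class method and its min_similarity parameter behave as described in the BERTopic documentation. The helper name extend_with_new_data and the model_cls parameter are hypothetical and exist only so the logic can be exercised without BERTopic installed.

```python
def extend_with_new_data(previous_model, new_docs, min_similarity=0.7, model_cls=None):
    """Train a fresh model on new_docs and merge it into previous_model.

    Hypothetical helper: model_cls defaults to bertopic.BERTopic and is
    imported lazily, so the function can be tried with a stand-in class.
    """
    if model_cls is None:
        from bertopic import BERTopic as model_cls  # requires bertopic to be installed

    # A brand-new model is trained from scratch on the new data only.
    new_model = model_cls().fit(list(new_docs))

    # The first model in the list acts as the base; topics from later models
    # are expected to be added only when they are not similar enough to an
    # existing topic (assumption based on the documented min_similarity).
    return model_cls.merge_models([previous_model, new_model], min_similarity=min_similarity)
```

The previously trained model goes first in the list, so its topics, including any merged by hand, stay in place and the new model only contributes topics the old one lacks.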

ZhenDeDSML commented 1 year ago

Hi Maarten, thanks for the insight. What I was referring to was the merge that is done manually, like this:

merged_topics = [
    # merge topic 1
    [10, 11, 13]
]

You're saying there's no way to save these merges in this exact fashion without having to manually redo them each time we want to pare down our number of topics?

MaartenGr commented 1 year ago

It depends on what you want but generally, if you have merged topics after having created a topic model, then you can absolutely save the results.

However, if you want to retrain the model, then it makes sense that all previous results will be discarded. Retraining nearly always means from scratch.

Having said that, if you have merged topics that you would like to keep and you also want to find new, previously undiscovered topics, then you have to create a new model on new data and merge that model with the previous model.
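To illustrate the first point, here is a hedged sketch of merging topics on an already fitted model and then saving the result, reusing the save arguments from the opening post. The helper names merge_and_save and reload_model and the model_cls parameter are hypothetical. The calls merge_topics(docs, topics_to_merge), save(...) and BERTopic.load(path) are assumed to behave as in the BERTopic documentation.

```python
def merge_and_save(topic_model, docs, topics_to_merge, path, embedding_model_name):
    """Merge topics on a fitted model, then persist the merged state.

    Hypothetical helper: docs must be the documents the model was fitted on.
    """
    # Example value from the thread: topics_to_merge = [[10, 11, 13]]
    topic_model.merge_topics(list(docs), topics_to_merge)
    # Same arguments as the save call in the opening post.
    topic_model.save(
        path,
        serialization="pytorch",
        save_ctfidf=True,
        save_embedding_model=embedding_model_name,
    )


def reload_model(path, model_cls=None):
    """Load the saved model; do not call fit on it, or the merges are lost."""
    if model_cls is None:
        from bertopic import BERTopic as model_cls  # requires bertopic to be installed
    return model_cls.load(path)
```

A model reloaded this way can be used with .transform on new documents, and the merged topics remain as saved. Only a new call to .fit or .fit_transform discards them.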