MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License
6.07k stars 756 forks

Facing problem when loading BERTopic Online Model #1703

Open ShyamGanesh13 opened 10 months ago

ShyamGanesh13 commented 10 months ago

Hi @MaartenGr ,

**I am facing the issue below when I try to load an online BERTopic model from disk.

Issue --> 'In order to use .partial_fit, the cluster model should have a .partial_fit function.'

I debugged the code flow and found that BERTopic.load() calls _create_model_from_files(), which loads the saved model. The logic goes like this:

    empty_dimensionality_model = BaseDimensionalityReduction()
    empty_cluster_model = BaseCluster()

    # Fit BERTopic without actually performing any clustering
    topic_model = BERTopic(
        embedding_model=embedding_model,
        umap_model=empty_dimensionality_model,
        hdbscan_model=empty_cluster_model,
        **params
    )

For online topic models, we use DBSTREAM from River as the cluster model. However, when we save the model with BERTopic.save() in safetensors format, the cluster model is not stored. As a result, loading creates the model with the empty BaseCluster placeholder shown above, which has no .partial_fit function, and the error is raised.
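The situation described above can be checked before calling .partial_fit. The following is a minimal sketch, not BERTopic's own code: the helper name missing_partial_fit and the two stub classes are hypothetical, and it assumes only that a BERTopic instance keeps its dimensionality reduction and cluster models in the umap_model and hdbscan_model attributes, as the constructor call above suggests. Stand-in objects are used so the snippet runs without BERTopic installed.

```python
from types import SimpleNamespace


def missing_partial_fit(topic_model):
    """Return the names of sub-models that do not expose .partial_fit.

    Hypothetical helper: it only inspects the umap_model and hdbscan_model
    attributes and never trains anything.
    """
    missing = []
    for name in ("umap_model", "hdbscan_model"):
        sub_model = getattr(topic_model, name, None)
        if not hasattr(sub_model, "partial_fit"):
            missing.append(name)
    return missing


class OnlineStub:
    """Stand-in for an online sub-model such as an incremental clusterer."""

    def partial_fit(self, X):
        return self


class EmptyStub:
    """Stand-in for a placeholder sub-model that has no .partial_fit."""


# Stand-ins for a model as trained and for the same model after reloading
# from a format that does not store the sub-models.
trained = SimpleNamespace(umap_model=OnlineStub(), hdbscan_model=OnlineStub())
reloaded = SimpleNamespace(umap_model=EmptyStub(), hdbscan_model=EmptyStub())

print(missing_partial_fit(trained))   # []
print(missing_partial_fit(reloaded))  # ['umap_model', 'hdbscan_model']
```

Passing a real BERTopic instance returned by BERTopic.load() to the same helper should show whether its sub-models can continue online training.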

Is there a quick fix for this issue?**

MaartenGr commented 10 months ago

Since you saved the model as safetensors, there is unfortunately no fix for this issue. With safetensors, the cluster and dimensionality reduction models are not saved. The .partial_fit function needs the fitted models in order to continue training; without them, this is not possible.

Instead, you will have to retrain your model and save it using pickle. However, I would advise using the newly released .merge_models functionality to perform your incremental learning.
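Both options can be sketched roughly as follows. This is an illustration under assumptions, not a tested recipe for this exact setup: the helper names chunked, train_and_pickle, and merge_batches are hypothetical, and scikit-learn's IncrementalPCA and MiniBatchKMeans stand in for the River DBSTREAM model from the question. Each chunk needs enough documents for those scikit-learn models to accept it.

```python
def chunked(docs, size):
    """Yield consecutive slices of docs with at most `size` items each."""
    for start in range(0, len(docs), size):
        yield docs[start:start + size]


def train_and_pickle(doc_chunks, path):
    """Train an online model chunk by chunk, then save it with pickle.

    Pickle serialization stores the whole object, including the fitted
    dimensionality reduction and cluster models.
    """
    # Local imports keep chunked() usable without bertopic installed.
    from bertopic import BERTopic
    from bertopic.vectorizers import OnlineCountVectorizer
    from sklearn.cluster import MiniBatchKMeans
    from sklearn.decomposition import IncrementalPCA

    topic_model = BERTopic(
        umap_model=IncrementalPCA(n_components=5),
        hdbscan_model=MiniBatchKMeans(n_clusters=50, random_state=0),
        vectorizer_model=OnlineCountVectorizer(stop_words="english"),
    )
    for docs in doc_chunks:
        topic_model.partial_fit(docs)
    topic_model.save(path, serialization="pickle")
    return topic_model


def merge_batches(doc_chunks, min_similarity=0.7):
    """Fit one regular model per chunk and combine them with merge_models."""
    from bertopic import BERTopic

    models = [BERTopic().fit(docs) for docs in doc_chunks]
    return BERTopic.merge_models(models, min_similarity=min_similarity)
```

With the first route, a model reloaded through BERTopic.load(path) should still carry its fitted sub-models, so .partial_fit can continue. With the second route, each batch is fitted independently and no .partial_fit is needed, so the merged model can be saved in any format.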