MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

Updating and Pushing a BERTopic Model with New Documents to Hugging Face Hub still shows the old number of training documents #2071

Open sdave-connexion opened 4 months ago

sdave-connexion commented 4 months ago

Have you searched existing issues? 🔎

Describe the bug

I have been using BERTopic for topic modelling and recently needed to update my existing BERTopic model with new documents. I want to push the updated model to the Hugging Face Hub, ensuring that it reflects the new number of documents and topics.

Here’s what I’ve done so far:

`new_topics, new_probs = topic_model.transform(lemmatized_docs, embeddings)`

Despite following these steps, I still see the old number of training documents in the repository on the Hugging Face Hub. How can I ensure that the updated model reflects the new number of training documents and topics?

Any help or guidance on this would be greatly appreciated!

Reproduction

```python
from bertopic import BERTopic

# Load the existing BERTopic model
topic_model = BERTopic.load(
    "shantanudave/BERTopic_ArXiv",
    embedding_model="sentence-transformers/all-MiniLM-L6-v2",
)

# Predict topics for the new documents (lemmatized_docs) using their precomputed embeddings
new_topics, new_probs = topic_model.transform(lemmatized_docs, embeddings)

new_model_name = "BERTopic_v2"

# Save the updated model locally using safetensors
embedding_model = "sentence-transformers/all-MiniLM-L6-v2"
topic_model.save(
    new_model_name,
    serialization="safetensors",
    save_ctfidf=True,
    save_embedding_model=embedding_model,
)

from huggingface_hub import login

# Authenticate with Hugging Face
login(token="your_hugging_face_token")

# Push the updated model to the Hugging Face Hub
topic_model.push_to_hf_hub(
    repo_id=f"shantanudave/{new_model_name}",
    serialization="safetensors",
    save_ctfidf=True,
    save_embedding_model=embedding_model,
)
```

BERTopic Version

pip install -U bertopic

MaartenGr commented 4 months ago

> Updated the model with new documents

That's the thing: you didn't update the model. When you use .transform, you are merely predicting the topics of the documents you pass to it. .transform, as it is used in scikit-learn, is not meant to update the underlying model. Instead, if you want to update the model, I would advise using either online topic modeling or the .merge_models technique.
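A rough sketch of the .merge_models route, assuming `lemmatized_docs` holds the new documents and reusing the repo names from the reproduction above, could look like this:

```python
from bertopic import BERTopic

# Load the existing model and fit a separate model on only the new documents.
# "lemmatized_docs" is assumed to be the list of new documents.
base_model = BERTopic.load(
    "shantanudave/BERTopic_ArXiv",
    embedding_model="sentence-transformers/all-MiniLM-L6-v2",
)
new_model = BERTopic(
    embedding_model="sentence-transformers/all-MiniLM-L6-v2"
).fit(lemmatized_docs)

# Merge the two models: topics from new_model that are not found in
# base_model are added to it.
merged_model = BERTopic.merge_models([base_model, new_model])

# The merged model can then be saved and pushed as before.
merged_model.push_to_hf_hub(
    repo_id="shantanudave/BERTopic_v2",
    serialization="safetensors",
    save_ctfidf=True,
    save_embedding_model="sentence-transformers/all-MiniLM-L6-v2",
)
```

The merged model should then reflect both the original training documents and the topics discovered in the new batch.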

ShakilMahmudShuvo commented 3 months ago

@MaartenGr In my case, new data comes in every two days, so I am planning to:

  1. Load the existing model
  2. Update the model using Online Topic Modeling.
  3. Save the model

Is this approach correct? Or is there an easier way? Thanks in advance.

MaartenGr commented 3 months ago

You can only do this if step 1 was also done with online topic modeling. You cannot use .partial_fit after .fit at the moment. Instead, I would advise using the .merge_models technique to iteratively combine new models.
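To illustrate that constraint, here is a minimal sketch of a model set up for online topic modeling from the start, so that later batches can be passed to .partial_fit. The specific sub-models and the `new_docs` variable are illustrative choices, not requirements:

```python
from sklearn.cluster import MiniBatchKMeans
from sklearn.decomposition import IncrementalPCA
from bertopic import BERTopic
from bertopic.vectorizers import OnlineCountVectorizer

# Sub-models that support partial_fit, so the whole pipeline can be updated incrementally.
umap_model = IncrementalPCA(n_components=5)
cluster_model = MiniBatchKMeans(n_clusters=50, random_state=0)
vectorizer_model = OnlineCountVectorizer(stop_words="english", decay=0.01)

topic_model = BERTopic(
    umap_model=umap_model,
    hdbscan_model=cluster_model,
    vectorizer_model=vectorizer_model,
)

# Every two days, update the same model with the incoming batch.
# "new_docs" is a placeholder for that batch of documents.
topic_model.partial_fit(new_docs)

# Pickle serialization keeps the sub-models, so the saved model should be
# reloadable and updatable again with .partial_fit on the next batch.
topic_model.save("bertopic_online_model", serialization="pickle")
```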