MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License
6.13k stars 764 forks

false WARNING upon BERTopic.load #1757

Open Neehier opened 9 months ago

Neehier commented 9 months ago

On a fresh environment run, loading a BERTopic model using load results in a false warning about a missing explicit definition of embedding_model.

model = BERTopic.load(path, embedding_model="paraphrase-multilingual-MiniLM-L12-v2")
BERTopic - WARNING: You are loading a BERTopic model without explicitly defining an embedding model. If you want to also load in an embedding model, make sure to use BERTopic.load(my_model, embedding_model=my_embedding_model).

Despite the warning, the model seems to be loaded in with no issue and the embedding model works as expected.
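Since the model itself works, the false warning can at least be muted until the underlying check is fixed. A minimal sketch, assuming the library logs through a standard Python logger named "BERTopic" (the logger name is an assumption based on the warning prefix, not verified against the library source):

```python
import logging

# Assumption: BERTopic's internal logger is registered as "BERTopic".
# Raising its level suppresses WARNING-level messages such as the
# false positive above, while genuine errors still come through.
logging.getLogger("BERTopic").setLevel(logging.ERROR)

print(logging.getLogger("BERTopic").isEnabledFor(logging.WARNING))  # False
print(logging.getLogger("BERTopic").isEnabledFor(logging.ERROR))   # True
```

Note this hides all warnings from that logger, so it is only a stopgap while the false positive is being investigated.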

MaartenGr commented 9 months ago

Could you share your full code for training, saving, and loading the model? Also, are you using the latest release (v0.16) or perhaps from the latest commit on the main branch itself?

Neehier commented 9 months ago

I am indeed using v0.16. The model I am loading is originally a merged model.

# Imports for the snippet (UMAP and HDBSCAN may be the GPU-accelerated
# cuML versions, as noted below):
from umap import UMAP
from hdbscan import HDBSCAN
from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired

umap_model = UMAP(n_components=15, n_neighbors=5, min_dist=0.0)
hdbscan_model = HDBSCAN(min_cluster_size=3, prediction_data=True)
representation_model = KeyBERTInspired()

base_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model, representation_model=representation_model, calculate_probabilities=True, language='multilingual', verbose=True)
base_model.fit(docs1)

base_second_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model, representation_model=representation_model, calculate_probabilities=True, language='multilingual', verbose=True)
base_second_model.fit(docs2)

merged_model = BERTopic.merge_models([base_model, base_second_model])

merged_model.save(path, serialization='pytorch')
Neehier commented 9 months ago

I'm not sure if this is relevant to add, but I am using the GPU-accelerated HDBSCAN and UMAP from cuML.

MaartenGr commented 9 months ago

Hmmm, not sure what is happening here. There might be some strict checking done in .load even though you are passing an embedding model. Upon further inspection, it might be related to this:

https://github.com/MaartenGr/BERTopic/blob/6316c1e80b5247e1d31d5c41ed2d62ff4bf99e6b/bertopic/_bertopic.py#L3058

Perhaps that type checking needs to be removed. Could you test whether that works?
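As an illustration of how such a strict check could misfire, here is a stripped-down mock (all names and logic are hypothetical simplifications for this thread, not BERTopic's actual code): a check that only accepts an already-wrapped backend would warn on a model passed as a plain string, even though the string is later resolved into a working backend.

```python
class SentenceTransformerBackend:
    """Stand-in for a wrapped sentence-transformers model (mock)."""
    def __init__(self, model_name):
        self.model_name = model_name

def load_embedding_model(embedding_model):
    """Mock of a load-time check that is stricter than it should be."""
    warnings = []
    # A plain string fails this isinstance check, so the warning fires...
    if not isinstance(embedding_model, SentenceTransformerBackend):
        warnings.append(
            "You are loading a BERTopic model without explicitly "
            "defining an embedding model."
        )
    # ...even though the string is still resolved into a usable backend.
    if isinstance(embedding_model, str):
        embedding_model = SentenceTransformerBackend(embedding_model)
    return embedding_model, warnings

backend, warns = load_embedding_model("paraphrase-multilingual-MiniLM-L12-v2")
print(len(warns))               # 1 -- the false warning fired
print(backend.model_name)       # paraphrase-multilingual-MiniLM-L12-v2
```

This matches the reported symptom: a warning is emitted, yet the embedding model works as expected afterwards.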

Neehier commented 9 months ago

Unfortunately, the warning persists even after removing the type check. I'll do some more checks and investigation after my exams.

MaartenGr commented 9 months ago

Thanks for checking! I'll make sure to leave this open for your update.

balcse commented 7 months ago

Hello,

I have the same issue and tried to look into it but did not find a solution.

I'm using BERTopic version 0.16.0 and sentence_transformers version 2.5.1. What I'm trying to do is load a model from a directory (serialised as safetensors), and it seems that the embedding model does not get included as a parameter in the block at line 3051: https://github.com/MaartenGr/BERTopic/blob/8985f26d4ee89b4c512ff9da22a61371c20605b8/bertopic/_bertopic.py#L3138C1-L3139C118

For this reason the try statement fails, and the fallback selects BaseEmbedder(): https://github.com/MaartenGr/BERTopic/blob/8985f26d4ee89b4c512ff9da22a61371c20605b8/bertopic/_bertopic.py#L4463C1-L4464C89

This was just a quick check and I did not find a working solution either, but it might help in finding the cause of the problem.
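The fallback behaviour described above can be sketched with a small mock (hypothetical names; not the library's actual code): if the serialized parameters lack the embedding model entry, the try block fails and a bare BaseEmbedder is silently chosen, which would then trigger the warning on load.

```python
class BaseEmbedder:
    """Stand-in for a do-nothing default embedder (mock)."""

class SentenceTransformerBackend(BaseEmbedder):
    """Stand-in for a real sentence-transformers backend (mock)."""
    def __init__(self, model_name):
        self.model_name = model_name

def select_embedding_model(params):
    # Mirrors the pattern under discussion: try to build the real backend
    # from the saved parameters, fall back to BaseEmbedder on failure.
    try:
        return SentenceTransformerBackend(params["embedding_model"])
    except KeyError:
        return BaseEmbedder()

# Saved parameters missing the embedding model -> silent fallback:
fallback = select_embedding_model({})
print(type(fallback).__name__)  # BaseEmbedder
```

Under this reading, the fix would be either to record the embedding model in the saved parameters or to respect the embedding_model argument passed to load before falling back.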

MaartenGr commented 7 months ago

@balcse I believe there are a couple of fixes for this in the main branch of BERTopic. I would advise installing BERTopic from the main branch to potentially fix the issue. Do note that the embedding model is only saved if you use save_embedding_model="some_string" when saving the model. If not, then you can use the embedding_model parameter in .load.

balcse commented 7 months ago

Thanks for the quick reply. It is the main branch I'm using; I linked the wrong version in my comment. Saving the embedding model does seem to work fine, though.

GabyMU commented 2 weeks ago

Just wondering, has this issue been solved?