Closed pinnareet closed 1 month ago
There is a lot going on in your notebook, so it's difficult for me to understand what exactly is happening here. However, looking at the error message and how you saved the model, it seems like there is an issue with your environment.
When you do this:

```python
model.save(f"allMin{clusterSize}.h5")
```

you save the model using `pickle` and not the `.h5` format that you specified. When you use `pickle` to save and load any model (not just BERTopic; the same applies to scikit-learn, for instance), you need to make sure that the environments in which you save and load the model are exactly the same: the same Python version, the same dependencies (including BERTopic, transformers, NumPy, etc.), and even the same OS. Pickle is finicky like that, since it makes an exact copy of the current state.
When you cannot perfectly control the environment, I would advise using either `safetensors` or `pickle` instead, as is mentioned in the best practices guide. Also, there have been a bunch of fixes since 0.16.0, so installing v0.16.2 would be preferred. I believe it may also affect choosing an embedding model when loading a saved BERTopic instance.
"I would advise using either safetensors or pickle" but in the paragraph before you were saying that I should not save with pickle. Could you please clarify?
By the way, I need the model to replicate exactly the same results every time. That's why saving it as safetensors may not work...
And if I were to replicate the environment, should I stick with bertopic 0.16.0 or update to 0.16.2? Thank you!
> "I would advise using either safetensors or pickle" but in the paragraph before you were saying that I should not save with pickle. Could you please clarify?
Oops, a typo! I meant to say "either `safetensors` or `pytorch`".
> By the way, I need the model to replicate exactly the same results every time. That's why saving it as safetensors may not work...
The output of BERTopic when fitting and transforming the first time using `.fit_transform` will remain the same regardless. `.transform` is likely to generate different results either way, since HDBSCAN uses a different method for predicting new instances than for assigning datapoints during training.
> And if I were to replicate the environment, should I stick with bertopic 0.16.0 or update to 0.16.2? Thank you!
Definitely the latest version, which is at the moment 0.16.2. Note that it is not sufficient to only pin BERTopic; you will have to pin every version in your environment, since pickle (to a certain extent) expects the same environment.
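Concretely, pinning means freezing every package, not just BERTopic. A hypothetical `requirements.txt` might look like this (the pin list is illustrative, using only versions mentioned in this thread; the real list would be much longer):

```
bertopic==0.16.2
transformers==4.41.0
safetensors==0.4.3
huggingface_hub==0.23.0
```

Such a file is typically produced with `pip freeze > requirements.txt` in the environment where the model was pickled, and restored with `pip install -r requirements.txt` before loading it.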
Error from this line: `topic_distr, _ = model.approximate_distribution(reviewsList, use_embedding_model=True)`
Please reproduce here: https://colab.research.google.com/drive/1ncTM6lk6ZOfnwSFr_jaXL2zwwvWA-JRj?usp=sharing
Run using Google Colaboratory
- transformers version: 4.41.0
- Platform: Linux-6.1.85+-x86_64-with-glibc2.35
- Python version: 3.10.12
- Huggingface_hub version: 0.23.0
- Safetensors version: 0.4.3
- Accelerate version: not installed
- Accelerate config: not found
- PyTorch version (GPU?): 2.3.0+cu121 (True)
- Tensorflow version (GPU?): 2.15.0 (True)
- Flax version (CPU?/GPU?/TPU?): 0.8.3 (gpu)
- Jax version: 0.4.26
- JaxLib version: 0.4.26
- Using GPU in script?: Yes
- Using distributed or parallel set-up in script?: No