MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License
6.21k stars 767 forks

Loading saved model error problem #804

Closed satrio-yudhoatmojo closed 1 year ago

satrio-yudhoatmojo commented 2 years ago

I ran fit and transform on a different, higher-specification machine and saved the model; I also reduced the topics and saved that result.

But when I try to load the saved models on my own computer, I get this error:

from bertopic import BERTopic
topic_model = BERTopic.load("mymodel.bin")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\****\AppData\Local\Programs\Python\Python310\lib\site-packages\bertopic\_bertopic.py", line 2197, in load
    topic_model = joblib.load(file)
  File "C:\Users\****\AppData\Local\Programs\Python\Python310\lib\site-packages\joblib\numpy_pickle.py", line 577, in load
    obj = _unpickle(fobj)
  File "C:\Users\****\AppData\Local\Programs\Python\Python310\lib\site-packages\joblib\numpy_pickle.py", line 506, in _unpickle
    obj = unpickler.load()
  File "C:\Users\****\AppData\Local\Programs\Python\Python310\lib\pickle.py", line 1213, in load
    dispatch[key[0]](self)
  File "C:\Users\****\AppData\Local\Programs\Python\Python310\lib\pickle.py", line 1590, in load_reduce
    stack[-1] = func(*args)
  File "C:\Users\****\AppData\Local\Programs\Python\Python310\lib\site-packages\numba\core\serialize.py", line 97, in _unpickle__CustomPickled
    ctor, states = loads(serialized)
    ctor, states = loads(serialized)
TypeError: 'bytes' object cannot be interpreted as an integer

But when I load them on the machine where I did the fit, transform, and topic reduction, it works just fine.

MaartenGr commented 2 years ago

I believe I have not seen this error before. However, did you make sure that the environment in which you saved the model has the same structure as the one in which you loaded it? It is important that the dependencies are kept the same across environments; changes may break the model.
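A quick way to compare the two environments is to print the installed versions of the packages most likely to cause pickle incompatibilities in both places. This is a generic stdlib check, not a BERTopic utility; the package list below is an assumption based on the traceback:

```python
# Print versions of packages that commonly break pickled models when
# they differ between environments. importlib.metadata is stdlib on
# Python 3.8+; packages that are absent are reported as such.
from importlib.metadata import version, PackageNotFoundError

for pkg in ["bertopic", "joblib", "numba", "hdbscan", "umap-learn", "numpy"]:
    try:
        print(f"{pkg}=={version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg} is not installed")
```

Running this on both machines and diffing the output should reveal any mismatch (here, the numba version is the prime suspect).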

shashankgaur3 commented 2 years ago

You need to load the file as a stream; the Folder object is only the location where the file resides. The code below solves the issue for me:

import shutil

# Copy the saved model out of the (Dataiku) folder as a byte stream
with Topic_Model_folder.get_download_stream("<filename>") as stream, \
        open("<filename>", "wb") as f_local:
    shutil.copyfileobj(stream, f_local)

cu_topic_model = BERTopic.load("cu_Topic_Model")

This solves the problem of loading the model, but prediction on a new dataset still fails: everything works if fit and transform run in the same session, but transform alone fails in a different session, even though I use the same code environment for both. Steps followed:

  1. fit_transform using cuml HDBSCAN and UMAP
  2. Save the model
  3. Load the Model
  4. Use approximate_predict to score (https://github.com/MaartenGr/BERTopic/issues/732#issuecomment-1292655109)
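The save-then-predict-in-a-new-session pattern from the steps above can be sketched on CPU. Since cuML's HDBSCAN and `approximate_predict` are GPU-specific, this stand-in uses scikit-learn's KMeans with joblib persistence purely to illustrate the two-session flow; the data, model, and file path are all illustrative:

```python
# Stand-in for: fit in one session, save, then load and predict in another.
# KMeans replaces cuML HDBSCAN here; the round-trip mechanics are the same.
import os
import tempfile

import joblib
import numpy as np
from sklearn.cluster import KMeans

# --- "session 1": fit and save ---
rng = np.random.default_rng(42)
X_train = rng.normal(size=(100, 5))
clusterer = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X_train)
path = os.path.join(tempfile.mkdtemp(), "clusterer.bin")
joblib.dump(clusterer, path)

# --- "session 2": load and predict on new data ---
loaded = joblib.load(path)
X_new = rng.normal(size=(10, 5))
labels = loaded.predict(X_new)
print(labels.shape)  # (10,)
```

With identical package versions on both sides this round-trip works, which is why a version mismatch (or a GPU object pickled in a session whose CUDA context is gone) is the usual culprit when step 4 fails.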

If you run all 4 steps in one session it works, but if you fit and save in one session, then approximate_predict fails while scoring in another session, with the error below:

[23:23:34] [INFO] [dku.utils] - [2022-10-25 23:23:33,065] [39/MainThread] [DEBUG] [filelock] Lock 140316465655712 released on /opt/dataiku/code-env/resources/sentence_transformers/sentence-transformers_all-MiniLM-L6-v2/modules.json.lock
[23:23:34] [INFO] [dku.utils] - [2022-10-25 23:23:33,366] [39/MainThread] [INFO] [sentence_transformers.SentenceTransformer] Use pytorch device: cuda
[23:23:39] [INFO] [dku.utils] - [2022-10-25 23:23:39,733] [1/MainThread] [ERROR] [root] Containerized process terminated by signal 117
[23:23:39] [INFO] [dku.utils] - [2022-10-25 23:23:39,734] [1/MainThread] [INFO] [root] Sending error.json to backend/JEK
[23:23:39] [INFO] [dku.utils] - [2022-10-25 23:23:39,734] [1/MainThread] [DEBUG] [root] Verifying SSL calls with certificate /home/dataiku/rpc_server_cert.pem

MaartenGr commented 1 year ago

Due to inactivity, I'll be closing this issue for now. Feel free to reach out if you want to continue this discussion or re-open the issue!