MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License
6.19k stars 765 forks

topic_model.transform(training_data) results in KeyError: 6467 #2161

Closed: registjl closed this issue 1 month ago

registjl commented 1 month ago

Discussed in https://github.com/MaartenGr/BERTopic/discussions/2160

Originally posted by **registjl**, September 26, 2024:

Hi -- I'm new to NLP and loving BERTopic! I've created a model, `topic_model`. `topic_model.fit_transform(training_data)` works just fine and results in a model that I've saved to disk.

When I load the model in another script, I run `training_data` back through it to check that I get the same results:

- `fit_transform` runs fine.
- However, `topic_model.transform(training_data)` returns `KeyError: 6467`.

I'm assuming that the `topic_model.transform` call should return the topics associated with the data that I pass to it. Is this correct? Any ideas/guidance are greatly appreciated!

JLR
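(For context, the round trip being attempted looks roughly like the sketch below. It uses the standard 20 Newsgroups example data and a default BERTopic configuration rather than the original data and model; `transform` does return the topic assignments, and probabilities, for whatever documents you pass in.)

```python
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

# Toy data standing in for the (unavailable) training_data
docs = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes")).data[:1000]

topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)   # fit and get topics for the training docs

topic_model.save("./example_model")               # default (pickle) serialization
loaded_model = BERTopic.load("./example_model")

# transform() predicts topics (and probabilities) for the documents passed in
new_topics, new_probs = loaded_model.transform(docs)
```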
MaartenGr commented 1 month ago

Thanks for converting this into an issue! Sorry to be a bit more annoying but I'll need some more information. Which version of BERTopic do you have? Also, could you provide the full code? That includes both fitting the model as well as how you saved and loaded it again.

Lastly, could you provide the full error log?

registjl commented 1 month ago

Hi Maarten - thanks for your help. Let me know if you need any additional info.

- CODE TO FIT THE MODEL (I'm limited in what I can share):

from bertopic import BERTopic
from transformers import AutoModel, AutoTokenizer
from umap import UMAP
from hdbscan import HDBSCAN

embedding_model = AutoModel.from_pretrained("cardiffnlp/tweet-topic-21-multi")

# Hyperparameters for this iteration
iteration_sel = "163"
n_neighbors = 10
n_components = 2
min_dist = 0.0
min_cluster_size = 15
min_samples = 1

umap_model = UMAP(n_neighbors=n_neighbors, n_components=n_components, min_dist=min_dist)
hdbscan_model = HDBSCAN(min_cluster_size=min_cluster_size, min_samples=min_samples, prediction_data=True)

topic_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model,
                       calculate_probabilities=True, embedding_model=embedding_model)
topics, probabilities = topic_model.fit_transform(TRAIN_DATA['clean_text'])

model_name = "./BERTopic_trained_model_163"
topic_model.save(model_name)

- CODE WHICH LOADS AND EXECUTES THE SAVED MODEL:

topic_model_name = './BERTopic_trained_model_163'
topic_model = BERTopic.load(topic_model_name)
topics, probabilities = topic_model.transform(TEST_DATA['clean_text'])

Traceback (most recent call last):
  File "C:\Users....\venv\lib\site-packages\pandas\core\indexes\base.py", line 3805, in get_loc
    return self._engine.get_loc(casted_key)
  File "index.pyx", line 167, in pandas._libs.index.IndexEngine.get_loc
  File "index.pyx", line 196, in pandas._libs.index.IndexEngine.get_loc
  File "pandas\_libs\hashtable_class_helper.pxi", line 2606, in pandas._libs.hashtable.Int64HashTable.get_item
  File "pandas\_libs\hashtable_class_helper.pxi", line 2630, in pandas._libs.hashtable.Int64HashTable.get_item
KeyError: 88

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\Users....\venv\lib\site-packages\IPython\core\interactiveshell.py", line 3577, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "", line 193, in <module>
    topics, probabilities = topic_model.transform(TEST_DATA['clean_text'])
  File "C:\Users....\venv\lib\site-packages\bertopic\_bertopic.py", line 578, in transform
    embeddings = self._extract_embeddings(documents, images=images, method="document", verbose=self.verbose)
  File "C:\Users....\venv\lib\site-packages\bertopic\_bertopic.py", line 3676, in _extract_embeddings
    embeddings = self.embedding_model.embed_documents(documents, verbose=verbose)
  File "C:\Users...\venv\lib\site-packages\bertopic\backend\_base.py", line 62, in embed_documents
    return self.embed(document, verbose)
  File "C:\Users....\venv\lib\site-packages\bertopic\backend\_sentencetransformers.py", line 65, in embed
    embeddings = self.embedding_model.encode(documents, show_progress_bar=verbose)
  File "C:\Users....\venv\lib\site-packages\sentence_transformers\SentenceTransformer.py", line 157, in encode
    sentences_sorted = [sentences[idx] for idx in length_sorted_idx]
  File "C:\Users....\venv\lib\site-packages\sentence_transformers\SentenceTransformer.py", line 157, in <listcomp>
    sentences_sorted = [sentences[idx] for idx in length_sorted_idx]
  File "C:\Users....\venv\lib\site-packages\pandas\core\series.py", line 1121, in __getitem__
    return self._get_value(key)
  File "C:\Users....\venv\lib\site-packages\pandas\core\series.py", line 1237, in _get_value
    loc = self.index.get_loc(label)
  File "C:\Users....\venv\lib\site-packages\pandas\core\indexes\base.py", line 3812, in get_loc
    raise KeyError(key) from err
KeyError: 88
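(For anyone hitting the same traceback: the last frames show `sentence-transformers` sorting the inputs by length and then indexing them with positional integers, `sentences[idx]`. On a pandas Series, integer keys are treated as index labels, so if the Series' index is not a plain 0..N-1 range, e.g. after filtering or a train/test split, a label such as 88 may simply not exist and a `KeyError` is raised. A minimal sketch of that pandas behaviour, with toy data:)

```python
import pandas as pd

docs = pd.Series(["first document", "second document", "third document"])
subset = docs.drop(index=0)   # simulate filtering/splitting; the index is now [1, 2]

try:
    subset[0]                 # integer key -> label lookup on a pandas Series
except KeyError as err:
    print("KeyError:", err)   # fails just like the traceback above

print(list(subset)[0])        # a plain list restores positional access: "second document"
```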

MaartenGr commented 1 month ago

Hmmm, I'm not entirely sure what is happening here. Did you make sure that the environments in which you load and save the model are identical? When you use pickle to save a model, it is important that you use version control to exactly reproduce the training environment.
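(As an aside, the BERTopic documentation also describes safetensors/pytorch serialization, which stores the topic representations outside of pickle and is less sensitive to exact library versions. A rough sketch only; the 20 Newsgroups data and the "sentence-transformers/all-MiniLM-L6-v2" model are just examples, not what is used in this issue:)

```python
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

docs = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes")).data[:1000]

# Example embedding model; any sentence-transformers pointer works here
embedding_model = "sentence-transformers/all-MiniLM-L6-v2"
topic_model = BERTopic(embedding_model=embedding_model).fit(docs)

topic_model.save(
    "./safetensors_model",
    serialization="safetensors",
    save_ctfidf=True,
    save_embedding_model=embedding_model,  # stored as a pointer, reloaded on load()
)

loaded_model = BERTopic.load("./safetensors_model")
topics, probs = loaded_model.transform(docs)
```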

registjl commented 1 month ago

Thanks for getting back to me, Maarten. I'm using PyCharm for development, and I created both the script that builds/saves the model and the script that loads it and runs transform in the same PyCharm project, i.e., I didn't create a new environment (as far as I know).

I'll let you know!

registjl commented 1 month ago

Hi Maarten --

I THINK I FOUND THE PROBLEM: I CONVERTED THE test_data['text'] DATAFRAME TO A LIST and IT APPEARS TO WORK!
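(In terms of the code earlier in the thread, that fix amounts to passing a plain list of strings instead of the pandas column, e.g.:)

```python
from bertopic import BERTopic

topic_model = BERTopic.load("./BERTopic_trained_model_163")

# TEST_DATA is the DataFrame from the snippet above; .tolist() drops the pandas
# index so sentence-transformers can index the documents positionally.
docs = TEST_DATA['clean_text'].tolist()
topics, probabilities = topic_model.transform(docs)
```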

MaartenGr commented 1 month ago

That's great! Glad to hear that it worked 😄