MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

topic_model.transform(training_data) results in KeyError: 6467 #2161

Open registjl opened 5 days ago

registjl commented 5 days ago

Discussed in https://github.com/MaartenGr/BERTopic/discussions/2160

Originally posted by **registjl** September 26, 2024

Hi -- I'm new to NLP and loving BERTopic! I've created a model 'topic_model'. topic_model.fit_transform(training_data) works just fine and results in a model that I've saved to disk. When I load the model in another script, I run the 'training_data' back through the model to see whether I get the same results.

- 'fit_transform' runs ok.
- However, topic_model.transform(training_data) returns KeyError: 6467.

I'm assuming that the 'topic_model.transform' statement should return the topics associated with the data that I pass to it. Is this correct? Any ideas/guidance are greatly appreciated!

JLR

MaartenGr commented 4 days ago

Thanks for converting this into an issue! Sorry to be a bit more annoying but I'll need some more information. Which version of BERTopic do you have? Also, could you provide the full code? That includes both fitting the model as well as how you saved and loaded it again.

Lastly, could you provide the full error log?

registjl commented 4 days ago

Hi Maarten - thanks for your help. Let me know if you need any additional info.

- CODE TO FIT THE MODEL (I'm limited in what I can share):

from bertopic import BERTopic
from hdbscan import HDBSCAN
from transformers import AutoModel, AutoTokenizer
from umap import UMAP

embedding_model = AutoModel.from_pretrained("cardiffnlp/tweet-topic-21-multi")
iteration_sel = "163"
n_neighbors = 10
n_components = 2
min_dist = 0.0
min_cluster_size = 15
min_samples = 1

umap_model = UMAP(n_neighbors=n_neighbors, n_components=n_components, min_dist=min_dist)
hdbscan_model = HDBSCAN(min_cluster_size=min_cluster_size, min_samples=min_samples, prediction_data=True)

topic_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model,
                       calculate_probabilities=True, embedding_model=embedding_model)
topics, probabilities = topic_model.fit_transform(TRAIN_DATA['clean_text'])
model_name = "./BERTopic_trained_model_163"
topic_model.save(model_name)

- CODE WHICH LOADS AND EXECUTES THE SAVED MODEL

topic_model_name = f'./BERTopic_trained_model_163'
topic_model = BERTopic.load(topic_model_name)
topics, probabilities = topic_model.transform(TEST_DATA['clean_text'])

Traceback (most recent call last):
  File "C:\Users....\venv\lib\site-packages\pandas\core\indexes\base.py", line 3805, in get_loc
    return self._engine.get_loc(casted_key)
  File "index.pyx", line 167, in pandas._libs.index.IndexEngine.get_loc
  File "index.pyx", line 196, in pandas._libs.index.IndexEngine.get_loc
  File "pandas\_libs\hashtable_class_helper.pxi", line 2606, in pandas._libs.hashtable.Int64HashTable.get_item
  File "pandas\_libs\hashtable_class_helper.pxi", line 2630, in pandas._libs.hashtable.Int64HashTable.get_item
KeyError: 88

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\Users....\venv\lib\site-packages\IPython\core\interactiveshell.py", line 3577, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "", line 193, in
    topics, probabilities = topic_model.transform(TEST_DATA['clean_text'])
  File "C:\Users....\venv\lib\site-packages\bertopic\_bertopic.py", line 578, in transform
    embeddings = self._extract_embeddings(documents, images=images, method="document", verbose=self.verbose)
  File "C:\Users....\venv\lib\site-packages\bertopic\_bertopic.py", line 3676, in _extract_embeddings
    embeddings = self.embedding_model.embed_documents(documents, verbose=verbose)
  File "C:\Users...\venv\lib\site-packages\bertopic\backend\_base.py", line 62, in embed_documents
    return self.embed(document, verbose)
  File "C:\Users....\venv\lib\site-packages\bertopic\backend\_sentencetransformers.py", line 65, in embed
    embeddings = self.embedding_model.encode(documents, show_progress_bar=verbose)
  File "C:\Users....\venv\lib\site-packages\sentence_transformers\SentenceTransformer.py", line 157, in encode
    sentences_sorted = [sentences[idx] for idx in length_sorted_idx]
  File "C:\Users....\venv\lib\site-packages\sentence_transformers\SentenceTransformer.py", line 157, in
    sentences_sorted = [sentences[idx] for idx in length_sorted_idx]
  File "C:\Users....\venv\lib\site-packages\pandas\core\series.py", line 1121, in __getitem__
    return self._get_value(key)
  File "C:\Users....\venv\lib\site-packages\pandas\core\series.py", line 1237, in _get_value
    loc = self.index.get_loc(label)
  File "C:\Users....\venv\lib\site-packages\pandas\core\indexes\base.py", line 3812, in get_loc
    raise KeyError(key) from err
KeyError: 88
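
One plausible reading of the traceback: the error is raised inside `encode`, where `sentences[idx]` is evaluated with positional indices, while `sentences` is the pandas Series `TEST_DATA['clean_text']`. A Series looks integers up by index label, so if the index has gaps (for example after filtering rows or a train/test split), a position such as 88 may not exist as a label. The sketch below reproduces that mechanism in isolation. It is an illustration under assumptions, not BERTopic or sentence-transformers code: a plain dict keyed by row label stands in for the Series so the snippet runs without pandas, and `gather_by_position` is a hypothetical, simplified version of the lookup pattern in the traceback.

```python
# Stand-in for a pandas Series whose integer index has gaps.
# Assumption: a dict keyed by row label is used only so this runs without pandas.
labelled_docs = {
    0: "first tweet",
    2: "a somewhat longer second tweet",
    5: "third tweet",
}


def gather_by_position(sentences):
    """Hypothetical, simplified form of the lookup seen in the traceback:
    positions 0..n-1 are used as keys into `sentences`."""
    return [sentences[idx] for idx in range(len(sentences))]


# Label-keyed container: position 1 is not an existing label, so this fails.
try:
    gather_by_position(labelled_docs)
    error = None
except KeyError as exc:
    error = exc
    print(f"KeyError: {exc}")

# Plain list: positions and keys coincide, so the same lookup succeeds.
plain_docs = list(labelled_docs.values())
print(gather_by_position(plain_docs))
```

If this is indeed the cause, the analogous change in the code above would be to pass a plain list of strings, for example `TEST_DATA['clean_text'].tolist()`, to `topic_model.transform`. Resetting the Series index before the call would be another option, though a list avoids depending on the index at all.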