Closed: KeeratKG closed this issue 3 months ago.
I'm running into the same issue. The code was working three weeks ago.
I just created a PR that should resolve this issue, could you test whether it works for you? If so, I will go ahead and create a new release (0.16.4) since this affects the core functionality of BERTopic.
This doesn't solve the problem for me. I did install from the branch: pip install git+https://github.com/MaartenGr/BERTopic.git@fix_2100.
I'm training the model the following way:
from pathlib import Path

from bertopic import BERTopic
from cuml.cluster import HDBSCAN
from cuml.manifold import UMAP

# Create instances of GPU-accelerated UMAP and HDBSCAN
umap_model = UMAP(n_components=5, n_neighbors=15, min_dist=0.0)
hdbscan_model = HDBSCAN(min_samples=10, gen_min_span_tree=True, prediction_data=True)

# Pass the above models to be used in BERTopic
topic_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model, nr_topics="auto")
# docs (list of strings), embeds (precomputed embeddings), and save_dir come from my own setup
topic_model = topic_model.fit(docs, embeds)

path = Path(f"{save_dir}/model.bin")
topic_model.save(path.as_posix(), serialization="pickle")
I get the following error:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
Cell In[11], line 1
----> 1 topic_model = train_model()
Cell In[10], line 30
28 # Pass the above models to be used in BERTopic
29 topic_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model, nr_topics="auto")
---> 30 topic_model = topic_model.fit(docs, embeds)
31 path = Path(f"{save_dir}/model.bin")
32 topic_model.save(path.as_posix(), serialization="pickle")
File ~/miniconda3/envs/python=3.10/lib/python3.10/site-packages/bertopic/_bertopic.py:364, in BERTopic.fit(self, documents, embeddings, images, y)
322 def fit(
323 self,
324 documents: List[str],
(...)
327 y: Union[List[int], np.ndarray] = None,
328 ):
329 """Fit the models (Bert, UMAP, and, HDBSCAN) on a collection of documents and generate topics.
330
331 Arguments:
(...)
362 ```
363 """
--> 364 self.fit_transform(documents=documents, embeddings=embeddings, y=y, images=images)
365 return self
File ~/miniconda3/envs/python=3.10/lib/python3.10/site-packages/bertopic/_bertopic.py:496, in BERTopic.fit_transform(self, documents, embeddings, images, y)
494 # Reduce topics
495 if self.nr_topics:
--> 496 documents = self._reduce_topics(documents)
498 # Save the top 3 most representative documents per topic
499 self._save_representative_docs(documents)
File ~/miniconda3/envs/python=3.10/lib/python3.10/site-packages/bertopic/_bertopic.py:4347, in BERTopic._reduce_topics(self, documents, use_ctfidf)
4345 documents = self._reduce_to_n_topics(documents, use_ctfidf)
4346 elif isinstance(self.nr_topics, str):
-> 4347 documents = self._auto_reduce_topics(documents, use_ctfidf)
4348 else:
4349 raise ValueError("nr_topics needs to be an int or 'auto'! ")
File ~/miniconda3/envs/python=3.10/lib/python3.10/site-packages/bertopic/_bertopic.py:4502, in BERTopic._auto_reduce_topics(self, documents, use_ctfidf)
4500 self.topic_mapper_.add_mappings(mapped_topics)
4501 documents = self._sort_mappings_by_frequency(documents)
-> 4502 self._extract_topics(documents, mappings=mappings)
4503 self._update_topic_size(documents)
4504 return documents
File ~/miniconda3/envs/python=3.10/lib/python3.10/site-packages/bertopic/_bertopic.py:3985, in BERTopic._extract_topics(self, documents, embeddings, mappings, verbose)
3983 self.c_tf_idf_, words = self._c_tf_idf(documents_per_topic)
3984 self.topic_representations_ = self._extract_words_per_topic(words, documents)
-> 3985 self._create_topic_vectors(documents=documents, embeddings=embeddings, mappings=mappings)
3986 if verbose:
3987 logger.info("Representation - Completed \u2713")
File ~/miniconda3/envs/python=3.10/lib/python3.10/site-packages/bertopic/_bertopic.py:4121, in BERTopic._create_topic_vectors(self, documents, embeddings, mappings)
4119 topic_embeddings_dict = {}
4120 for topic_to, topics_from in mappings.items():
-> 4121 topic_ids = topics_from["topics_from"]
4122 topic_sizes = topics_from["topic_sizes"]
4123 if topic_ids:
KeyError: 'topics_from'
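For context, the final frame shows the structure _create_topic_vectors expects: each value in mappings should be a dict carrying "topics_from" and "topic_sizes" keys. A minimal sketch of that contract, with made-up topic ids and sizes:

# Shape the loop at _bertopic.py:4120 expects for each entry of `mappings`
mappings = {0: {"topics_from": [3, 7], "topic_sizes": [120, 45]}}
# If a value lacks the "topics_from" key, the lookup
# topics_from["topics_from"] raises KeyError: 'topics_from'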
The fix did not work for me either unfortunately!
I have the same problem when using nr_topics="auto".
Does anybody have a fully reproducible example (data included)? I ask because when I run the following after installing the fix from the related PR, I get no errors:
from sentence_transformers import SentenceTransformer
from datasets import load_dataset
from bertopic import BERTopic
from hdbscan import HDBSCAN
from umap import UMAP
# Extract abstracts to train on and corresponding titles
dataset = load_dataset("CShorten/ML-ArXiv-Papers")["train"]
abstracts = dataset["abstract"][:10_000]
# Pre-calculate embeddings
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = embedding_model.encode(abstracts, show_progress_bar=True)
# Use sub-models
umap_model = UMAP(n_components=5, n_neighbors=15, min_dist=0.0, random_state=42)
hdbscan_model = HDBSCAN(min_samples=5, gen_min_span_tree=True, prediction_data=True)
# Pass the above models to be used in BERTopic
topic_model = BERTopic(
umap_model=umap_model,
hdbscan_model=hdbscan_model,
nr_topics="auto",
verbose=True
)
topic_model = topic_model.fit(abstracts, embeddings)
Dear MaartenGr, thank you for sharing the code. Unfortunately, it does not work in my case, where I use a pipeline to run BERTopic on non-English text data.
To be specific, I now get the same error (KeyError: 'topics_from') whenever I try to use the BERTopic commands. The commands worked well several weeks ago, but I don't know why they do not work now. Since my data is not written in English, I am using a pipeline for my pre-trained model, as shown below.
"from transformers.pipelines import pipeline
pretrained_model = pipeline("feature-extraction", model="beomi/kcbert-base")"
With this pipeline, the suggested commands did not work. And if I copy the suggested commands and run them as-is (in other words, if I use SentenceTransformer("all-MiniLM-L6-v2") instead of my original pipeline), then the error below appears.
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Input In [24], in <cell line: 7>()
1 topic_model = BERTopic(
2 umap_model=umap_model,
3 hdbscan_model=hdbscan_model,
4 nr_topics="auto",
5 verbose=True
6 )
----> 7 topic_model = topic_model.fit(documents, embeddings)
File ~\anaconda3\lib\site-packages\bertopic\_bertopic.py:364, in BERTopic.fit(self, documents, embeddings, images, y)
322 def fit(
323 self,
324 documents: List[str],
(...)
327 y: Union[List[int], np.ndarray] = None,
328 ):
329 """Fit the models (Bert, UMAP, and, HDBSCAN) on a collection of documents and generate topics.
330
331 Arguments:
(...)
362 ```
363 """
--> 364 self.fit_transform(documents=documents, embeddings=embeddings, y=y, images=images)
365 return self
File ~\anaconda3\lib\site-packages\bertopic\_bertopic.py:492, in BERTopic.fit_transform(self, documents, embeddings, images, y)
489 self._save_representative_docs(custom_documents)
490 else:
491 # Extract topics by calculating c-TF-IDF
--> 492 self._extract_topics(documents, embeddings=embeddings, verbose=self.verbose)
494 # Reduce topics
495 if self.nr_topics:
File ~\anaconda3\lib\site-packages\bertopic\_bertopic.py:3983, in BERTopic._extract_topics(self, documents, embeddings, mappings, verbose)
3981 logger.info("Representation - Extracting topics from clusters using representation models.")
3982 documents_per_topic = documents.groupby(["Topic"], as_index=False).agg({"Document": " ".join})
-> 3983 self.c_tf_idf_, words = self._c_tf_idf(documents_per_topic)
3984 self.topic_representations_ = self._extract_words_per_topic(words, documents)
3985 self._create_topic_vectors(documents=documents, embeddings=embeddings, mappings=mappings)
File ~\anaconda3\lib\site-packages\bertopic\_bertopic.py:4194, in BERTopic._c_tf_idf(self, documents_per_topic, fit, partial_fit)
4192 X = self.vectorizer_model.partial_fit(documents).update_bow(documents)
4193 elif fit:
-> 4194 X = self.vectorizer_model.fit_transform(documents)
4195 else:
4196 X = self.vectorizer_model.transform(documents)
File ~\anaconda3\lib\site-packages\sklearn\feature_extraction\text.py:1330, in CountVectorizer.fit_transform(self, raw_documents, y)
1322 warnings.warn(
1323 "Upper case characters found in"
1324 " vocabulary while 'lowercase'"
1325 " is True. These entries will not"
1326 " be matched with any documents"
1327 )
1328 break
-> 1330 vocabulary, X = self._count_vocab(raw_documents, self.fixed_vocabulary_)
1332 if self.binary:
1333 X.data.fill(1)
File ~\anaconda3\lib\site-packages\sklearn\feature_extraction\text.py:1220, in CountVectorizer._count_vocab(self, raw_documents, fixed_vocab)
1218 vocabulary = dict(vocabulary)
1219 if not vocabulary:
-> 1220 raise ValueError(
1221 "empty vocabulary; perhaps the documents only contain stop words"
1222 )
1224 if indptr[-1] > np.iinfo(np.int32).max: # = 2**31 - 1
1225 if _IS_32BIT:
ValueError: empty vocabulary; perhaps the documents only contain stop words
What should I do to solve this problem? (Please understand that I cannot upload my data, but the KeyError still appears. Please help!)
@jlee9095 I'm a bit confused. Are you saying that you have two separate issues? Because you mentioned that running the code I provided did not work for you. Could you share your full code to showcase both issues? Also, I'm not able to reproduce the issue, so if you can reproduce it with dummy data (like the data I shared), I can more easily figure out what is wrong.
@MaartenGr the fix #2101 works for me, thank you! Happy to leave this issue open if y'all want to discuss more.
I just created a PR that should resolve this issue, could you test whether it works for you? If so, I will go ahead and create a new release (0.16.4) since this affects the core functionality of BERTopic.
Yes please.
@MaartenGr Thank you for your response. Yes, I have two separate issues. The errors I posted above appear whenever I run your suggested commands as they are (that is, when using SentenceTransformer). As an alternative, if I use my original Hugging Face pipeline, the error appears when running embeddings = embedding_model.encode(documents, show_progress_bar=True). Below are the commands and the errors for the second case.
(Commands for the case using the pipeline from Hugging Face)
import pandas as pd

docu = pd.read_csv('C:/Users/BERTopic/after_preprocessing.csv', engine='python')
len(docu)
documents = docu['text'].to_list()

from sentence_transformers import SentenceTransformer
from bertopic import BERTopic
from hdbscan import HDBSCAN
from umap import UMAP
from transformers.pipelines import pipeline

pretrained_model = pipeline("feature-extraction", model="beomi/kcbert-base")

embedding_model = pretrained_model
embeddings = embedding_model.encode(documents, show_progress_bar=True)
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Input In [20], in <cell line: 2>()
1 embedding_model = pretrained_model
----> 2 embeddings = embedding_model.encode(documents, show_progress_bar=True)
AttributeError: 'FeatureExtractionPipeline' object has no attribute 'encode'
I'm sorry that I'm having trouble finding a good example dataset, but I'll do my best to figure it out as well.
@MaartenGr Hi, here are two cases that I tested using the example data.
[Case 1. Commands]
from sentence_transformers import SentenceTransformer
from datasets import load_dataset
from bertopic import BERTopic
from hdbscan import HDBSCAN
from umap import UMAP

dataset = load_dataset('klue', 'sts')["train"]
abstracts = dataset['sentence1'][:1000]

embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = embedding_model.encode(abstracts, show_progress_bar=True)

umap_model = UMAP(n_components=5, n_neighbors=15, min_dist=0.0, random_state=42)
hdbscan_model = HDBSCAN(min_samples=5, gen_min_span_tree=True, prediction_data=True)

topic_model = BERTopic(
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    nr_topics="auto",
    verbose=True
)
topic_model = topic_model.fit(abstracts, embeddings)
Then, I got the error below.
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
Input In [7], in <cell line: 26>()
19 # Pass the above models to be used in BERTopic
20 topic_model = BERTopic(
21 umap_model=umap_model,
22 hdbscan_model=hdbscan_model,
23 nr_topics="auto",
24 verbose=True
25 )
---> 26 topic_model = topic_model.fit(abstracts, embeddings)
File ~\anaconda3\lib\site-packages\bertopic\_bertopic.py:364, in BERTopic.fit(self, documents, embeddings, images, y)
322 def fit(
323 self,
324 documents: List[str],
(...)
327 y: Union[List[int], np.ndarray] = None,
328 ):
329 """Fit the models (Bert, UMAP, and, HDBSCAN) on a collection of documents and generate topics.
330
331 Arguments:
(...)
362 ```
363 """
--> 364 self.fit_transform(documents=documents, embeddings=embeddings, y=y, images=images)
365 return self
File ~\anaconda3\lib\site-packages\bertopic\_bertopic.py:496, in BERTopic.fit_transform(self, documents, embeddings, images, y)
494 # Reduce topics
495 if self.nr_topics:
--> 496 documents = self._reduce_topics(documents)
498 # Save the top 3 most representative documents per topic
499 self._save_representative_docs(documents)
File ~\anaconda3\lib\site-packages\bertopic\_bertopic.py:4347, in BERTopic._reduce_topics(self, documents, use_ctfidf)
4345 documents = self._reduce_to_n_topics(documents, use_ctfidf)
4346 elif isinstance(self.nr_topics, str):
-> 4347 documents = self._auto_reduce_topics(documents, use_ctfidf)
4348 else:
4349 raise ValueError("nr_topics needs to be an int or 'auto'! ")
File ~\anaconda3\lib\site-packages\bertopic\_bertopic.py:4502, in BERTopic._auto_reduce_topics(self, documents, use_ctfidf)
4500 self.topic_mapper_.add_mappings(mapped_topics)
4501 documents = self._sort_mappings_by_frequency(documents)
-> 4502 self._extract_topics(documents, mappings=mappings)
4503 self._update_topic_size(documents)
4504 return documents
File ~\anaconda3\lib\site-packages\bertopic\_bertopic.py:3985, in BERTopic._extract_topics(self, documents, embeddings, mappings, verbose)
3983 self.c_tf_idf_, words = self._c_tf_idf(documents_per_topic)
3984 self.topic_representations_ = self._extract_words_per_topic(words, documents)
-> 3985 self._create_topic_vectors(documents=documents, embeddings=embeddings, mappings=mappings)
3986 if verbose:
3987 logger.info("Representation - Completed \u2713")
File ~\anaconda3\lib\site-packages\bertopic\_bertopic.py:4121, in BERTopic._create_topic_vectors(self, documents, embeddings, mappings)
4119 topic_embeddings_dict = {}
4120 for topic_to, topics_from in mappings.items():
-> 4121 topic_ids = topics_from["topics_from"]
4122 topic_sizes = topics_from["topic_sizes"]
4123 if topic_ids:
KeyError: 'topics_from'
[Case 2. Commands]
from sentence_transformers import SentenceTransformer
from datasets import load_dataset
from bertopic import BERTopic
from hdbscan import HDBSCAN
from umap import UMAP

dataset = load_dataset('klue', 'sts')["train"]
abstracts = dataset['sentence1'][:1000]

from transformers.pipelines import pipeline

pretrained_model = pipeline("feature-extraction", model="beomi/kcbert-base")
embedding_model = pretrained_model
embeddings = embedding_model.encode(abstracts, show_progress_bar=True)

umap_model = UMAP(n_components=5, n_neighbors=15, min_dist=0.0, random_state=42)
hdbscan_model = HDBSCAN(min_samples=5, gen_min_span_tree=True, prediction_data=True)

topic_model = BERTopic(
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    nr_topics="auto",
    verbose=True
)
topic_model = topic_model.fit(abstracts, embeddings)
Then, I got the error below.
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Input In [14], in <cell line: 17>()
14 pretrained_model = pipeline("feature-extraction", model="beomi/kcbert-base")
16 embedding_model = pretrained_model
---> 17 embeddings = embedding_model.encode(abstracts, show_progress_bar=True)
19 # Use sub-models
20 umap_model = UMAP(n_components=5, n_neighbors=15, min_dist=0.0, random_state=42)
AttributeError: 'FeatureExtractionPipeline' object has no attribute 'encode'
How can I solve this problem? Any help will be greatly appreciated.
@jlee9095 The second example does not seem related to this particular issue. Generally, I would advise opening up a new issue for that, but it seems that you are using the encode function, which is not supported for a Hugging Face pipeline. Please refer to the HF pipeline documentation on how to extract embeddings.
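For example, a minimal sketch of one supported route: pass the pipeline directly as embedding_model, so BERTopic extracts the features itself (the model name simply mirrors the one used earlier in this thread):

from transformers.pipelines import pipeline
from bertopic import BERTopic

# A transformers pipeline has no .encode method; hand it to BERTopic instead of
# pre-computing embeddings, and BERTopic runs the feature extraction internally.
embedding_model = pipeline("feature-extraction", model="beomi/kcbert-base")
topic_model = BERTopic(embedding_model=embedding_model)
topics, probs = topic_model.fit_transform(abstracts)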
With respect to your first problem, it seems that the PR I linked resolves it. When you install from that PR, make sure it is actually installed and that you are not still using the official release.
For the KeyError: 'topics_from' error, I downgraded to the older version 0.16.0, which solved the problem for me.
When I set the nr_topics="auto" parameter, I encounter an error with the following configuration:
topic_model = BERTopic(
embedding_model=sentence_model,
vectorizer_model=vectorizer_model,
# min_topic_size = 100, # Split sentences "All"
nr_topics="auto", # Automatically detect the number of topics
# nr_topics = 10, #40, # Limit the total number of topics
top_n_words=10, # Use the top n words
calculate_probabilities=True,
umap_model=umap_model, # Fix UMAP random state
hdbscan_model=hdbscan_model # Set HDBSCAN model
)
When I comment out the nr_topics="auto" line, the error does not occur. However, when I set this parameter to "auto", I get KeyError: 'topics_from'. When I set nr_topics=10, the code runs properly.
@smbslt3 Have you tried the PR that I shared above? In my experience, it should fix the issue.
@MaartenGr Hi Maarten! I can't speak on behalf of @smbslt3, but I was experiencing the same issue, and the changes to _bertopic.py in #2101 fixed it for me.
It may also be worth noting, for anybody still facing this issue, that if you installed this library through pip and are trying to update with something along the lines of pip install git+https://github.com/MaartenGr/BERTopic.git@fix_2100 like @abhinavkulkarni did, this didn't actually update any code for me; I had to manually change the few lines of code in my local site-packages folder in my Anaconda environment.
Once this change is included in an official release (0.16.4), I'd assume that simply running pip install bertopic==0.16.4 will fix the issue for anyone using pip who is still experiencing it.
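One way to check what pip actually installed is to inspect the freeze output; a git install is listed with the repository URL and commit rather than a plain version pin:

pip freeze | grep bertopic
# from the fix branch: bertopic @ git+https://github.com/MaartenGr/BERTopic.git@<commit>
# from PyPI: bertopic==0.16.3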
I'm having the same issue (KeyError: 'topics_from'); my workaround is pip install bertopic==0.16.2. There appears to be a problem with the new version 0.16.3, and I hope it is fixed in the next release.
To everyone facing this issue: make sure you do not have BERTopic installed before you run pip install git+https://github.com/MaartenGr/BERTopic.git@fix_2100. This should install the related PR (#2101) and solve the issue.
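For example, in a fresh environment (a sketch of the sequence; -y only skips the uninstall confirmation prompt):

pip uninstall -y bertopic
pip install git+https://github.com/MaartenGr/BERTopic.git@fix_2100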
Based on this thread, I can confirm that if the PR is correctly installed, it should solve the issue. I intend to release a new version whenever #2105 is also merged into the main branch.
I also had the same issue. Thanks to your help, I was able to fix the problem. Thank you. I hope this bug is resolved in the 0.16.4 release.
Have you searched existing issues? 🔎
Describe the bug

When trying to run
topics, probs = TM.fit_transform(docs)
where docs is a list of strings (we want to cluster topics based on these strings), I run into the following error. This happens after the following steps of training have already taken place:
Reproduction
BERTopic Version
0.16.3