MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

KeyError: 'topics_from' #2100

Closed KeeratKG closed 1 month ago

KeeratKG commented 2 months ago

Have you searched existing issues? 🔎

Describe the bug

When trying to run topics, probs = TM.fit_transform(docs) where docs is a list of strings (we want to cluster topics based on these strings), I run into the following error:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[10], line 1
----> 1 topics, probs = TM.fit_transform(docs)

File /usr/local/lib/python3.9/site-packages/bertopic/_bertopic.py:496, in BERTopic.fit_transform(self, documents, embeddings, images, y)
    494 # Reduce topics
    495 if self.nr_topics:
--> 496     documents = self._reduce_topics(documents)
    498 # Save the top 3 most representative documents per topic
    499 self._save_representative_docs(documents)

File /usr/local/lib/python3.9/site-packages/bertopic/_bertopic.py:4347, in BERTopic._reduce_topics(self, documents, use_ctfidf)
   4345         documents = self._reduce_to_n_topics(documents, use_ctfidf)
   4346 elif isinstance(self.nr_topics, str):
-> 4347     documents = self._auto_reduce_topics(documents, use_ctfidf)
   4348 else:
   4349     raise ValueError("nr_topics needs to be an int or 'auto'! ")

File /usr/local/lib/python3.9/site-packages/bertopic/_bertopic.py:4502, in BERTopic._auto_reduce_topics(self, documents, use_ctfidf)
   4500 self.topic_mapper_.add_mappings(mapped_topics)
   4501 documents = self._sort_mappings_by_frequency(documents)
-> 4502 self._extract_topics(documents, mappings=mappings)
   4503 self._update_topic_size(documents)
   4504 return documents

File /usr/local/lib/python3.9/site-packages/bertopic/_bertopic.py:3985, in BERTopic._extract_topics(self, documents, embeddings, mappings, verbose)
   3983 self.c_tf_idf_, words = self._c_tf_idf(documents_per_topic)
   3984 self.topic_representations_ = self._extract_words_per_topic(words, documents)
-> 3985 self._create_topic_vectors(documents=documents, embeddings=embeddings, mappings=mappings)
   3986 if verbose:
   3987     logger.info("Representation - Completed \u2713")

File /usr/local/lib/python3.9/site-packages/bertopic/_bertopic.py:4121, in BERTopic._create_topic_vectors(self, documents, embeddings, mappings)
   4119 topic_embeddings_dict = {}
   4120 for topic_to, topics_from in mappings.items():
-> 4121     topic_ids = topics_from["topics_from"]
   4122     topic_sizes = topics_from["topic_sizes"]
   4123     if topic_ids:

KeyError: 'topics_from'
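
For context, the failing loop in _create_topic_vectors assumes that every value in mappings is a dictionary with 'topics_from' and 'topic_sizes' keys. A minimal sketch of the shape it appears to expect, inferred purely from the traceback above (the topic IDs and sizes here are hypothetical):

# Hypothetical illustration of the structure the loop expects: each merged
# topic maps to the topic IDs it absorbed and their respective sizes.
mappings = {
    0: {"topics_from": [3, 7], "topic_sizes": [120, 45]},
    1: {"topics_from": [5], "topic_sizes": [80]},
}
for topic_to, topics_from in mappings.items():
    topic_ids = topics_from["topics_from"]    # KeyError when this key is absent
    topic_sizes = topics_from["topic_sizes"]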

This happens after the following steps of training have already taken place:

2024-07-26 18:43:39,195 - BERTopic - Embedding - Transforming documents to embeddings.
Error displaying widget: model not found
2024-07-26 18:43:55,125 - BERTopic - Embedding - Completed ✓
2024-07-26 18:43:55,126 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-07-26 18:44:21,848 - BERTopic - Dimensionality - Completed ✓
2024-07-26 18:44:21,849 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-07-26 18:44:40,617 - BERTopic - Cluster - Completed ✓
2024-07-26 18:44:40,618 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-07-26 18:45:04,160 - BERTopic - Representation - Completed ✓
2024-07-26 18:45:04,171 - BERTopic - Topic reduction - Reducing number of topics

Reproduction

from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import CountVectorizer
from bertopic.vectorizers import ClassTfidfTransformer
from bertopic.representation import MaximalMarginalRelevance
from umap import UMAP
from hdbscan import HDBSCAN
import numpy as np
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer 
from nltk.tokenize import word_tokenize
from collections import Counter

class LemmaTokenizer:
    def __init__(self):
        self.wnl = WordNetLemmatizer()
    def __call__(self, doc):
        return [self.wnl.lemmatize(t) for t in word_tokenize(doc)]

stopwords = list(stopwords.words('english'))

SENT_EMBEDDING = SentenceTransformer('all-MiniLM-L6-v2')
UMAP_MODEL = UMAP(n_neighbors=15, n_components=3, min_dist=0.05)
HDBSCAN_MODEL = HDBSCAN(min_cluster_size=15, prediction_data=True, gen_min_span_tree=True)
VECTORIZE_MODEL = CountVectorizer(ngram_range=(1,3), stop_words=stopwords, tokenizer=LemmaTokenizer())
ctfidf_model = ClassTfidfTransformer()
representation_model = MaximalMarginalRelevance(diversity=0.2)

TM = BERTopic(
    umap_model=UMAP_MODEL,
    hdbscan_model=HDBSCAN_MODEL,
    embedding_model=SENT_EMBEDDING,
    vectorizer_model=VECTORIZE_MODEL,
    ctfidf_model=ctfidf_model,
    representation_model=representation_model,
    language='english',
    calculate_probabilities=True,
    verbose=True,
    nr_topics='auto')

docs = ["The weather today is amazing", "It is quite unbearably hot today", "Oh this ice cream looks lovely", "Where are you?", "How are you?"] ## sample only 

topics, probs = TM.fit_transform(docs)

BERTopic Version

0.16.3

lichenzhen commented 2 months ago

I'm running into the same issue. The code was working three weeks ago.

MaartenGr commented 2 months ago

I just created a PR that should resolve this issue, could you test whether it works for you? If so, I will go ahead and create a new release (0.16.4) since this affects the core functionality of BERTopic.

abhinavkulkarni commented 2 months ago

This doesn't solve the problem for me. I did install from the branch: pip install git+https://github.com/MaartenGr/BERTopic.git@fix_2100.

I'm training the model the following way:

from pathlib import Path

from bertopic import BERTopic
from cuml.cluster import HDBSCAN
from cuml.manifold import UMAP

# Create instances of GPU-accelerated UMAP and HDBSCAN
umap_model = UMAP(n_components=5, n_neighbors=15, min_dist=0.0)
hdbscan_model = HDBSCAN(min_samples=10, gen_min_span_tree=True, prediction_data=True)

# Pass the above models to be used in BERTopic
topic_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model, nr_topics="auto")
topic_model = topic_model.fit(docs, embeds)
path = Path(f"{save_dir}/model.bin")
topic_model.save(path.as_posix(), serialization="pickle")

I get the following error:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[11], line 1
----> 1 topic_model = train_model()

Cell In[10], line 30
     28 # Pass the above models to be used in BERTopic
     29 topic_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model, nr_topics="auto")
---> 30 topic_model = topic_model.fit(docs, embeds)
     31 path = Path(f"{save_dir}/model.bin")
     32 topic_model.save(path.as_posix(), serialization="pickle")

File ~/miniconda3/envs/python=3.10/lib/python3.10/site-packages/bertopic/_bertopic.py:364, in BERTopic.fit(self, documents, embeddings, images, y)
    322 def fit(
    323     self,
    324     documents: List[str],
   (...)
    327     y: Union[List[int], np.ndarray] = None,
    328 ):
    329     """Fit the models (Bert, UMAP, and, HDBSCAN) on a collection of documents and generate topics.
    330 
    331     Arguments:
   (...)
    362     ```
    363     """
--> 364     self.fit_transform(documents=documents, embeddings=embeddings, y=y, images=images)
    365     return self

File ~/miniconda3/envs/python=3.10/lib/python3.10/site-packages/bertopic/_bertopic.py:496, in BERTopic.fit_transform(self, documents, embeddings, images, y)
    494 # Reduce topics
    495 if self.nr_topics:
--> 496     documents = self._reduce_topics(documents)
    498 # Save the top 3 most representative documents per topic
    499 self._save_representative_docs(documents)

File ~/miniconda3/envs/python=3.10/lib/python3.10/site-packages/bertopic/_bertopic.py:4347, in BERTopic._reduce_topics(self, documents, use_ctfidf)
   4345         documents = self._reduce_to_n_topics(documents, use_ctfidf)
   4346 elif isinstance(self.nr_topics, str):
-> 4347     documents = self._auto_reduce_topics(documents, use_ctfidf)
   4348 else:
   4349     raise ValueError("nr_topics needs to be an int or 'auto'! ")

File ~/miniconda3/envs/python=3.10/lib/python3.10/site-packages/bertopic/_bertopic.py:4502, in BERTopic._auto_reduce_topics(self, documents, use_ctfidf)
   4500 self.topic_mapper_.add_mappings(mapped_topics)
   4501 documents = self._sort_mappings_by_frequency(documents)
-> 4502 self._extract_topics(documents, mappings=mappings)
   4503 self._update_topic_size(documents)
   4504 return documents

File ~/miniconda3/envs/python=3.10/lib/python3.10/site-packages/bertopic/_bertopic.py:3985, in BERTopic._extract_topics(self, documents, embeddings, mappings, verbose)
   3983 self.c_tf_idf_, words = self._c_tf_idf(documents_per_topic)
   3984 self.topic_representations_ = self._extract_words_per_topic(words, documents)
-> 3985 self._create_topic_vectors(documents=documents, embeddings=embeddings, mappings=mappings)
   3986 if verbose:
   3987     logger.info("Representation - Completed \u2713")

File ~/miniconda3/envs/python=3.10/lib/python3.10/site-packages/bertopic/_bertopic.py:4121, in BERTopic._create_topic_vectors(self, documents, embeddings, mappings)
   4119 topic_embeddings_dict = {}
   4120 for topic_to, topics_from in mappings.items():
-> 4121     topic_ids = topics_from["topics_from"]
   4122     topic_sizes = topics_from["topic_sizes"]
   4123     if topic_ids:

KeyError: 'topics_from'

ellenlnt commented 2 months ago

The fix did not work for me either unfortunately!

KlausikPL commented 2 months ago

I have the same problem when using nr_topics="auto".

MaartenGr commented 2 months ago

Does anybody have a fully reproducible example (data included)? I ask because when I run the following after installing the fix from the related PR, I get no errors:

from sentence_transformers import SentenceTransformer
from datasets import load_dataset
from bertopic import BERTopic
from hdbscan import HDBSCAN
from umap import UMAP

# Extract abstracts to train on and corresponding titles
dataset = load_dataset("CShorten/ML-ArXiv-Papers")["train"]
abstracts = dataset["abstract"][:10_000]

# Pre-calculate embeddings
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = embedding_model.encode(abstracts, show_progress_bar=True)

# Use sub-models
umap_model = UMAP(n_components=5, n_neighbors=15, min_dist=0.0, random_state=42)
hdbscan_model = HDBSCAN(min_samples=5, gen_min_span_tree=True, prediction_data=True)

# Pass the above models to be used in BERTopic
topic_model = BERTopic(
    umap_model=umap_model, 
    hdbscan_model=hdbscan_model, 
    nr_topics="auto",
    verbose=True
)
topic_model = topic_model.fit(abstracts, embeddings)

jlee9095 commented 2 months ago

Dear MaartenGr, thank you for sharing the code. Unfortunately, it does not work in my case, where I use a pipeline to run BERTopic on non-English text data.

To be specific, I now have the same problem (KeyError: 'topics_from') whenever I try to use the BERTopic commands. The commands worked well several weeks ago, but I don't know why they fail now. Since my data is not written in English, I am using a pipeline with my pre-trained model, as shown below.

"from transformers.pipelines import pipeline

pretrained_model = pipeline("feature-extraction", model="beomi/kcbert-base")"

In this case, the suggested commands did not work. If I copy the suggested commands and run them as-is (in other words, if I use SentenceTransformer("all-MiniLM-L6-v2") instead of my original pipeline), then the error below appears.


---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Input In [24], in <cell line: 7>()
      1 topic_model = BERTopic(
      2     umap_model=umap_model,
      3     hdbscan_model=hdbscan_model,
      4     nr_topics="auto",
      5     verbose=True
      6 )
----> 7 topic_model = topic_model.fit(documents, embeddings)

File ~\anaconda3\lib\site-packages\bertopic\_bertopic.py:364, in BERTopic.fit(self, documents, embeddings, images, y)
    322 def fit(
    323     self,
    324     documents: List[str],
   (...)
    327     y: Union[List[int], np.ndarray] = None,
    328 ):
    329     """Fit the models (Bert, UMAP, and, HDBSCAN) on a collection of documents and generate topics.
    330
    331     Arguments:
   (...)
    362     ```
    363     """
--> 364     self.fit_transform(documents=documents, embeddings=embeddings, y=y, images=images)
    365     return self

File ~\anaconda3\lib\site-packages\bertopic\_bertopic.py:492, in BERTopic.fit_transform(self, documents, embeddings, images, y)
    489     self._save_representative_docs(custom_documents)
    490 else:
    491     # Extract topics by calculating c-TF-IDF
--> 492     self._extract_topics(documents, embeddings=embeddings, verbose=self.verbose)
    494 # Reduce topics
    495 if self.nr_topics:

File ~\anaconda3\lib\site-packages\bertopic\_bertopic.py:3983, in BERTopic._extract_topics(self, documents, embeddings, mappings, verbose)
   3981 logger.info("Representation - Extracting topics from clusters using representation models.")
   3982 documents_per_topic = documents.groupby(["Topic"], as_index=False).agg({"Document": " ".join})
-> 3983 self.c_tf_idf_, words = self._c_tf_idf(documents_per_topic)
   3984 self.topic_representations_ = self._extract_words_per_topic(words, documents)
   3985 self._create_topic_vectors(documents=documents, embeddings=embeddings, mappings=mappings)

File ~\anaconda3\lib\site-packages\bertopic\_bertopic.py:4194, in BERTopic._c_tf_idf(self, documents_per_topic, fit, partial_fit)
   4192     X = self.vectorizer_model.partial_fit(documents).update_bow(documents)
   4193 elif fit:
-> 4194     X = self.vectorizer_model.fit_transform(documents)
   4195 else:
   4196     X = self.vectorizer_model.transform(documents)

File ~\anaconda3\lib\site-packages\sklearn\feature_extraction\text.py:1330, in CountVectorizer.fit_transform(self, raw_documents, y)
   1322         warnings.warn(
   1323             "Upper case characters found in"
   1324             " vocabulary while 'lowercase'"
   1325             " is True. These entries will not"
   1326             " be matched with any documents"
   1327         )
   1328         break
-> 1330 vocabulary, X = self._count_vocab(raw_documents, self.fixed_vocabulary_)
   1332 if self.binary:
   1333     X.data.fill(1)

File ~\anaconda3\lib\site-packages\sklearn\feature_extraction\text.py:1220, in CountVectorizer._count_vocab(self, raw_documents, fixed_vocab)
   1218 vocabulary = dict(vocabulary)
   1219 if not vocabulary:
-> 1220     raise ValueError(
   1221         "empty vocabulary; perhaps the documents only contain stop words"
   1222     )
   1224 if indptr[-1] > np.iinfo(np.int32).max:  # = 2**31 - 1
   1225     if _IS_32BIT:

ValueError: empty vocabulary; perhaps the documents only contain stop words


What should I do to solve this problem? Please understand that I cannot upload my data, but the KeyError still appears with it. Any help would be appreciated.

MaartenGr commented 2 months ago

@jlee9095 I'm a bit confused. Are you saying that you have two separate issues? Because you mentioned that running the code I provided did not work for you. Could you share your full code to showcase both issues? Also, I'm not able to reproduce the issue so if you can reproduce the issue with dummy data (like the data I shared), I can easier figure out what is wrong.

KeeratKG commented 2 months ago

@MaartenGr the fix #2101 works for me, thank you! Happy to leave this issue open if y'all want to discuss more.

> I just created a PR that should resolve this issue, could you test whether it works for you? If so, I will go ahead and create a new release (0.16.4) since this affects the core functionality of BERTopic.

Yes please.

jlee9095 commented 2 months ago

@MaartenGr Thank you for your response. Yes, I have two separate issues. The errors I uploaded above appear whenever I run your suggested commands as-is (that is, when using SentenceTransformer). As an alternative, if I use my original Hugging Face pipeline, the error appears when running the embeddings = embedding_model.encode(documents, show_progress_bar=True) command. Below are the commands and errors for the second case.

(Commands for the case using the pipeline from Hugging Face)

import pandas as pd

docu = pd.read_csv('C:/Users/BERTopic/after_preprocessing.csv', engine='python')
len(docu)

documents = docu['text'].to_list()

from sentence_transformers import SentenceTransformer
from bertopic import BERTopic
from hdbscan import HDBSCAN
from umap import UMAP

from transformers.pipelines import pipeline

pretrained_model = pipeline("feature-extraction", model="beomi/kcbert-base")

embedding_model = pretrained_model
embeddings = embedding_model.encode(documents, show_progress_bar=True)

(Then the error below appears)

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Input In [20], in <cell line: 2>()
      1 embedding_model = pretrained_model
----> 2 embeddings = embedding_model.encode(documents, show_progress_bar=True)

AttributeError: 'FeatureExtractionPipeline' object has no attribute 'encode'

I am sorry that I am having trouble finding good example data, but I'll do my best to figure it out as well.

jlee9095 commented 2 months ago

@MaartenGr Hi, here are two cases that I tested using the example data.

[Case 1. Commands]

from sentence_transformers import SentenceTransformer
from datasets import load_dataset
from bertopic import BERTopic
from hdbscan import HDBSCAN
from umap import UMAP

dataset = load_dataset('klue', 'sts')["train"]
abstracts = dataset['sentence1'][:1000]

embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = embedding_model.encode(abstracts, show_progress_bar=True)

umap_model = UMAP(n_components=5, n_neighbors=15, min_dist=0.0, random_state=42)
hdbscan_model = HDBSCAN(min_samples=5, gen_min_span_tree=True, prediction_data=True)

topic_model = BERTopic(
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    nr_topics="auto",
    verbose=True
)
topic_model = topic_model.fit(abstracts, embeddings)

Then I got the error below.

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Input In [7], in <cell line: 26>()
     19 # Pass the above models to be used in BERTopic
     20 topic_model = BERTopic(
     21     umap_model=umap_model,
     22     hdbscan_model=hdbscan_model,
     23     nr_topics="auto",
     24     verbose=True
     25 )
---> 26 topic_model = topic_model.fit(abstracts, embeddings)

File ~\anaconda3\lib\site-packages\bertopic\_bertopic.py:364, in BERTopic.fit(self, documents, embeddings, images, y)
    322 def fit(
    323     self,
    324     documents: List[str],
   (...)
    327     y: Union[List[int], np.ndarray] = None,
    328 ):
    329     """Fit the models (Bert, UMAP, and, HDBSCAN) on a collection of documents and generate topics.
    330
    331     Arguments:
   (...)
    362     ```
    363     """
--> 364     self.fit_transform(documents=documents, embeddings=embeddings, y=y, images=images)
    365     return self

File ~\anaconda3\lib\site-packages\bertopic\_bertopic.py:496, in BERTopic.fit_transform(self, documents, embeddings, images, y)
    494 # Reduce topics
    495 if self.nr_topics:
--> 496     documents = self._reduce_topics(documents)
    498 # Save the top 3 most representative documents per topic
    499 self._save_representative_docs(documents)

File ~\anaconda3\lib\site-packages\bertopic\_bertopic.py:4347, in BERTopic._reduce_topics(self, documents, use_ctfidf)
   4345         documents = self._reduce_to_n_topics(documents, use_ctfidf)
   4346 elif isinstance(self.nr_topics, str):
-> 4347     documents = self._auto_reduce_topics(documents, use_ctfidf)
   4348 else:
   4349     raise ValueError("nr_topics needs to be an int or 'auto'! ")

File ~\anaconda3\lib\site-packages\bertopic\_bertopic.py:4502, in BERTopic._auto_reduce_topics(self, documents, use_ctfidf)
   4500 self.topic_mapper_.add_mappings(mapped_topics)
   4501 documents = self._sort_mappings_by_frequency(documents)
-> 4502 self._extract_topics(documents, mappings=mappings)
   4503 self._update_topic_size(documents)
   4504 return documents

File ~\anaconda3\lib\site-packages\bertopic\_bertopic.py:3985, in BERTopic._extract_topics(self, documents, embeddings, mappings, verbose)
   3983 self.c_tf_idf_, words = self._c_tf_idf(documents_per_topic)
   3984 self.topic_representations_ = self._extract_words_per_topic(words, documents)
-> 3985 self._create_topic_vectors(documents=documents, embeddings=embeddings, mappings=mappings)
   3986 if verbose:
   3987     logger.info("Representation - Completed \u2713")

File ~\anaconda3\lib\site-packages\bertopic\_bertopic.py:4121, in BERTopic._create_topic_vectors(self, documents, embeddings, mappings)
   4119 topic_embeddings_dict = {}
   4120 for topic_to, topics_from in mappings.items():
-> 4121     topic_ids = topics_from["topics_from"]
   4122     topic_sizes = topics_from["topic_sizes"]
   4123     if topic_ids:

KeyError: 'topics_from'


[Case 2. Commands]

from sentence_transformers import SentenceTransformer
from datasets import load_dataset
from bertopic import BERTopic
from hdbscan import HDBSCAN
from umap import UMAP

dataset = load_dataset('klue', 'sts')["train"]
abstracts = dataset['sentence1'][:1000]

from transformers.pipelines import pipeline

pretrained_model = pipeline("feature-extraction", model="beomi/kcbert-base")

embedding_model = pretrained_model
embeddings = embedding_model.encode(abstracts, show_progress_bar=True)

umap_model = UMAP(n_components=5, n_neighbors=15, min_dist=0.0, random_state=42)
hdbscan_model = HDBSCAN(min_samples=5, gen_min_span_tree=True, prediction_data=True)

topic_model = BERTopic(
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    nr_topics="auto",
    verbose=True
)
topic_model = topic_model.fit(abstracts, embeddings)

Then I got the error below.

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Input In [14], in <cell line: 17>()
     14 pretrained_model = pipeline("feature-extraction", model="beomi/kcbert-base")
     16 embedding_model = pretrained_model
---> 17 embeddings = embedding_model.encode(abstracts, show_progress_bar=True)
     19 # Use sub-models
     20 umap_model = UMAP(n_components=5, n_neighbors=15, min_dist=0.0, random_state=42)

AttributeError: 'FeatureExtractionPipeline' object has no attribute 'encode'


How can I solve this problem? Any help will be greatly appreciated.

MaartenGr commented 2 months ago

@jlee9095 The second example does not seem related to this particular issue. Generally, I would advise opening a new issue for that, but it seems you are calling the encode function, which a Hugging Face pipeline does not support. Please refer to the HF pipeline documentation on how to extract embeddings.

With respect to your first problem, it seems that the PR I linked resolves it. When you install that PR, make sure it is actually installed and that you are not still using the official release.
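
As a rough sketch of what manually extracting embeddings from a feature-extraction pipeline can look like (mean-pooling the token embeddings; the truncation flag and the pooling choice are illustrative assumptions, not the only option):

import numpy as np
from transformers.pipelines import pipeline

pretrained_model = pipeline("feature-extraction", model="beomi/kcbert-base")

# For each document the pipeline returns a nested list of token embeddings;
# averaging over the token axis yields one fixed-size vector per document.
features = pretrained_model(documents, truncation=True)
embeddings = np.array([np.mean(doc[0], axis=0) for doc in features])

Alternatively, BERTopic's documentation shows that a feature-extraction pipeline can be passed directly as embedding_model, in which case you never call encode yourself.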

WJG100 commented 2 months ago

For the KeyError: 'topics_from' error, I downgraded to the older 0.16.0 release and that solved the problem for me.

smbslt3 commented 2 months ago

When I set the nr_topics="auto" parameter, I encounter the following error:

topic_model = BERTopic(
    embedding_model=sentence_model,
    vectorizer_model=vectorizer_model,
    # min_topic_size = 100,   # Split sentences "All"
    nr_topics="auto",  # Automatically detect the number of topics
    # nr_topics = 10, #40,   # Limit the total number of topics
    top_n_words=10,   # Use the top n words
    calculate_probabilities=True,
    umap_model=umap_model,  # Fix UMAP random state
    hdbscan_model=hdbscan_model  # Set HDBSCAN model
)

When I comment out the nr_topics="auto" line, the error does not occur, and with nr_topics=10 the code runs properly. Setting the parameter to 'auto' is what triggers the KeyError: 'topics_from'.

MaartenGr commented 2 months ago

@smbslt3 Have you tried the PR that I shared above? In my experience, it should fix the issue.

Izaac-Thomas commented 2 months ago

@MaartenGr Hi Maarten! I can't speak on behalf of @smbslt3 but I was experiencing the same issue and the changes to bertopic.py in #2101 fixed the issue for me.

It may also be worth noting, for anybody still facing this issue, that if you installed this library through pip and are trying to update with something along the lines of pip install git+https://github.com/MaartenGr/BERTopic.git@fix_2100 (as @abhinavkulkarni did), this didn't actually update any code for me; I had to manually change the few lines of code in my local site-packages folder in my Anaconda environment.

Once this change is included in an official release (0.16.4) I'd assume that simply running pip install bertopic==0.16.4 will fix the issue for anyone using pip and still experiencing this issue.

Yif18 commented 2 months ago

I'm having the same issue (KeyError: 'topics_from'); my workaround was pip install bertopic==0.16.2. It seems the problem was introduced in the new 0.16.3 release, and I hope it is fixed in the next version.

MaartenGr commented 2 months ago

To everyone facing this issue, make sure you do not have BERTopic installed before you run pip install git+https://github.com/MaartenGr/BERTopic.git@fix_2100. This should install the related PR (#2101) and solve the issue.

Based on this thread, I can confirm that if the PR is correctly installed, it should solve the issue. I intend to release a new version whenever #2105 is also merged into the main branch.
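
After reinstalling, a quick sanity check that the fixed copy is the one actually being imported (note that the version string may still report the last release on the fix branch, so the import path is the more reliable signal):

import bertopic

# Confirm which installed copy Python is importing and what it reports.
print(bertopic.__file__)
print(bertopic.__version__)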

kungmo commented 1 month ago

> To everyone facing this issue, make sure you do not have BERTopic installed before you run pip install git+https://github.com/MaartenGr/BERTopic.git@fix_2100. This should install the related PR (#2101) and solve the issue.
>
> Based on this thread, I can confirm that if the PR is correctly installed, it should solve the issue. I intend to release a new version whenever #2105 is also merged into the main branch.

I also had the same issue. Thanks to your help, I was able to fix the problem. Thank you. I hope this bug is resolved in the 0.16.4 release.