MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

openAI BERTopic coherence score #2139

Open TalaN1993 opened 2 months ago

TalaN1993 commented 2 months ago


Describe the bug

Hi. I am trying to run BERTopic with a GPT model representation. I got the topics, but my issue is the coherence score. I cannot understand what the problem is. Please guide me.


ValueError                                Traceback (most recent call last)
Cell In[20], line 33
     30 topic_words_ids = [[dictionary.token2id[word] for word in topic if word in dictionary.token2id] for topic in topic_words]
     32 # Calculate the coherence score using Gensim's CoherenceModel
---> 33 coherence_model = CoherenceModel(topics=topic_words_ids, texts=texts, dictionary=dictionary, coherence='c_v')
     34 coherence_score = coherence_model.get_coherence()
     36 print(f"Coherence Score: {coherence_score}")

File ~\AppData\Roaming\Python\Python310\site-packages\gensim\models\coherencemodel.py:214, in CoherenceModel.__init__(self, model, topics, texts, corpus, dictionary, window_size, keyed_vectors, coherence, topn, processes)
    212 self._accumulator = None
    213 self._topics = None
--> 214 self.topics = topics
    216 self.processes = processes if processes >= 1 else max(1, mp.cpu_count() - 1)

File ~\AppData\Roaming\Python\Python310\site-packages\gensim\models\coherencemodel.py:429, in CoherenceModel.topics(self, topics)
    427 new_topics = []
    428 for topic in topics:
--> 429     topic_token_ids = self._ensure_elements_are_ids(topic)
    430     new_topics.append(topic_token_ids)
    432 if self.model is not None:

File ~\AppData\Roaming\Python\Python310\site-packages\gensim\models\coherencemodel.py:453, in CoherenceModel._ensure_elements_are_ids(self, topic)
    451     return np.array(ids_from_ids)
    452 else:
--> 453     raise ValueError('unable to interpret topic as either a list of tokens or a list of ids')

ValueError: unable to interpret topic as either a list of tokens or a list of ids



### BERTopic Version

0.16.3

MaartenGr commented 2 months ago

This seems to be an issue with Gensim rather than with BERTopic. Specifically, it seems that CoherenceModel expects a different input than what you are giving it. The Gensim documentation should have the specifics. Also note that there is an existing open issue with a more specific example: https://github.com/MaartenGr/BERTopic/issues/90
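Judging from the traceback, the error is raised in `_ensure_elements_are_ids`, which gives up when it cannot match a topic's entries to the dictionary. One likely trigger (an assumption, not confirmed in this thread) is a topic that ends up empty after filtering, for example because an LLM label is a full phrase that never appears as a single token. A minimal sketch of a pre-check, where `clean_topics` is a hypothetical helper and `vocabulary` stands in for `dictionary.token2id`:

```python
def clean_topics(topic_words, vocabulary, min_words=2):
    """Keep only in-vocabulary tokens and drop topics that become too short.

    topic_words: list of topics, each a list of token strings.
    vocabulary:  any container supporting `in`, e.g. gensim's dictionary.token2id.
    """
    cleaned = []
    for topic in topic_words:
        kept = [word for word in topic if word in vocabulary]
        if len(kept) >= min_words:
            cleaned.append(kept)
    return cleaned


# Toy data: the second "topic" is an LLM-style label, not a list of tokens.
vocabulary = {"solar": 0, "wind": 1, "energy": 2, "grid": 3}
topic_words = [
    ["solar", "wind", "energy"],
    ["Renewable energy policy"],
    ["grid", "tariff"],
]

topics_for_coherence = clean_topics(topic_words, vocabulary)
print(topics_for_coherence)
```

The cleaned token lists can then be passed as `topics=` to `CoherenceModel`. Converting them to ids beforehand should not be needed, since the traceback shows Gensim attempting that conversion itself.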

One last tip: coherence works on individual words, not on labels, so it is generally advised to compute it on keyword-based representations such as c-TF-IDF or KeyBERTInspired. For a label (like the one generated by an LLM), you would likely have to rely on human evaluation or perhaps even LLM-powered evaluation.
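To score individual words rather than labels, the keyword lists first have to be pulled out of the topic representation. The sketch below assumes a mapping shaped like the output of BERTopic's `get_topics()` (topic id mapped to a list of `(word, score)` pairs) and assumes that this representation holds keywords, for example from c-TF-IDF or KeyBERTInspired, rather than LLM labels. `extract_topic_words` is a hypothetical helper, and the scores are made up for illustration.

```python
def extract_topic_words(topics, top_n=10):
    """Return one list of keywords per topic, skipping the outlier topic -1."""
    words_per_topic = []
    for topic_id, word_scores in sorted(topics.items()):
        if topic_id == -1:  # BERTopic's outlier topic
            continue
        # Drop empty placeholder strings, then keep the first top_n words.
        words = [word for word, _ in word_scores if word][:top_n]
        if words:
            words_per_topic.append(words)
    return words_per_topic


# Stand-in for topic_model.get_topics(); values are illustrative only.
topics = {
    -1: [("the", 0.01), ("and", 0.01)],
    0: [("solar", 0.12), ("wind", 0.10), ("", 1e-05)],
    1: [("loan", 0.09), ("bank", 0.08)],
}

topic_words = extract_topic_words(topics)
print(topic_words)
```

If the main representation was replaced by the LLM, one option is to keep a keyword representation next to it and evaluate coherence on the keywords only.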