MIND-Lab / OCTIS

OCTIS: Comparing Topic Models is Simple! A python package to optimize and evaluate topic models (accepted at EACL2021 demo track)
MIT License
705 stars 98 forks

Error calculating coherence score for BERTopic model trained on Indic language #120

Open sanketshinde0707 opened 6 months ago

sanketshinde0707 commented 6 months ago

Description

I am working with BERTopic and trying to evaluate topic models trained on Marathi (an Indic language) using some metrics. I found evaluation code written by MaartenGr (the author of BERTopic), but unfortunately I was not able to install the dependencies of the setup described there (https://github.com/MaartenGr/BERTopic_evaluation/tree/main). The author recommended using OCTIS instead, as it provides more metrics. I tried calculating topic diversity and the NPMI coherence score. Topic diversity is calculated fine, but I keep getting an error when calculating the NPMI score.

Here is my code

from octis.evaluation_metrics.coherence_metrics import Coherence
from octis.evaluation_metrics.diversity_metrics import TopicDiversity

# This is what the sentence array looks like
sentence_array = ['तीन दिवस झाले, पण गाडी अजून सापडली नाही. पोलिसांचा कडक तपास सुरु आहे.', 'डाळी भारतीय थाळीमध्ये सामील असलेले मुख्य भोजन आहेत.']

# This is what the topics look like
topics_list = [
['ठाकरे', 'एक', 'भारतीय', 'दिवस', 'शिंदे', 'सांगितले', 'दोन', 'माहिती', 'देण्यात', 'जात'],
['भारतीय', 'शिंदे', 'ठाकरे', 'मुख्यमंत्री', 'उद्धव', 'एक', 'पोलीस', 'धावा', 'दोन', 'सरकार'],
['देण्यात', 'फोन', 'डेटा', 'कॅमेरा', 'स्मार्टफोन', 'सादर', 'डिस्प्ले', 'सेन्सर', 'सपोर्ट', 'बॅटरी']
]

octis_texts = [sentence_array]
npmi = Coherence(texts = octis_texts, topk = 10, measure = 'c_npmi')
octis_output = {"topics": topics_list}
topic_diversity = TopicDiversity(topk=10)

topic_diversity_score = topic_diversity.score(octis_output)
print("Topic diversity: "+str(topic_diversity_score))

npmi_score = npmi.score(octis_output)
print("Coherence: "+str(npmi_score))
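From the gensim error, my current guess is that `Coherence` expects each document in `texts` to be a list of tokens rather than a raw string, so that the topic words can be matched against the dictionary it builds. A minimal sketch of the preprocessing I think is needed (plain whitespace tokenization; the preprocessing OCTIS actually expects may differ):

```python
# Raw Marathi sentences, one string per document.
sentence_array = [
    'तीन दिवस झाले, पण गाडी अजून सापडली नाही. पोलिसांचा कडक तपास सुरु आहे.',
    'डाळी भारतीय थाळीमध्ये सामील असलेले मुख्य भोजन आहेत.',
]

# Split each document into a list of tokens, so texts is a
# list of token lists (the shape gensim builds its dictionary from),
# instead of wrapping the whole sentence list in another list.
octis_texts = [sentence.split() for sentence in sentence_array]

print(octis_texts[1][0])
```

With this shape, topic words such as 'भारतीय' or 'दिवस' would appear as individual tokens in `octis_texts`, which is what I assume `Coherence(texts=octis_texts, ...)` needs. I have not confirmed this is the intended input format, so corrections welcome.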

Error

This is the error I get.

Topic diversity: 0.8857142857142857
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
[<ipython-input-68-c000efdb667a>](https://localhost:8080/#) in <cell line: 5>()
      3 print("Topic diversity: "+str(topic_diversity_score))
      4 
----> 5 npmi_score = npmi.score(octis_output)
      6 print("Coherence: "+str(npmi_score))

3 frames
[/usr/local/lib/python3.10/dist-packages/gensim/models/coherencemodel.py](https://localhost:8080/#) in _ensure_elements_are_ids(self, topic)
    452             return np.array(ids_from_ids)
    453         else:
--> 454             raise ValueError('unable to interpret topic as either a list of tokens or a list of ids')
    455 
    456     def _update_accumulator(self, new_topics):

ValueError: unable to interpret topic as either a list of tokens or a list of ids

Can anyone point out what exactly is wrong here, and how I can evaluate BERTopic models trained on Indic languages?

Thanks.

jiezhao2002 commented 3 months ago

Hey, I've encountered the same issue - have you resolved it yet?