MIND-Lab / OCTIS

OCTIS: Comparing Topic Models is Simple! A python package to optimize and evaluate topic models (accepted at EACL2021 demo track)
MIT License

Different Coherence Score in gensim and OCTIS #91

Closed: p-dre closed this issue 1 year ago

p-dre commented 1 year ago

Description

I wanted to test whether switching from OCTIS to gensim for the coherence calculation would reduce the computation time, since OCTIS does not let me set the number of processes directly. I think I use the same procedure for both calculations. Nevertheless, the scores are different.

What I Did

from octis.evaluation_metrics.coherence_metrics import Coherence
npmi = Coherence(texts=dataset.get_corpus(), topk=10, measure="c_npmi")
npmi.score(topics_dict)
-0.12705743104058387

from gensim.models.coherencemodel import CoherenceModel
from gensim.corpora.dictionary import Dictionary
text = dataset.get_corpus()
dicti = Dictionary(text)
cm = CoherenceModel(topics=topics_dict, coherence='c_npmi', dictionary=dicti, texts=text, processes=1, topn=10)
cm.get_coherence()
0.0692187417114627

silviatti commented 1 year ago

Hello, that is odd, because OCTIS uses gensim to compute the coherence (so I don't think you'll see a decrease in computation time). I will try to replicate it with another dataset.

silviatti commented 1 year ago

I tried with the M10 dataset and the topics below, and OCTIS and gensim gave me the same result.

[
 ['network', 'neural', 'gas', 'application', 'base', 'control', 'gene', 'system', 'datum', 'expression'],
 ['network', 'system', 'model', 'neural', 'analysis', 'decision', 'structure', 'control', 'base', 'datum'],
 ['model', 'system', 'control', 'base', 'decision', 'approach', 'time', 'dynamic', 'effect', 'market']
]

p-dre commented 1 year ago

@silviatti yes, exactly, that's why I was wondering. Thanks for checking. I was able to find the error in my data preparation.

OCTIS uses gensim to calculate the coherence, but via OCTIS I can't set the number of CPU processes, whereas via gensim I can (by increasing processes). In addition, when comparing different models in OCTIS, the dictionary is recalculated every time. If I use gensim directly, I can avoid this, which is especially practical with large amounts of text.

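import time
from gensim.corpora.dictionary import Dictionary
from gensim.models.coherencemodel import CoherenceModel

start = time.time()
tokens = dataset.get_corpus()
# Build the dictionary once so it can be reused when scoring other models
dictionary = Dictionary(tokens)
cm = CoherenceModel(topics=topics_dict['topics'],
                    coherence='c_npmi',
                    dictionary=dictionary,
                    texts=tokens,
                    processes=16,  # number of worker processes
                    topn=10)

print(cm.get_coherence())
end = time.time()
print(end - start)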

-0.11693189518567472
19.66211247444153

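from octis.evaluation_metrics.coherence_metrics import Coherence

start = time.time()
# Same corpus and topics as above, scored through OCTIS without a processes setting
npmi = Coherence(texts=tokens, topk=10, measure="c_npmi")
print(npmi.score(topics_dict))
end = time.time()
print(end - start)

-0.11693189518567472
81.55563473701477
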
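
To illustrate the point about computing corpus statistics once and reusing them across models, here is a minimal pure-Python sketch. It is not gensim's or OCTIS's implementation: it assumes a simplified NPMI based on document-level co-occurrence (gensim's c_npmi uses a sliding window instead), and the helper names build_doc_index and npmi_coherence, the toy corpus, and the two topic sets are all hypothetical. The scores it produces will therefore not match the gensim numbers above.

```python
import math
from itertools import combinations


def build_doc_index(texts):
    """Map each word to the set of ids of the documents that contain it."""
    index = {}
    for doc_id, tokens in enumerate(texts):
        for word in set(tokens):
            index.setdefault(word, set()).add(doc_id)
    return index


def npmi_coherence(topics, index, num_docs, topn=10):
    """Average pairwise NPMI per topic, then average over topics.

    Simplified assumption: probabilities are document frequencies,
    not the sliding-window counts that gensim's c_npmi uses.
    """
    topic_scores = []
    for topic in topics:
        pair_scores = []
        for w1, w2 in combinations(topic[:topn], 2):
            docs1 = index.get(w1, set())
            docs2 = index.get(w2, set())
            p12 = len(docs1 & docs2) / num_docs
            if p12 == 0.0:
                # The words never co-occur: NPMI tends to -1
                pair_scores.append(-1.0)
            elif p12 == 1.0:
                # Both words are in every document: 0/0, treated as 0 here
                pair_scores.append(0.0)
            else:
                p1 = len(docs1) / num_docs
                p2 = len(docs2) / num_docs
                pair_scores.append(math.log(p12 / (p1 * p2)) / -math.log(p12))
        if pair_scores:  # skip topics with fewer than two words
            topic_scores.append(sum(pair_scores) / len(pair_scores))
    return sum(topic_scores) / len(topic_scores)


# Hypothetical toy corpus and topic sets from two models
toy_texts = [["a", "b"], ["a", "b"], ["c", "d"], ["c"]]

# The index is built once ...
index = build_doc_index(toy_texts)

# ... and reused for every model that is scored against the same corpus
model_a_topics = [["a", "b"], ["c", "d"]]
model_b_topics = [["a", "c"]]
score_a = npmi_coherence(model_a_topics, index, len(toy_texts))
score_b = npmi_coherence(model_b_topics, index, len(toy_texts))
print(score_a, score_b)
```

The design choice mirrors the gensim snippet above: the expensive pass over the corpus happens once, and each additional model only pays for looking up its own topic words.
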
silviatti commented 1 year ago

Hi, thanks for finding the problem. I've added a processes parameter, like the one in gensim, to the Coherence metric definition, so you can now specify the number of processes in OCTIS as well. This will be available from the next release.

Silvia