Closed (p-dre closed this issue 1 year ago)
Hello, that's odd, because OCTIS uses gensim to compute the coherence (so I don't think you'll see a decrease in computation time). I will try to replicate it with another dataset.
I tried with the M10 dataset and these topics, and I got the same result:
[
['network', 'neural', 'gas', 'application', 'base', 'control', 'gene', 'system', 'datum', 'expression'],
['network', 'system', 'model', 'neural', 'analysis', 'decision', 'structure', 'control', 'base', 'datum'],
['model', 'system', 'control', 'base', 'decision', 'approach', 'time', 'dynamic', 'effect', 'market']
]
@silviatti yes, exactly, that's why I was wondering. Thanks for checking. I was able to find the error in my data preparation.
OCTIS uses gensim to calculate the coherence, but in OCTIS I can't set the number of CPUs, whereas in gensim I can (by increasing processes). In addition, when comparing different models in OCTIS, the dictionary is recalculated every time. If I use gensim directly, I can avoid this, which is especially practical with large amounts of text.
import time
from gensim.corpora import Dictionary
from gensim.models import CoherenceModel

start = time.time()
tokens = dataset.get_corpus()    # list of tokenized documents
dictionary = Dictionary(tokens)  # built once; can be reused across models
cm = CoherenceModel(topics=topics_dict['topics'],
                    coherence='c_npmi',
                    dictionary=dictionary,
                    texts=tokens,
                    processes=16,
                    topn=10)
print(cm.get_coherence())
end = time.time()
print(end - start)
-0.11693189518567472
19.66211247444153
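from octis.evaluation_metrics.coherence_metrics import Coherence

# Reuses `tokens` and `topics_dict` from the gensim run above
start = time.time()
npmi = Coherence(texts=tokens, topk=10, measure="c_npmi")
print(npmi.score(topics_dict))
end = time.time()
print(end - start)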
-0.11693189518567472
81.55563473701477
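For timing comparisons like the two runs above, a small helper can keep the start/stop bookkeeping in one place. The following is only a sketch using the standard library; the names `timed` and `timings` are hypothetical and not part of gensim or OCTIS, and the toy workload stands in for the actual coherence call.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Store the wall-clock duration of the enclosed block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


timings = {}

# Placeholder workload; in the thread this would be cm.get_coherence()
# or npmi.score(topics_dict).
with timed("toy workload", timings):
    total = sum(i * i for i in range(200_000))

for label, seconds in timings.items():
    print(f"{label}: {seconds:.3f} s")
```

`time.perf_counter()` is used here instead of `time.time()` because it is a monotonic clock intended for measuring intervals; for runs lasting tens of seconds either one gives comparable numbers.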
Hi, thanks for finding out the problem. I've added the parameter processes, as in gensim, to the Coherence metric definition. So now you can specify the number of processes in OCTIS as well. This will be available from the next release.
Silvia
Description
I wanted to test whether switching from OCTIS to gensim for the coherence calculation would decrease the computation time, since I can't access processes directly in OCTIS. I think I use the same procedure for the calculation. Nevertheless, the score is different.
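For context on what is being scored: c_npmi averages normalized pointwise mutual information (NPMI) over the word pairs in a topic. The sketch below is a simplified illustration only. It counts co-occurrence per document, whereas gensim's c_npmi uses a sliding window over the texts, so its numbers will not match gensim or OCTIS. The function name `npmi_coherence` and the toy documents are made up for this example.

```python
import math
from itertools import combinations


def npmi_coherence(topic, documents):
    """Mean NPMI over all word pairs in `topic`, with document-level counts."""
    doc_sets = [set(doc) for doc in documents]
    n_docs = len(doc_sets)

    def prob(*words):
        # Fraction of documents that contain every word in `words`
        return sum(all(w in d for w in words) for d in doc_sets) / n_docs

    scores = []
    for w1, w2 in combinations(topic, 2):
        p1, p2, p12 = prob(w1), prob(w2), prob(w1, w2)
        if p12 == 0.0:
            scores.append(-1.0)  # convention: the pair never co-occurs
        elif p12 == 1.0:
            scores.append(1.0)   # limit value: the pair is in every document
        else:
            scores.append(math.log(p12 / (p1 * p2)) / -math.log(p12))
    return sum(scores) / len(scores)


# Toy corpus (made up for illustration)
docs = [
    ["neural", "network", "model"],
    ["neural", "network", "control"],
    ["market", "decision", "model"],
    ["market", "decision", "effect"],
]

print(npmi_coherence(["neural", "network"], docs))  # always together
print(npmi_coherence(["neural", "market"], docs))   # never together
```

Because the score depends entirely on how the texts are tokenized and counted, any difference in data preparation between two runs changes the result even when the topics are identical.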
What I Did