MIND-Lab / OCTIS

OCTIS: Comparing Topic Models is Simple! A python package to optimize and evaluate topic models (accepted at EACL2021 demo track)
MIT License

doc2bow error when running lda optimizer described in your docs #121

Closed: fadhleryani closed this issue 6 months ago

fadhleryani commented 6 months ago

Description

I'm trying to test the optimizer described in your docs, following the steps exactly (except for changing dataset.load to dataset.fetch_dataset), but I get the following error: TypeError: doc2bow expects an array of unicode tokens on input, not a single string

What I Did

from skopt.space.space import Real
from octis.evaluation_metrics.coherence_metrics import Coherence
from octis.models.LDA import LDA
from octis.optimization.optimizer import Optimizer
from octis.dataset.dataset import Dataset  # needed for Dataset() below

optimizer = Optimizer()

model = LDA()
model.hyperparameters.update({"num_topics": 20})

dataset = Dataset()
dataset.fetch_dataset("M10")

metric_parameters = {
    'texts': dataset.get_corpus(),
    'topk': 10,
    'measure': 'c_npmi'
}
npmi = Coherence(metric_parameters)

search_space = {
    "alpha": Real(low=0.001, high=5.0),
    "eta": Real(low=0.001, high=5.0)
}

optimization_result = optimizer.optimize(model,
                                         dataset,
                                         npmi,
                                         search_space,
                                         number_of_call=10,
                                         n_random_starts=3,
                                         model_runs=3,
                                         save_name="result",
                                         surrogate_model="RF",
                                         acq_func="LCB"
                                         )

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[11], line 16
      9 model.hyperparameters.update({"num_topics": 20})
     11 metric_parameters = {
     12     'texts': dataset.get_corpus(),
     13     'topk': 10,
     14     'measure': 'c_npmi'
     15 }
---> 16 npmi = Coherence(metric_parameters)
     18 search_space = {
     19     "alpha": Real(low=0.001, high=5.0),
     20     "eta": Real(low=0.001, high=5.0)
     21 }
     23 optimization_result = optimizer.optimize(model,
     24                                          dataset,
     25                                          npmi,
   (...)
     32                                          acq_func="LCB"
     33                                          )

File /opt/homebrew/Caskroom/miniconda/base/envs/octis/lib/python3.10/site-packages/octis/evaluation_metrics/coherence_metrics.py:34, in Coherence.__init__(self, texts, topk, processes, measure)
     32 else:
     33     self._texts = texts
---> 34 self._dictionary = Dictionary(self._texts)
     35 self.topk = topk
     36 self.processes = processes

File /opt/homebrew/Caskroom/miniconda/base/envs/octis/lib/python3.10/site-packages/gensim/corpora/dictionary.py:78, in Dictionary.__init__(self, documents, prune_at)
     75 self.num_nnz = 0
     77 if documents is not None:
---> 78     self.add_documents(documents, prune_at=prune_at)
     79     self.add_lifecycle_event(
     80         "created",
     81         msg=f"built {self} from {self.num_docs} documents (total {self.num_pos} corpus positions)",
     82     )

File /opt/homebrew/Caskroom/miniconda/base/envs/octis/lib/python3.10/site-packages/gensim/corpora/dictionary.py:204, in Dictionary.add_documents(self, documents, prune_at)
    201         logger.info("adding document #%i to %s", docno, self)
    203     # update Dictionary with the document
--> 204     self.doc2bow(document, allow_update=True)  # ignore the result, here we only care about updating token ids
    206 logger.info("built %s from %i documents (total %i corpus positions)", self, self.num_docs, self.num_pos)

File /opt/homebrew/Caskroom/miniconda/base/envs/octis/lib/python3.10/site-packages/gensim/corpora/dictionary.py:241, in Dictionary.doc2bow(self, document, allow_update, return_missing)
    209 """Convert `document` into the bag-of-words (BoW) format = list of `(token_id, token_count)` tuples.
    210
    211 Parameters
   (...)
    238
    239 """
    240 if isinstance(document, str):
--> 241     raise TypeError("doc2bow expects an array of unicode tokens on input, not a single string")
    243 # Construct (word, frequency) mapping.
    244 counter = defaultdict(int)

TypeError: doc2bow expects an array of unicode tokens on input, not a single string

fadhleryani commented 6 months ago

OK, got it. I just needed to unpack the dictionary into keyword arguments by changing this line:

npmi = Coherence(metric_parameters) to npmi = Coherence(**metric_parameters)
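
For anyone hitting the same error, here is a minimal sketch of why the `**` matters. It does not use OCTIS or gensim: `CoherenceSketch` is a hypothetical stand-in that only copies the parameter names from the `Coherence.__init__(self, texts, topk, processes, measure)` signature in the traceback, and its default values and toy corpus are assumptions made for illustration. Passing the dict positionally binds the whole dict to `texts`, and iterating over a dict yields its string keys, which is what the `doc2bow` check rejects.

```python
# Hypothetical stand-in for the constructor signature seen in the traceback.
# This is NOT the OCTIS class; the default values here are assumptions.
class CoherenceSketch:
    def __init__(self, texts=None, topk=10, processes=1, measure="c_npmi"):
        # Mimic the gensim check from the traceback: every document must be
        # a list of tokens, not a single string.
        for document in texts:
            if isinstance(document, str):
                raise TypeError(
                    "doc2bow expects an array of unicode tokens on input, "
                    "not a single string"
                )
        self.texts = texts
        self.topk = topk
        self.processes = processes
        self.measure = measure


metric_parameters = {
    "texts": [["topic", "model"], ["coherence", "metric"]],  # toy corpus
    "topk": 10,
    "measure": "c_npmi",
}

# Positional call: the whole dict is bound to `texts`. Iterating a dict
# yields its keys ('texts', 'topk', 'measure'), which are plain strings.
try:
    CoherenceSketch(metric_parameters)
    positional_error = None
except TypeError as exc:
    positional_error = str(exc)

# Unpacked call: each key is bound to the parameter of the same name.
npmi = CoherenceSketch(**metric_parameters)
```

In this sketch the positional call fails with the same message as in the report, while the unpacked call binds `texts`, `topk`, and `measure` as intended.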