OCTIS: Comparing Topic Models is Simple! A python package to optimize and evaluate topic models (accepted at EACL2021 demo track)
Missing info for custom data yields key error in hypersearch #4

tenggaard commented 3 years ago

Hi Octis Team,

Thanks for making this available!

When providing a custom dataset for an LDA hyperparameter search, I get: KeyError: 'info'

This is not the case when I run a single model (no hypersearch), nor when I fetch the M10 dataset and use this.

If I manually add an info entry with a name for the dataset to the metadata attribute of the custom dataset, the hyperparameter search works fine.

Perhaps the required metadata could be auto-filled when providing custom data?

Best, Thyge

Code and traceback:

# Load modules
from octis.dataset.dataset import Dataset
from octis.models.LDA import LDA
from octis.evaluation_metrics.coherence_metrics import Coherence
from octis.optimization.optimizer import Optimizer
from skopt.space.space import Categorical

# Load custom dataset
dataset = Dataset()

# Initiate model
model = LDA(alpha=0.5, eta=0.5)  

# Define search space
search_space = {"num_topics": Categorical({15, 20, 25, 30})}

# Set number of runs

# Define evaluation metric
npmi = Coherence(texts=dataset.get_corpus())

# Hypersearch
optimizer = Optimizer()
optimization_result = optimizer.optimize(
    model, dataset, npmi, search_space, number_of_call=optimization_runs, 
    model_runs=model_runs, save_models=True, 
    extra_metrics=None, # to keep track of other metrics
    save_path=str(_models / 'Octis' / 'LDA'))

Current call:  0
KeyError                                  Traceback (most recent call last)
<ipython-input-225-087bc04aa55a> in <module>
      1 # Hypersearch
      2 optimizer=Optimizer()
----> 3 optimization_result = optimizer.optimize(
      4     model, dataset, npmi, search_space, number_of_call=optimization_runs,
      5     model_runs=model_runs, save_models=True,

~/anaconda3/lib/python3.8/site-packages/octis/optimization/optimizer.py in optimize(self, model, dataset, metric, search_space, extra_metrics, number_of_call, n_random_starts, initial_point_generator, optimization_type, model_runs, surrogate_model, kernel, acq_func, random_state, x0, y0, save_models, save_step, save_name, save_path, early_stop, early_step, plot_best_seen, plot_model, plot_name, log_scale_plot, topk)
    159         # Perform Bayesian Optimization
--> 160         results = self._optimization_loop(opt)
    162         return results

~/anaconda3/lib/python3.8/site-packages/octis/optimization/optimizer.py in _optimization_loop(self, opt)
    300             # Create an object related to the BO optimization
--> 301             results = OptimizerEvaluation(self, BO_results=res)
    303             # Save the object

~/anaconda3/lib/python3.8/site-packages/octis/optimization/optimizer_evaluation.py in __init__(self, optimizer, BO_results)
     45         # Info about optimization
     46         self.info = dict()
---> 47         dataset_info = optimizer.dataset.get_metadata()["info"]
     48         if dataset_info is not None:
     49             self.info.update({"dataset_name": dataset_info["name"]})

KeyError: 'info'

Adding this after loading custom data fixes the problem:

# Load existing metadata
meta_dict = dataset.get_metadata()

# Add name to dict
meta_dict['info'] = {'name':'dataset_name'}

# Update metadata
dataset._Dataset__metadata = meta_dict

# Verify info is updated
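
A note on why the workaround writes to dataset._Dataset__metadata: the name suggests the attribute is declared as __metadata inside the Dataset class, and Python mangles such names to _ClassName__attr. The sketch below illustrates this with a hypothetical stand-in class, not the real OCTIS Dataset; in particular, whether get_metadata() returns the internal dict itself or a copy is an assumption here.

```python
class Dataset:
    """Hypothetical stand-in for octis.dataset.dataset.Dataset (not the real class)."""

    def __init__(self):
        # Two leading underscores trigger name mangling:
        # the attribute is actually stored as _Dataset__metadata.
        self.__metadata = {"total_documents": 3}

    def get_metadata(self):
        # Assumption: returns the internal dict itself, not a copy.
        return self.__metadata


dataset = Dataset()

# From outside the class, the attribute is reachable only under its mangled name.
assert not hasattr(dataset, "__metadata")
assert dataset._Dataset__metadata is dataset.get_metadata()

# Same steps as the workaround above.
meta_dict = dataset.get_metadata()
meta_dict["info"] = {"name": "dataset_name"}
dataset._Dataset__metadata = meta_dict

# The lookup that raised KeyError: 'info' now succeeds.
print(dataset.get_metadata()["info"])
```

Under that assumption the final assignment is redundant, since meta_dict is already the dataset's own dict, but keeping it is harmless and also covers the case where a copy is returned.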
silviatti commented 3 years ago

Hello Thyge, thanks for using OCTIS and for reporting this error!

Your solution works well, but I decided to handle this issue directly in the optimizer_evaluation.py file. I have just released version 1.2.1 of the library. The bug shouldn't occur anymore :)
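
For readers on an older version, here is a minimal sketch of the kind of tolerant lookup that avoids the KeyError. It is not the actual change shipped in 1.2.1; the helper name dataset_name_from_metadata is hypothetical and it works on plain dicts, so it can be tried without OCTIS installed.

```python
def dataset_name_from_metadata(metadata):
    """Return the dataset name, or None when 'info' or 'name' is absent.

    Hypothetical helper for illustration; not part of OCTIS.
    """
    # dict.get returns None instead of raising KeyError for a missing key.
    dataset_info = metadata.get("info")
    if not dataset_info:
        return None
    return dataset_info.get("name")


# Metadata as a custom dataset might provide it: no 'info' entry.
custom_metadata = {"total_documents": 100}
# Metadata that does carry a dataset name.
named_metadata = {"info": {"name": "M10"}}

info = {}
for metadata in (custom_metadata, named_metadata):
    name = dataset_name_from_metadata(metadata)
    if name is not None:
        # Only record the name when it is actually available.
        info.update({"dataset_name": name})

print(info)
```

The point of the design is that a missing name is treated as "nothing to record" rather than as an error, so the optimization results can still be built for custom data.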

