MIND-Lab / OCTIS

OCTIS: Comparing Topic Models is Simple! A python package to optimize and evaluate topic models (accepted at EACL2021 demo track)
MIT License

Missing info for custom data yields key error in hypersearch #4

Closed (tenggaard closed this issue 3 years ago)

tenggaard commented 3 years ago

Hi Octis Team,

Thanks for making this available!

When providing a custom dataset for an LDA hyperparameter search, I get: KeyError: 'info'

The error does not occur when I run a single model (no hypersearch), nor when I fetch the M10 dataset and use that instead.

If I manually add an info entry with a name for the dataset to the metadata attribute of the custom dataset, the hyperparameter search works fine.

Perhaps the required metadata could be auto-filled when providing custom data?

Best, Thyge

Code and traceback:

# Load modules
from octis.dataset.dataset import Dataset
from octis.models.LDA import LDA
from octis.evaluation_metrics.coherence_metrics import Coherence
from octis.optimization.optimizer import Optimizer
from skopt.space.space import Categorical

# Load custom dataset
# (`_dp` here and `_models` further down are my own path variables, not defined in this snippet)
dataset = Dataset()
dataset.load_custom_dataset_from_folder(str(_dp))

# Initiate model
model = LDA(alpha=0.5, eta=0.5)  

# Define search space
search_space = {"num_topics": Categorical([15, 20, 25, 30])}  # list keeps the category order fixed

# Set number of runs
optimization_runs = 15
model_runs = 1

# Define evaluation metric
npmi = Coherence(texts=dataset.get_corpus())

# Hypersearch
optimizer=Optimizer()
optimization_result = optimizer.optimize(
    model, dataset, npmi, search_space, number_of_call=optimization_runs, 
    model_runs=model_runs, save_models=True, 
    extra_metrics=None, # to keep track of other metrics
    save_path=str(_models / 'Octis' / 'LDA'))

Current call:  0
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-225-087bc04aa55a> in <module>
      1 # Hypersearch
      2 optimizer=Optimizer()
----> 3 optimization_result = optimizer.optimize(
      4     model, dataset, npmi, search_space, number_of_call=optimization_runs,
      5     model_runs=model_runs, save_models=True,

~/anaconda3/lib/python3.8/site-packages/octis/optimization/optimizer.py in optimize(self, model, dataset, metric, search_space, extra_metrics, number_of_call, n_random_starts, initial_point_generator, optimization_type, model_runs, surrogate_model, kernel, acq_func, random_state, x0, y0, save_models, save_step, save_name, save_path, early_stop, early_step, plot_best_seen, plot_model, plot_name, log_scale_plot, topk)
    158 
    159         # Perform Bayesian Optimization
--> 160         results = self._optimization_loop(opt)
    161 
    162         return results

~/anaconda3/lib/python3.8/site-packages/octis/optimization/optimizer.py in _optimization_loop(self, opt)
    299 
    300             # Create an object related to the BO optimization
--> 301             results = OptimizerEvaluation(self, BO_results=res)
    302 
    303             # Save the object

~/anaconda3/lib/python3.8/site-packages/octis/optimization/optimizer_evaluation.py in __init__(self, optimizer, BO_results)
     45         # Info about optimization
     46         self.info = dict()
---> 47         dataset_info = optimizer.dataset.get_metadata()["info"]
     48         if dataset_info is not None:
     49             self.info.update({"dataset_name": dataset_info["name"]})

KeyError: 'info'

Adding this after loading custom data fixes the problem:

# Load existing metadata
meta_dict = dataset.get_metadata()

# Add name to dict
meta_dict['info'] = {'name':'dataset_name'}

# Update metadata
dataset._Dataset__metadata = meta_dict

# Verify info is updated
dataset.get_info()
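
The steps above can be wrapped in a small helper so the workaround is applied once, right after loading. This is only a sketch: `ensure_dataset_info` is a hypothetical name and not part of OCTIS, and the `Dataset` class below is a stand-in that mimics only the two things the workaround touches (`get_metadata()` and the name-mangled `_Dataset__metadata` attribute), so the snippet runs without OCTIS installed. The `some_key` entry is made-up placeholder metadata.

```python
class Dataset:
    """Stand-in for octis.dataset.dataset.Dataset (assumption: only the
    pieces used by the workaround are mimicked here)."""

    def __init__(self, metadata=None):
        # Inside the class body this is stored as _Dataset__metadata
        self.__metadata = dict(metadata or {})

    def get_metadata(self):
        return self.__metadata


def ensure_dataset_info(dataset, name):
    """Hypothetical helper: add an 'info' entry with a name if it is missing."""
    meta_dict = dict(dataset.get_metadata() or {})  # copy the existing metadata
    info = meta_dict.get("info") or {}
    info.setdefault("name", name)  # an existing name is left untouched
    meta_dict["info"] = info
    # Same private-attribute assignment as in the workaround above
    dataset._Dataset__metadata = meta_dict
    return dataset


custom = ensure_dataset_info(Dataset({"some_key": 1}), "dataset_name")
print(custom.get_metadata()["info"])  # {'name': 'dataset_name'}
```

With the real OCTIS class, the stand-in would be dropped and the helper called on the dataset returned by `load_custom_dataset_from_folder`, before passing it to `optimizer.optimize`.
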

silviatti commented 3 years ago

Hello Thyge, thanks for using OCTIS and for reporting this error!

Your solution works well, but I decided to handle this issue directly in the optimizer_evaluation.py file. I have just released version 1.2.1 of the library. The bug shouldn't occur anymore :)
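
The actual change in version 1.2.1 is not shown in this thread. Purely as an illustration, one way to make the lookup at line 47 of the traceback tolerate missing metadata is to swap the `["info"]` indexing for `dict.get`. In this sketch `build_info` is a hypothetical function name, and plain dictionaries stand in for whatever `get_metadata()` returns.

```python
def build_info(metadata):
    """Hypothetical sketch: collect optimization info without assuming 'info' exists."""
    info = dict()
    # .get returns None when the key is absent instead of raising KeyError
    dataset_info = (metadata or {}).get("info")
    if dataset_info is not None:
        info.update({"dataset_name": dataset_info.get("name")})
    return info


print(build_info({"info": {"name": "M10"}}))  # {'dataset_name': 'M10'}
print(build_info({}))                          # {} (custom data without 'info')
```

A guard of this kind keeps the existing `is not None` check meaningful: with direct indexing, that check could never be reached for a dataset that lacks the key.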

Best,

Silvia