MIND-Lab / OCTIS

OCTIS: Comparing Topic Models is Simple! A python package to optimize and evaluate topic models (accepted at EACL2021 demo track)
MIT License

Exception raised when running tutorial 'How to optimize the hyperparameters of a neural topic model (CTM on M10)' #7

Closed. tenggaard closed this issue 3 years ago.

tenggaard commented 3 years ago

Hi Octis team,

When I run your tutorial on my local server (in a Jupyter notebook), I get an exception. I get the same exception when training a single model (no hyperparameter search) on custom data.

I have attempted to locate the problem, but when I reproduce the individual steps manually, they run fine. I would otherwise be happy to make a pull request, but I am not sure what is going on here...

One odd observation: CTM.load_bert_data(bert_train_path, train, bert_model) runs before CTMDataset(x_train.toarray(), b_train, idx2token) in preprocess (see below), and bert_embeddings_from_list from models/contextualized_topic_models/utils/data_preparation.py defaults to show_progress_bar=True, yet the exception is raised before any progress bar appears (my guess at why is sketched after the snippet below).

    def preprocess(vocab, train, bert_model, test=None, validation=None,
                   bert_train_path=None, bert_test_path=None, bert_val_path=None):
        vocab2id = {w: i for i, w in enumerate(vocab)}
        vec = CountVectorizer(
            vocabulary=vocab2id, token_pattern=r'(?u)\b\w+\b')
        entire_dataset = train.copy()
        if test is not None:
            entire_dataset.extend(test)
        if validation is not None:
            entire_dataset.extend(validation)

        vec.fit(entire_dataset)
        idx2token = {v: k for (k, v) in vec.vocabulary_.items()}

        x_train = vec.transform(train)
        b_train = CTM.load_bert_data(bert_train_path, train, bert_model)

        train_data = dataset.CTMDataset(x_train.toarray(), b_train, idx2token)
        input_size = len(idx2token.keys())

The tutorial code that raises the exception:

from octis.models.CTM import CTM
from octis.dataset.dataset import Dataset
from octis.optimization.optimizer import Optimizer
from skopt.space.space import Real, Categorical, Integer
from octis.evaluation_metrics.coherence_metrics import Coherence

dataset = Dataset()
dataset.fetch_dataset("M10")

model = CTM(num_topics=10, num_epochs=30, inference_type='zeroshot', bert_model="bert-base-nli-mean-tokens")

npmi = Coherence(texts=dataset.get_corpus())

search_space = {"num_layers": Categorical({1, 2, 3}), 
                "num_neurons": Categorical({100, 200, 300}),
                "activation": Categorical({'sigmoid', 'relu', 'softplus'}), 
                "dropout": Real(0.0, 0.95)
}

optimization_runs=30
model_runs=1

optimizer=Optimizer()
optimization_result = optimizer.optimize(
    model, dataset, npmi, search_space, number_of_call=optimization_runs, 
    model_runs=model_runs, save_models=True, 
    extra_metrics=None, # to keep track of other metrics
    save_path='results/test_ctm//')

Current call:  0
---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
<ipython-input-46-7718f92a8020> in <module>
      1 optimizer=Optimizer()
----> 2 optimization_result = optimizer.optimize(
      3     model, dataset, npmi, search_space, number_of_call=optimization_runs,
      4     model_runs=model_runs, save_models=True,
      5     extra_metrics=None, # to keep track of other metrics

~/anaconda3/lib/python3.8/site-packages/octis/optimization/optimizer.py in optimize(self, model, dataset, metric, search_space, extra_metrics, number_of_call, n_random_starts, initial_point_generator, optimization_type, model_runs, surrogate_model, kernel, acq_func, random_state, x0, y0, save_models, save_step, save_name, save_path, early_stop, early_step, plot_best_seen, plot_model, plot_name, log_scale_plot, topk)
    158 
    159         # Perform Bayesian Optimization
--> 160         results = self._optimization_loop(opt)
    161 
    162         return results

~/anaconda3/lib/python3.8/site-packages/octis/optimization/optimizer.py in _optimization_loop(self, opt)
    283             else:
    284                 next_x = opt.ask()
--> 285                 f_val = self._objective_function(next_x)
    286 
    287             # Update the opt using (next_x,f_val)

~/anaconda3/lib/python3.8/site-packages/octis/optimization/optimizer.py in _objective_function(self, hyperparameter_values)
    214 
    215             # Prepare model
--> 216             model_output = self.model.train_model(self.dataset, params,
    217                                                   self.topk)
    218             # Score of the model

~/anaconda3/lib/python3.8/site-packages/octis/models/CTM.py in train_model(self, dataset, hyperparameters, top_words)
     80             self.vocab = dataset.get_vocabulary()
     81             self.X_train, self.X_test, self.X_valid, input_size = \
---> 82                 self.preprocess(self.vocab, data_corpus_train, test=data_corpus_test,
     83                                 validation=data_corpus_validation,
     84                                 bert_train_path=self.hyperparameters['bert_path'] + "_train.pkl",

~/anaconda3/lib/python3.8/site-packages/octis/models/CTM.py in preprocess(vocab, train, bert_model, test, validation, bert_train_path, bert_test_path, bert_val_path)
    178         b_train = CTM.load_bert_data(bert_train_path, train, bert_model)
    179 
--> 180         train_data = dataset.CTMDataset(x_train.toarray(), b_train, idx2token)
    181         input_size = len(idx2token.keys())
    182 

~/anaconda3/lib/python3.8/site-packages/octis/models/contextualized_topic_models/datasets/dataset.py in __init__(self, X, X_bert, idx2token)
     15         """
     16         if X.shape[0] != len(X_bert):
---> 17             raise Exception("Wait! BoW and Contextual Embeddings have different sizes! "
     18                             "You might want to check if the BoW preparation method has removed some documents. ")
     19 

Exception: Wait! BoW and Contextual Embeddings have different sizes! You might want to check if the BoW preparation method has removed some documents.

My reproduction, which works fine:

def preprocess(vocab, train, bert_model, test=None, validation=None,
               bert_train_path=None, bert_test_path=None, bert_val_path=None):
    vocab2id = {w: i for i, w in enumerate(vocab)}
    vec = CountVectorizer(
        vocabulary=vocab2id, token_pattern=r'(?u)\b\w+\b')
    entire_dataset = train.copy()
    if test is not None:
        entire_dataset.extend(test)
    if validation is not None:
        entire_dataset.extend(validation)

    vec.fit(entire_dataset)
    idx2token = {v: k for (k, v) in vec.vocabulary_.items()}

    x_train = vec.transform(train)
    b_train = bert_embeddings_from_list(train, bert_model)

    train_data = CTMDataset(x_train.toarray(), b_train, idx2token)
    input_size = len(idx2token.keys())

    if test is not None and validation is not None:
        x_test = vec.transform(test)
        b_test = bert_embeddings_from_list(test, bert_model)
        test_data = CTMDataset(x_test.toarray(), b_test, idx2token)

        x_valid = vec.transform(validation)
        b_val = bert_embeddings_from_list(validation, bert_model)
        valid_data = CTMDataset(x_valid.toarray(), b_val, idx2token)
        return train_data, test_data, valid_data, input_size
    if test is None and validation is not None:
        x_valid = vec.transform(validation)
        b_val = bert_embeddings_from_list(validation, bert_model)
        valid_data = CTMDataset(x_valid.toarray(), b_val, idx2token)
        return train_data, valid_data, input_size
    if test is not None and validation is None:
        x_test = vec.transform(test)
        b_test = bert_embeddings_from_list(test, bert_model)
        test_data = CTMDataset(x_test.toarray(), b_test, idx2token)
        return train_data, test_data, input_size
    if test is None and validation is None:
        return train_data, input_size

def bert_embeddings_from_list(texts, sbert_model_to_load="bert-base-nli-mean-tokens", batch_size=100):
    """
    Creates SBERT Embeddings from a list
    """
    model = SentenceTransformer(sbert_model_to_load)
    return np.array(model.encode(texts, show_progress_bar=True, batch_size=batch_size))

import torch
from torch.utils.data import Dataset
import scipy.sparse

class CTMDataset(Dataset):

    """Class to load BOW dataset."""

    def __init__(self, X, X_bert, idx2token):
        """
        Args
            X : array-like, shape=(n_samples, n_features)
                Document word matrix.
        """
        if X.shape[0] != len(X_bert):
            raise Exception("Wait! BoW and Contextual Embeddings have different sizes! "
                            "You might want to check if the BoW preparation method has removed some documents. ")

        self.X = X
        self.X_bert = X_bert
        self.idx2token = idx2token

    def __len__(self):
        """Return length of dataset."""
        return self.X.shape[0]

    def __getitem__(self, i):
        """Return sample from dataset at index i."""
        if type(self.X[i]) == scipy.sparse.csr.csr_matrix:
            X = torch.FloatTensor(self.X[i].todense())
            X_bert = torch.FloatTensor(self.X_bert[i])
        else:
            X = torch.FloatTensor(self.X[i])
            X_bert = torch.FloatTensor(self.X_bert[i])

        return {'X': X, 'X_bert': X_bert}

from octis.models.CTM import CTM
from octis.dataset.dataset import Dataset
from octis.optimization.optimizer import Optimizer
from skopt.space.space import Real, Categorical, Integer
from octis.evaluation_metrics.coherence_metrics import Coherence

dataset = Dataset()
dataset.fetch_dataset("M10")

train, validation, test = dataset.get_partitioned_corpus(use_validation=True)

data_corpus_train = [' '.join(i) for i in train]
data_corpus_test = [' '.join(i) for i in test]
data_corpus_validation = [' '.join(i) for i in validation]

vocab = dataset.get_vocabulary()
X_train, X_test, X_valid, input_size = \
    preprocess(vocab, data_corpus_train, test=data_corpus_test,
                validation=data_corpus_validation,
                bert_train_path=""+"_train.pkl",
                bert_test_path=""+"_test.pkl",
                bert_val_path=""+"_val.pkl",
                bert_model='bert-base-nli-mean-tokens')

Batches: 100% 59/59 [00:08<00:00, 7.10it/s]
Batches: 100% 13/13 [00:01<00:00, 6.62it/s]
Batches: 100% 13/13 [00:00<00:00, 28.11it/s]

silviatti commented 3 years ago

Hi! CTM needs the contextualized representations of the documents as input. The parameter "bert_path" indicates the path where they are stored, if they already exist, or where to store them (in the latter case, the representations are computed with the sentence-transformers library and saved). We did this to avoid recomputing the document representations on every run, but I see now that it may cause some problems. (Also, I need to fix the documentation of CTM.)

Is it possible that you already have files named "_train.pkl", "_test.pkl" and "_val.pkl" that correspond to a different dataset? In that case, CTM would load those files even though they belong to a different dataset and throw the exception above, because the number of cached embeddings no longer matches the number of documents in the BoW matrix.
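
If that turns out to be the case, a quick workaround is to delete the stale files or to give each dataset its own bert_path prefix. This is just a sketch: it assumes the default bert_path is the empty string, as in the traceback above (so the cached files sit in the current working directory), and it assumes bert_path can be passed to the CTM constructor.

import os
from octis.models.CTM import CTM

# Delete stale cached embeddings left over from a previous dataset.
# The file names follow the bert_path + suffix pattern visible in the traceback.
for suffix in ("_train.pkl", "_test.pkl", "_val.pkl"):
    if os.path.exists(suffix):
        os.remove(suffix)

# Or give each dataset its own cache prefix so different runs cannot collide
# (treating bert_path as a constructor argument here is an assumption).
model = CTM(num_topics=10, num_epochs=30, inference_type='zeroshot',
            bert_model="bert-base-nli-mean-tokens", bert_path="m10_ctm")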

Let me know if this is the case and we'll figure out a way to fix this.

Bye

Silvia

tenggaard commented 3 years ago

Hi Silvia,

That was indeed the case - thanks for the support!

Best, Thyge