MIND-Lab / OCTIS

OCTIS: Comparing Topic Models is Simple! A python package to optimize and evaluate topic models (accepted at EACL2021 demo track)
MIT License

Exception raised when running tutorial 'How to optimize the hyperparameters of a neural topic model (CTM on M10)' #7

Closed. tenggaard closed this issue 3 years ago.

tenggaard commented 3 years ago

Hi Octis team,

When I run your tutorial on my local server (in a Jupyter notebook), I get an exception. I get the same exception when training a single model (no hyperparameter search) on custom data.

I have attempted to locate the problem, but when I reproduce the individual steps manually, they run fine. I would otherwise be happy to make a pull request, but I am not sure what is going on here...

One odd observation: CTM.load_bert_data(bert_train_path, train, bert_model) runs before CTMDataset(x_train.toarray(), b_train, idx2token) in preprocess (see below), and bert_embeddings_from_list from models/contextualized_topic_models/utils/data_preparation.py defaults to show_progress_bar=True, yet the exception is raised before any progress bar appears (my guess at why is sketched after the snippet below).

    def preprocess(vocab, train, bert_model, test=None, validation=None,
                   bert_train_path=None, bert_test_path=None, bert_val_path=None):
        vocab2id = {w: i for i, w in enumerate(vocab)}
        vec = CountVectorizer(
            vocabulary=vocab2id, token_pattern=r'(?u)\b\w+\b')
        entire_dataset = train.copy()
        if test is not None:
            entire_dataset.extend(test)
        if validation is not None:
            entire_dataset.extend(validation)

        vec.fit(entire_dataset)
        idx2token = {v: k for (k, v) in vec.vocabulary_.items()}

        x_train = vec.transform(train)
        b_train = CTM.load_bert_data(bert_train_path, train, bert_model)

        train_data = dataset.CTMDataset(x_train.toarray(), b_train, idx2token)
        input_size = len(idx2token.keys())

The tutorial code that raises the exception:

from octis.models.CTM import CTM
from octis.dataset.dataset import Dataset
from octis.optimization.optimizer import Optimizer
from skopt.space.space import Real, Categorical, Integer
from octis.evaluation_metrics.coherence_metrics import Coherence

dataset = Dataset()
dataset.fetch_dataset("M10")

model = CTM(num_topics=10, num_epochs=30, inference_type='zeroshot', bert_model="bert-base-nli-mean-tokens")

npmi = Coherence(texts=dataset.get_corpus())

search_space = {"num_layers": Categorical({1, 2, 3}), 
                "num_neurons": Categorical({100, 200, 300}),
                "activation": Categorical({'sigmoid', 'relu', 'softplus'}), 
                "dropout": Real(0.0, 0.95)
}

optimization_runs=30
model_runs=1

optimizer=Optimizer()
optimization_result = optimizer.optimize(
    model, dataset, npmi, search_space, number_of_call=optimization_runs, 
    model_runs=model_runs, save_models=True, 
    extra_metrics=None, # to keep track of other metrics
    save_path='results/test_ctm//')

Current call:  0
---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
<ipython-input-46-7718f92a8020> in <module>
      1 optimizer=Optimizer()
----> 2 optimization_result = optimizer.optimize(
      3     model, dataset, npmi, search_space, number_of_call=optimization_runs,
      4     model_runs=model_runs, save_models=True,
      5     extra_metrics=None, # to keep track of other metrics

~/anaconda3/lib/python3.8/site-packages/octis/optimization/optimizer.py in optimize(self, model, dataset, metric, search_space, extra_metrics, number_of_call, n_random_starts, initial_point_generator, optimization_type, model_runs, surrogate_model, kernel, acq_func, random_state, x0, y0, save_models, save_step, save_name, save_path, early_stop, early_step, plot_best_seen, plot_model, plot_name, log_scale_plot, topk)
    158 
    159         # Perform Bayesian Optimization
--> 160         results = self._optimization_loop(opt)
    161 
    162         return results

~/anaconda3/lib/python3.8/site-packages/octis/optimization/optimizer.py in _optimization_loop(self, opt)
    283             else:
    284                 next_x = opt.ask()
--> 285                 f_val = self._objective_function(next_x)
    286 
    287             # Update the opt using (next_x,f_val)

~/anaconda3/lib/python3.8/site-packages/octis/optimization/optimizer.py in _objective_function(self, hyperparameter_values)
    214 
    215             # Prepare model
--> 216             model_output = self.model.train_model(self.dataset, params,
    217                                                   self.topk)
    218             # Score of the model

~/anaconda3/lib/python3.8/site-packages/octis/models/CTM.py in train_model(self, dataset, hyperparameters, top_words)
     80             self.vocab = dataset.get_vocabulary()
     81             self.X_train, self.X_test, self.X_valid, input_size = \
---> 82                 self.preprocess(self.vocab, data_corpus_train, test=data_corpus_test,
     83                                 validation=data_corpus_validation,
     84                                 bert_train_path=self.hyperparameters['bert_path'] + "_train.pkl",

~/anaconda3/lib/python3.8/site-packages/octis/models/CTM.py in preprocess(vocab, train, bert_model, test, validation, bert_train_path, bert_test_path, bert_val_path)
    178         b_train = CTM.load_bert_data(bert_train_path, train, bert_model)
    179 
--> 180         train_data = dataset.CTMDataset(x_train.toarray(), b_train, idx2token)
    181         input_size = len(idx2token.keys())
    182 

~/anaconda3/lib/python3.8/site-packages/octis/models/contextualized_topic_models/datasets/dataset.py in __init__(self, X, X_bert, idx2token)
     15         """
     16         if X.shape[0] != len(X_bert):
---> 17             raise Exception("Wait! BoW and Contextual Embeddings have different sizes! "
     18                             "You might want to check if the BoW preparation method has removed some documents. ")
     19 

Exception: Wait! BoW and Contextual Embeddings have different sizes! You might want to check if the BoW preparation method has removed some documents.

My reproduction, which works fine:

def preprocess(vocab, train, bert_model, test=None, validation=None,
               bert_train_path=None, bert_test_path=None, bert_val_path=None):
    vocab2id = {w: i for i, w in enumerate(vocab)}
    vec = CountVectorizer(
        vocabulary=vocab2id, token_pattern=r'(?u)\b\w+\b')
    entire_dataset = train.copy()
    if test is not None:
        entire_dataset.extend(test)
    if validation is not None:
        entire_dataset.extend(validation)

    vec.fit(entire_dataset)
    idx2token = {v: k for (k, v) in vec.vocabulary_.items()}

    x_train = vec.transform(train)
    b_train = bert_embeddings_from_list(train, bert_model)

    train_data = CTMDataset(x_train.toarray(), b_train, idx2token)
    input_size = len(idx2token.keys())

    if test is not None and validation is not None:
        x_test = vec.transform(test)
        b_test = bert_embeddings_from_list(test, bert_model)
        test_data = CTMDataset(x_test.toarray(), b_test, idx2token)

        x_valid = vec.transform(validation)
        b_val = bert_embeddings_from_list(validation, bert_model)
        valid_data = CTMDataset(x_valid.toarray(), b_val, idx2token)
        return train_data, test_data, valid_data, input_size
    if test is None and validation is not None:
        x_valid = vec.transform(validation)
        b_val = bert_embeddings_from_list(validation, bert_model)
        valid_data = CTMDataset(x_valid.toarray(), b_val, idx2token)
        return train_data, valid_data, input_size
    if test is not None and validation is None:
        x_test = vec.transform(test)
        b_test = bert_embeddings_from_list(test, bert_model)
        test_data = CTMDataset(x_test.toarray(), b_test, idx2token)
        return train_data, test_data, input_size
    if test is None and validation is None:
        return train_data, input_size

def bert_embeddings_from_list(texts, sbert_model_to_load="bert-base-nli-mean-tokens", batch_size=100):
    """
    Creates SBERT Embeddings from a list
    """
    model = SentenceTransformer(sbert_model_to_load)
    return np.array(model.encode(texts, show_progress_bar=True, batch_size=batch_size))

import torch
from torch.utils.data import Dataset
import scipy.sparse

class CTMDataset(Dataset):

    """Class to load BOW dataset."""

    def __init__(self, X, X_bert, idx2token):
        """
        Args
            X : array-like, shape=(n_samples, n_features)
                Document word matrix.
        """
        if X.shape[0] != len(X_bert):
            raise Exception("Wait! BoW and Contextual Embeddings have different sizes! "
                            "You might want to check if the BoW preparation method has removed some documents. ")

        self.X = X
        self.X_bert = X_bert
        self.idx2token = idx2token

    def __len__(self):
        """Return length of dataset."""
        return self.X.shape[0]

    def __getitem__(self, i):
        """Return sample from dataset at index i."""
        if type(self.X[i]) == scipy.sparse.csr.csr_matrix:
            X = torch.FloatTensor(self.X[i].todense())
            X_bert = torch.FloatTensor(self.X_bert[i])
        else:
            X = torch.FloatTensor(self.X[i])
            X_bert = torch.FloatTensor(self.X_bert[i])

        return {'X': X, 'X_bert': X_bert}

from octis.models.CTM import CTM
from octis.dataset.dataset import Dataset
from octis.optimization.optimizer import Optimizer
from skopt.space.space import Real, Categorical, Integer
from octis.evaluation_metrics.coherence_metrics import Coherence

dataset = Dataset()
dataset.fetch_dataset("M10")

train, validation, test = dataset.get_partitioned_corpus(use_validation=True)

data_corpus_train = [' '.join(i) for i in train]
data_corpus_test = [' '.join(i) for i in test]
data_corpus_validation = [' '.join(i) for i in validation]

vocab = dataset.get_vocabulary()
X_train, X_test, X_valid, input_size = \
    preprocess(vocab, data_corpus_train, test=data_corpus_test,
                validation=data_corpus_validation,
                bert_train_path=""+"_train.pkl",
                bert_test_path=""+"_test.pkl",
                bert_val_path=""+"_val.pkl",
                bert_model='bert-base-nli-mean-tokens')

Batches: 100% 59/59 [00:08<00:00, 7.10it/s]
Batches: 100% 13/13 [00:01<00:00, 6.62it/s]
Batches: 100% 13/13 [00:00<00:00, 28.11it/s]

silviatti commented 3 years ago

Hi! CTM needs the contextualized representations of the documents as input. The parameter "bert_path" indicates the path where they are stored, if they already exist, or where to store them (in the latter case, the representations are computed with the sentence-transformers library and saved). We did this to avoid recomputing the document representations on every run, but I see now that it may cause some problems. (Also, I need to fix the documentation of CTM.)

Is it possible that you already have files named "_train.pkl", "_test.pkl" and "_val.pkl" that correspond to a different dataset? In that case, CTM would load those files even though they belong to a different dataset and throw the exception above, because the number of cached embeddings no longer matches the number of documents in the BoW matrix.
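
If that turns out to be the case, a quick workaround is to delete the stale files or to give each dataset its own bert_path prefix. This is just a sketch: it assumes the default bert_path is the empty string, as in the traceback above (so the cached files sit in the current working directory), and it assumes bert_path can be passed to the CTM constructor.

import os
from octis.models.CTM import CTM

# Delete stale cached embeddings left over from a previous dataset.
# The file names follow the bert_path + suffix pattern visible in the traceback.
for suffix in ("_train.pkl", "_test.pkl", "_val.pkl"):
    if os.path.exists(suffix):
        os.remove(suffix)

# Or give each dataset its own cache prefix so different runs cannot collide
# (treating bert_path as a constructor argument here is an assumption).
model = CTM(num_topics=10, num_epochs=30, inference_type='zeroshot',
            bert_model="bert-base-nli-mean-tokens", bert_path="m10_ctm")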

Let me know if this is the case and we'll figure out a way to fix this.

Bye

Silvia

tenggaard commented 3 years ago

Hi Silvia,

That was indeed the case - thanks for the support!

Best, Thyge