Hi! CTM needs the contextualized representations of the documents as input. The parameter "bert_path" indicates the path where they are stored, if they already exist, or where to store them (in that case, CTM downloads the representations using the sentence-transformers library). We did this to avoid repeatedly downloading the document representations, but I see now that it can cause problems. (Also, I need to fix the documentation of CTM.)
Is it possible that you already have files named "_train.pkl", "_test.pkl" and "_val.pkl" that correspond to a different dataset? In that case, CTM would load those files anyway and throw the exception above, because the vocabulary sizes no longer match.
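For example, a quick way to rule this out (just a sketch; the cache location, the model arguments, and the dataset-specific prefix are assumptions on my side) is to delete the leftover pickles or give each dataset its own `bert_path`:

```python
import os
from octis.models.CTM import CTM

# Remove stale cached embeddings left over from a previous dataset
# (assuming they were written to the working directory with the default prefix)
for leftover in ("_train.pkl", "_test.pkl", "_val.pkl"):
    if os.path.exists(leftover):
        os.remove(leftover)

# Or, to avoid collisions entirely, point bert_path at a dataset-specific prefix
model = CTM(num_topics=10, bert_path="embeddings/my_dataset/")
```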
Let me know if this is the case and we'll figure out a way to fix this.
Bye
Silvia
Hi Silvia,
That was indeed the case - thanks for the support!
Best,
Thyge
Hi Octis team,
When I run your tutorial on my local server (Jupyter notebook), I get an exception. I get the same exception when training a single model (no hyperparameter search) on custom data.
I have attempted to locate the problem, but when I reproduce the individual steps they run fine. Otherwise, I'd be happy to make a pull request, but I'm not sure what is going on here...
One odd observation: CTM.load_bert_data(bert_train_path, train, bert_model) runs before CTMDataset(x_train.toarray(), b_train, idx2token) in preprocess (see below), and bert_embeddings_from_list from models/contextualized_topic_models/utils/data_preparation.py defaults to show_progress_bar=True, yet the exception is thrown before any progress bar appears.
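For reference, this is roughly the call order I mean (a reconstructed sketch, not the actual tutorial code; the import paths, the toy corpus, and the sentence-transformers model name are my own assumptions):

```python
from sklearn.feature_extraction.text import CountVectorizer
from octis.models.contextualized_topic_models.utils.data_preparation import bert_embeddings_from_list
from octis.models.contextualized_topic_models.datasets.dataset import CTMDataset

# Toy corpus standing in for the real training split
train = ["the cat sat on the mat", "dogs chase cats", "a simple topic model example"]

# Bag-of-words matrix and index-to-token mapping
vectorizer = CountVectorizer()
x_train = vectorizer.fit_transform(train)
idx2token = {i: t for t, i in vectorizer.vocabulary_.items()}

# 1) Contextualized embeddings are computed (show_progress_bar defaults to True here)
b_train = bert_embeddings_from_list(train, "paraphrase-distilroberta-base-v1")

# 2) Only afterwards are BoW and embeddings combined into the CTM dataset,
#    which is where a vocabulary-size mismatch would surface
training_dataset = CTMDataset(x_train.toarray(), b_train, idx2token)
```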
Tutorial that yields the exception:
My reproduction, which works fine: