MilaNLProc / contextualized-topic-models

A Python package to run contextualized topic modeling. CTMs combine contextualized embeddings (e.g., BERT) with topic models to get coherent topics. Published at EACL and ACL 2021 (Bianchi et al.).
MIT License

If the remainder of the batch size is 1, the code throws a ValueError #103

Closed: amirmohammadkz closed this issue 2 years ago

amirmohammadkz commented 2 years ago

Description

I was trying to train a CTM on my dataset, but got a ValueError. I tried a similar dataset with a different number of samples and it worked. I think the problem is that the number of samples is not divisible by the batch size and the remainder is 1.
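For illustration, here is the arithmetic behind the failure (hypothetical numbers, not taken from my dataset): with the package's default batch_size of 64, a corpus whose size leaves a remainder of 1 produces a final training batch with a single document.

n_docs, batch_size = 65, 64  # hypothetical corpus size and the default batch size
print(n_docs % batch_size)   # 1 -> the last training batch contains exactly one sample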

What I Did

  File "C:\Users\ColdFire\Documents\sleepHealth_data_analysis\myvenv\lib\site-packages\psylap\psycholinguistic_extractors\topic_modelling_extractor_CTM.py", line 154, in extract_topic_modelling
    self.ctm.fit(self.training_dataset)  # run the model
  File "C:\Users\ColdFire\Documents\sleepHealth_data_analysis\myvenv\lib\site-packages\contextualized_topic_models\models\ctm.py", line 274, in fit
    sp, train_loss = self._train_epoch(train_loader)
  File "C:\Users\ColdFire\Documents\sleepHealth_data_analysis\myvenv\lib\site-packages\contextualized_topic_models\models\ctm.py", line 194, in _train_epoch
    posterior_log_variance, word_dists, estimated_labels = self.model(X_bow, X_contextual, labels)
  File "C:\Python39\lib\site-packages\torch\nn\modules\module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Users\ColdFire\Documents\sleepHealth_data_analysis\myvenv\lib\site-packages\contextualized_topic_models\networks\decoding_network.py", line 101, in forward
    posterior_mu, posterior_log_sigma = self.inf_net(x, x_bert, labels)
  File "C:\Python39\lib\site-packages\torch\nn\modules\module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Users\ColdFire\Documents\sleepHealth_data_analysis\myvenv\lib\site-packages\contextualized_topic_models\networks\inference_network.py", line 141, in forward
    mu = self.f_mu_batchnorm(self.f_mu(x))
  File "C:\Python39\lib\site-packages\torch\nn\modules\module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Python39\lib\site-packages\torch\nn\modules\batchnorm.py", line 167, in forward
    return F.batch_norm(
  File "C:\Python39\lib\site-packages\torch\nn\functional.py", line 2279, in batch_norm
    _verify_batch_size(input.size())
  File "C:\Python39\lib\site-packages\torch\nn\functional.py", line 2247, in _verify_batch_size
    raise ValueError("Expected more than 1 value per channel when training, got input size {}".format(size))
ValueError: Expected more than 1 value per channel when training, got input size torch.Size([1, 4])
vinid commented 2 years ago

Hello!

Yes, I think this is an issue with the batch norm, which cannot be computed on a single sample. I am not sure if there's an easy workaround for this, but I am happy to work on a fix if you find something that can be used to bypass this issue!
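For reference, the error can be reproduced in isolation with plain PyTorch (a minimal sketch, independent of this package): BatchNorm1d in training mode needs more than one value per channel to estimate batch statistics.

import torch
import torch.nn as nn

bn = nn.BatchNorm1d(4)        # 4 features, matching torch.Size([1, 4]) in the traceback
bn.train()                    # training mode computes per-batch statistics
try:
    bn(torch.randn(1, 4))     # a single-sample batch has no variance to estimate
except ValueError as err:
    print(err)                # Expected more than 1 value per channel when training, ...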

DerekChia commented 2 years ago

Hello, I managed to get around this by setting the batch_size explicitly and dropping the leftover documents that don't fill a complete batch. The default batch_size is 64 (see here), and you can also pass it to the model as a parameter, e.g. batch_size=batch_size below.

from contextualized_topic_models.models.ctm import CombinedTM
from contextualized_topic_models.utils.preprocessing import WhiteSpacePreprocessing

documents = [line.strip() for line in open("unpreprocessed_documents.txt").readlines()]

# Truncate the corpus to a multiple of batch_size so no batch of size 1 is left over
batch_size = 64
documents = documents[: len(documents) // batch_size * batch_size]

sp = WhiteSpacePreprocessing(documents, "english")
preprocessed_documents, unpreprocessed_documents, vocab = sp.preprocess()

....

ctm = CombinedTM(
    bow_size=len(tp.vocab),
    contextual_size=768,
    n_components=100,
    num_epochs=20,
    batch_size=batch_size,  # match the batch size used to truncate the corpus
)
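A library-side alternative (a sketch of a possible fix, not the package's current code) would be to create the training DataLoader with drop_last=True, which silently discards an incomplete final batch instead of crashing on it:

import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy stand-in for the dataset CTM builds internally: 65 samples, so the
# default batch_size of 64 would otherwise leave a final batch of size 1.
dataset = TensorDataset(torch.randn(65, 4))

loader = DataLoader(dataset, batch_size=64, shuffle=True, drop_last=True)
print([len(batch[0]) for batch in loader])  # [64] -> the lone leftover sample is dropped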
vinid commented 2 years ago

Thanks a lot @DerekChia!