MIND-Lab / OCTIS

OCTIS: Comparing Topic Models is Simple! A python package to optimize and evaluate topic models (accepted at EACL2021 demo track)
MIT License

numpy memory error on a dataset with 4M tweets #97

Closed devanshrj closed 1 year ago

devanshrj commented 1 year ago

Description

I want to train CTM on a dataset containing approximately 4 million tweets (with a vocabulary size of approximately 20,000). I get the following error message from the train_model() function: numpy.core._exceptions.MemoryError: Unable to allocate 735. GiB for an array with shape (4344759, 22716) and data type int64

Is there a way to optimize the training process or incrementally train the model (similar to online topic modeling in BERTopic)?

What I Did

Command I ran:

from octis.models.CTM import CTM
from octis.dataset.dataset import Dataset

dataset = Dataset()
dataset.load_custom_dataset_from_folder(dataset_path)

model = CTM(num_topics=num_topics, model_type=model_type, bert_model=bert_model, bert_path=bert_path)
output = model.train_model(dataset)

Traceback:


Traceback (most recent call last):
  File "model.py", line 90, in <module>
    output = model.train_model(dataset)
  File "/home/devanshjain/miniconda3/envs/octis/lib/python3.7/site-packages/octis/models/CTM.py", line 147, in train_model
    bert_model=self.hyperparameters["bert_model"])
  File "/home/devanshjain/miniconda3/envs/octis/lib/python3.7/site-packages/octis/models/CTM.py", line 215, in preprocess
    train_data = dataset.CTMDataset(x_train.toarray(), b_train, idx2token)
  File "/home/devanshjain/miniconda3/envs/octis/lib/python3.7/site-packages/scipy/sparse/compressed.py", line 1039, in toarray
    out = self._process_toarray_args(order, out)
  File "/home/devanshjain/miniconda3/envs/octis/lib/python3.7/site-packages/scipy/sparse/base.py", line 1202, in _process_toarray_args
    return np.zeros(self.shape, dtype=self.dtype, order=order)
numpy.core._exceptions.MemoryError: Unable to allocate 735. GiB for an array with shape (4344759, 22716) and data type int64

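
For reference, the 735 GiB figure follows directly from the shape and dtype in the error message: the traceback shows the sparse bag-of-words matrix being converted with toarray(), which allocates one int64 value for every (document, vocabulary term) pair. A quick back-of-the-envelope check, using only the numbers reported above:

```python
# Shape and dtype are taken from the error message in the traceback.
n_docs, vocab_size = 4_344_759, 22_716
bytes_per_value = 8  # int64

# A dense array stores every cell, including the zeros.
dense_bytes = n_docs * vocab_size * bytes_per_value
dense_gib = dense_bytes / 2**30

print(f"dense int64 array: {dense_gib:.0f} GiB")  # prints "dense int64 array: 735 GiB"
```

So the request is consistent with the data size and not a bug in the size calculation; the cost comes from densifying the whole matrix at once.
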
silviatti commented 1 year ago

Hi, the CTM version in OCTIS is not the latest, and at the moment we have no plans to update it. The original repo has since added some improvements to support larger datasets: https://github.com/MilaNLProc/contextualized-topic-models/pull/124 I'd suggest you use that repo directly.

Hope this helps,

Silvia
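
As a general illustration of why keeping the bag-of-words matrix sparse helps, here is a minimal sketch that densifies only one batch of rows at a time instead of calling toarray() on the full matrix. This is not taken from OCTIS or from the linked pull request (whose implementation may differ); the helper name iter_dense_batches, the batch size, and the random stand-in matrix are all assumptions made for the example.

```python
import numpy as np
from scipy import sparse


def iter_dense_batches(bow, batch_size):
    """Yield dense float32 slices of a sparse matrix, one batch of rows at a time.

    Hypothetical helper for illustration; not part of OCTIS or CTM.
    """
    bow = bow.tocsr()  # CSR supports efficient row slicing
    for start in range(0, bow.shape[0], batch_size):
        # Only batch_size x vocab_size values are dense at any moment.
        yield bow[start:start + batch_size].toarray().astype(np.float32)


# Small random stand-in for a bag-of-words matrix (sizes are illustrative only).
bow = sparse.random(1000, 500, density=0.01, format="csr", random_state=0)

total_rows = 0
max_batch_rows = 0
for batch in iter_dense_batches(bow, batch_size=256):
    total_rows += batch.shape[0]
    max_batch_rows = max(max_batch_rows, batch.shape[0])

print(total_rows, max_batch_rows)
```

With this pattern, peak dense memory scales with batch_size times the vocabulary size, not with the number of documents.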