numpy memory error on a dataset with 4M tweets

OCTIS version: 1.11.1
Python version: 3.7.16
Operating System: Ubuntu 16.04.7 LTS

Description

I want to train CTM on a dataset of approximately 4 million tweets (vocabulary size of approximately 20,000). The train_model() function raises the following error: numpy.core._exceptions.MemoryError: Unable to allocate 735. GiB for an array with shape (4344759, 22716) and data type int64
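
For reference, the reported size is what a fully dense int64 array of that shape would need. A quick arithmetic check, using only the shape and dtype from the error message (the variable names are just for illustration):

```python
# Size of a dense int64 array with the shape from the error message.
rows, cols = 4344759, 22716   # documents x vocabulary terms
bytes_per_int64 = 8

dense_bytes = rows * cols * bytes_per_int64
dense_gib = dense_bytes / 2**30  # bytes -> GiB

print(f"{dense_gib:.1f} GiB")  # prints 735.3 GiB
```

So the failure is the dense allocation itself rather than anything specific to the tweet data.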

Is there a way to optimize the training process or incrementally train the model (similar to online topic modeling in BERTopic)?

What I Did

Command I ran:

from octis.models.CTM import CTM
from octis.dataset.dataset import Dataset

dataset = Dataset()
dataset.load_custom_dataset_from_folder(dataset_path)

model = CTM(num_topics=num_topics, model_type=model_type, bert_model=bert_model, bert_path=bert_path)
output = model.train_model(dataset)

Traceback:


Traceback (most recent call last):
  File "model.py", line 90, in <module>
    output = model.train_model(dataset)
  File "/home/devanshjain/miniconda3/envs/octis/lib/python3.7/site-packages/octis/models/CTM.py", line 147, in train_model
    bert_model=self.hyperparameters["bert_model"])
  File "/home/devanshjain/miniconda3/envs/octis/lib/python3.7/site-packages/octis/models/CTM.py", line 215, in preprocess
    train_data = dataset.CTMDataset(x_train.toarray(), b_train, idx2token)
  File "/home/devanshjain/miniconda3/envs/octis/lib/python3.7/site-packages/scipy/sparse/compressed.py", line 1039, in toarray
    out = self._process_toarray_args(order, out)
  File "/home/devanshjain/miniconda3/envs/octis/lib/python3.7/site-packages/scipy/sparse/base.py", line 1202, in _process_toarray_args
    return np.zeros(self.shape, dtype=self.dtype, order=order)
numpy.core._exceptions.MemoryError: Unable to allocate 735. GiB for an array with shape (4344759, 22716) and data type int64
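
The traceback shows the allocation happening at the toarray() call inside preprocess in CTM.py, where the whole sparse bag-of-words matrix is densified at once. Below is a minimal sketch of one possible direction, assuming the matrix could stay sparse and only row batches are densified when needed. iter_dense_batches is a hypothetical helper, not part of OCTIS, and the small random matrix is a synthetic stand-in for the real data; actually using this idea would require changes to how CTM builds its dataset.

```python
import numpy as np
from scipy import sparse


def iter_dense_batches(x_sparse, batch_size):
    """Yield dense row blocks of a sparse matrix, one batch at a time.

    Hypothetical helper (not an OCTIS function): only `batch_size` rows
    are ever held in dense form at once.
    """
    x_csr = x_sparse.tocsr()  # CSR supports efficient row slicing
    for start in range(0, x_csr.shape[0], batch_size):
        yield x_csr[start:start + batch_size].toarray()


# Synthetic stand-in for the document-term matrix (assumption: 1000 x 200).
x_train = sparse.random(
    1000, 200, density=0.01, format="csr", random_state=0, dtype=np.float32
)

total = 0.0
max_batch_bytes = 0
for batch in iter_dense_batches(x_train, batch_size=64):
    total += float(batch.sum())
    max_batch_bytes = max(max_batch_bytes, batch.nbytes)

print(max_batch_bytes)  # dense memory per batch, independent of total row count
```

With this pattern the peak dense memory is batch_size x vocabulary size x item size, instead of number of documents x vocabulary size x item size.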

MIND-Lab / OCTIS, issue #97
