MIND-Lab / OCTIS

OCTIS: Comparing Topic Models is Simple! A Python package to optimize and evaluate topic models (accepted at the EACL 2021 demo track)
MIT License

ETM model corpus size #66

Closed thomaskir closed 2 years ago

thomaskir commented 2 years ago

Description

Hello, I'm currently working with the ETM model on two different corpora. The model trains fine on the smaller corpus (corpus 10 MB, vocabulary 1.5 MB), but training on the larger corpus (corpus 95 MB, vocabulary 27 MB) crashes the notebook. I suppose this has to do with the size of the corpus, as the data is preprocessed in exactly the same way. How can this be solved so that the ETM model can be applied to the larger corpus as well? Thank you in advance! Best, Thomas

What I Did

The line: 
model_output = model.train_model(dataset)

returns the following error message:

RuntimeError: [enforce fail at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\c10\core\impl\alloc_cpu.cpp:81] data. DefaultCPUAllocator: not enough memory: you tried to allocate 6497200864 bytes.
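
For reference, the surrounding setup follows the usual OCTIS pattern; the dataset folder path and the number of topics below are illustrative, not the exact values used here:

from octis.dataset.dataset import Dataset
from octis.models.ETM import ETM

# Load the preprocessed corpus (path is illustrative).
dataset = Dataset()
dataset.load_custom_dataset_from_folder("path/to/large_corpus")

# Train ETM; this is the call that raises the allocation error on the large corpus.
model = ETM(num_topics=20)
model_output = model.train_model(dataset)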
silviatti commented 2 years ago

Hello, approximately how many words does the larger vocabulary contain? We integrated ETM into OCTIS but kept the original implementation, which is not optimized for large corpora. My suggestion would be to perform an additional preprocessing pass on the data to remove infrequent words and thus reduce the vocabulary size. But it is possible that this is not only a problem of the vocabulary size, but also of the number of documents.
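
A minimal sketch of such a preprocessing step using OCTIS's Preprocessing class (the max_features value and file paths below are illustrative and would need tuning for this corpus):

import string
from octis.preprocessing.preprocessing import Preprocessing

# Rebuild the dataset with a capped vocabulary: max_features keeps only
# the most frequent words, which directly reduces ETM's memory footprint.
preprocessor = Preprocessing(
    max_features=5000,               # illustrative cap on the vocabulary size
    remove_punctuation=True,
    punctuation=string.punctuation,
    lemmatize=True,
    stopword_list='english',
    min_chars=3,                     # drop very short tokens
    min_words_docs=5)                # drop documents that become too short
dataset = preprocessor.preprocess_dataset(documents_path='corpus.tsv')
dataset.save('preprocessed_dataset')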

Hope this helped,

Silvia

thomaskir commented 2 years ago

Dear Silvia, thank you for the quick response. The corpus contains 603 documents comprising 15,789,485 tokens. The vocabulary consists of 2,491,259 tokens. Best, Thomas
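
For a rough sense of scale (assuming, purely for illustration, that the corpus is held as a dense documents-by-vocabulary float32 matrix), these numbers already imply a bag-of-words representation of about 6 GB, the same order of magnitude as the 6.5 GB allocation in the traceback:

# Back-of-the-envelope memory estimate using the figures from the thread.
num_docs = 603
vocab_size = 2_491_259
bytes_per_float32 = 4

dense_bow_bytes = num_docs * vocab_size * bytes_per_float32
print(dense_bow_bytes / 1e9)  # ~6.0 GB for a dense docs x vocab float32 matrix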