Description

Hello, I'm currently working with the ETM model on two different corpora. The model trains without problems on the smaller corpus (corpus 10 MB, vocabulary 1.5 MB), but on the larger corpus (corpus 95 MB, vocabulary 27 MB) the notebook crashes during training. Since both corpora are preprocessed in exactly the same way, I suspect the crash is caused by the size of the corpus. How can this be solved so that the ETM model can be applied to the larger corpus as well? Thank you in advance! Best, Thomas
Hello, approximately how many words does the larger vocabulary contain? We integrated ETM into OCTIS, but we kept the original implementation, which is not optimized for large corpora. My suggestion would be to add a preprocessing step that removes infrequent words and thus reduces the vocabulary size, along the lines of the sketch below. That said, the problem may not only be the vocabulary size but also the number of documents.
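For illustration, here is a minimal sketch of such a frequency filter using scikit-learn's CountVectorizer (not part of OCTIS; the file names and the min_df / max_features thresholds are placeholders to adapt to your data, and the corpus is assumed to be stored as one whitespace-tokenized document per line):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical input file: one preprocessed document per line.
with open("corpus.txt", encoding="utf-8") as f:
    docs = [line.strip() for line in f]

# Tokenize on whitespace (matching str.split below), keep only words
# that occur in at least 5 documents, and cap the vocabulary at the
# 50,000 most frequent terms.
vectorizer = CountVectorizer(lowercase=False, token_pattern=r"\S+",
                             min_df=5, max_features=50_000)
vectorizer.fit(docs)
keep = set(vectorizer.vocabulary_)

# Rewrite each document with out-of-vocabulary words removed.
filtered = [" ".join(w for w in doc.split() if w in keep) for doc in docs]

with open("corpus_filtered.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(filtered))
```

Since most of a very large vocabulary typically consists of words that occur only once or twice, even a small min_df should already shrink it considerably; the filtered file can then be loaded into OCTIS as usual.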
Hope this helped,
Silvia
Dear Silvia, thank you for the quick response. The corpus contains 603 documents comprising 15,789,485 tokens, and the vocabulary consists of 2,491,259 unique words. Best, Thomas