MilaNLProc / contextualized-topic-models

A Python package to run contextualized topic modeling. CTMs combine contextualized embeddings (e.g., BERT) with topic models to get coherent topics. Published at EACL and ACL 2021 (Bianchi et al.).
MIT License

ctm.save crashed when training_dataset is somewhat large #110

Closed elchorro closed 1 year ago

elchorro commented 2 years ago

I noticed that the ctm.save() method tries to save the training dataset (800k items in my case). This, however, causes a crash on my machine.

I was able to resolve the problem by deleting the reference to train_data in ctm.save and then modifying the ctm.load method so that a dataset is passed in.

In any case, it seems that storing the training dataset (except for id2token) may not be desirable in use cases where one wants to load a model to predict topics for unseen documents or to continue training on a different dataset.
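
For illustration, here is a minimal sketch of the general pattern behind that workaround: drop the dataset reference before serializing, and let the caller optionally pass a dataset when loading. It does not use the library's actual code. `ToyModel`, its `weights` attribute, and the pickle-based file format are all hypothetical stand-ins; only the names `train_data` and `id2token` come from the discussion above.

```python
import pickle
import tempfile
from pathlib import Path


class ToyModel:
    """Hypothetical stand-in for a topic model; NOT the library's CTM class."""

    def __init__(self, weights, id2token, train_data=None):
        self.weights = weights        # assumed placeholder for learned parameters
        self.id2token = id2token      # small vocabulary mapping, worth keeping
        self.train_data = train_data  # potentially very large, not needed at inference

    def save(self, path):
        # Copy the attribute dict and drop the dataset reference before serializing,
        # so the saved file stays small regardless of the training set size.
        state = dict(self.__dict__)
        state.pop("train_data", None)
        with open(path, "wb") as f:
            pickle.dump(state, f)

    @classmethod
    def load(cls, path, train_data=None):
        # The caller may supply a dataset (e.g. to continue training) or omit it.
        with open(path, "rb") as f:
            state = pickle.load(f)
        return cls(train_data=train_data, **state)


# Usage: save a model that holds a "large" dataset, then reload it without one.
model = ToyModel(
    weights=[0.1, 0.9],
    id2token={0: "topic", 1: "model"},
    train_data=list(range(1000)),
)
path = Path(tempfile.mkdtemp()) / "toy_model.pkl"
model.save(path)

restored = ToyModel.load(path)
new_data = ["some", "other", "documents"]
restored_with_data = ToyModel.load(path, train_data=new_data)
```

Because `save` works on a copy of the attribute dict, the in-memory model keeps its `train_data` after saving; only the file on disk omits it.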

vinid commented 2 years ago

Thanks!

Let me label this as a bug.

It might make sense to remove the dataset in a future version of the model.