MilaNLProc / contextualized-topic-models

A Python package to run contextualized topic modeling. CTMs combine contextualized embeddings (e.g., BERT) with topic models to get coherent topics. Published at EACL and ACL 2021 (Bianchi et al.).
MIT License
1.2k stars 145 forks

Using DeCLUTR text embeddings in CTM #86

Closed kingafy closed 3 years ago

kingafy commented 3 years ago

I have trained DeCLUTR on a legal dataset and I am trying to use it to extract topics with CTM. Currently the process runs indefinitely without showing any output. Any idea why CTM does not work here? Is there any limitation on the models that can be used? Can't we use our own models with this framework?

vinid commented 3 years ago

Hello!

Can you share some more details about which dataset you are using, or send a sample?

Also, if you send me the code I can take a look! :)

Are you using a GPU?

kingafy commented 3 years ago

The dataset is confidential. The same data gets processed properly with the default model paraphrase-distilroberta-base-v1. The model I trained is stored locally and I am loading it from a local directory. It gets stuck in the code below:

```python
tp = TopicModelDataPreparation("local folder")
training_dataset = tp.fit(
    text_for_contextual=unpreprocessed_corpus,
    text_for_bow=preprocessed_documents,
)
```

vinid commented 3 years ago

Got it. I think the issue is that we do not directly support DeCLUTR embeddings (we use sentence-transformers). However, there is a simple way to bypass this: manually compute the embeddings yourself and pass them to CTM.
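A rough sketch of the workaround. The mean-pooling step below uses random-shaped NumPy arrays as a stand-in for real DeCLUTR token embeddings (in practice you would get those from `transformers`), and the `custom_embeddings` parameter of `TopicModelDataPreparation.fit` shown in the comment is my reading of the CTM API; double-check the parameter name against the version you have installed.

```python
import numpy as np

# Stand-in for DeCLUTR: in practice, load your fine-tuned model with
# transformers (AutoTokenizer / AutoModel) and mean-pool the last hidden
# states to get one fixed-size vector per document.
def mean_pool(token_embeddings, attention_mask):
    """Average token vectors per document, ignoring padding positions."""
    mask = attention_mask[..., None].astype(float)   # (batch, seq, 1)
    summed = (token_embeddings * mask).sum(axis=1)   # (batch, dim)
    counts = mask.sum(axis=1)                        # (batch, 1)
    return summed / counts

# Toy batch: 2 documents, 4 token slots each, 3-dim token vectors.
tokens = np.arange(24, dtype=float).reshape(2, 4, 3)
mask = np.array([[1, 1, 0, 0],    # doc 0: only 2 real tokens
                 [1, 1, 1, 1]])   # doc 1: 4 real tokens
doc_embeddings = mean_pool(tokens, mask)
print(doc_embeddings.shape)  # (2, 3)

# With one row per document, hand the embeddings to CTM directly instead
# of letting it call sentence-transformers, e.g. (parameter name assumed):
#
#   tp = TopicModelDataPreparation()
#   training_dataset = tp.fit(
#       text_for_contextual=unpreprocessed_corpus,
#       text_for_bow=preprocessed_documents,
#       custom_embeddings=doc_embeddings,
#   )
```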

See this notebook for an example that uses the default DeCLUTR model: https://colab.research.google.com/drive/19LlXY0F_V0zMzR79AUzhV51GdA_njZZF?usp=sharing

It should also work with your own fine-tuned model.

Let me know if and how it works!