MilaNLProc / contextualized-topic-models

A python package to run contextualized topic modeling. CTMs combine contextualized embeddings (e.g., BERT) with topic models to get coherent topics. Published at EACL and ACL 2021 (Bianchi et al.).
MIT License

More time spent finding a smaller number of topics #140

Closed mitramir55 closed 1 year ago

mitramir55 commented 1 year ago

Description

I've been keeping track of computation time while running the Combined Topic Modeling approach on my dataset (about 1 million tweets). I observe that computation time decreases as the number of topics increases, and I'm curious about the reason. Here is a preview of the runs so far:

[Plot: computation time vs. number of topics]

I will also be running the zero-shot approach on monolingual datasets, but for now this is what I'm seeing.

vinid commented 1 year ago

Hello!

How is the computation time measured here?

mitramir55 commented 1 year ago

Hi, here is the code:

import time

from contextualized_topic_models.models.ctm import CombinedTM
from contextualized_topic_models.utils.data_preparation import TopicModelDataPreparation

# Prepare CTM data: contextualized embeddings from SBERT plus a BoW representation
qt_ctm = TopicModelDataPreparation("all-mpnet-base-v2")
training_dataset_ctm = qt_ctm.fit(
    text_for_contextual=docs, text_for_bow=preprocessed_documents
)

# Training parameters
params = {
    "n_components": k_topics,
    "contextual_size": 768,
    "bow_size": len(qt_ctm.vocab),
}

ctm = CombinedTM(**params)

# Time the fit call
start = time.time()
ctm.fit(training_dataset_ctm)
end = time.time()
computation_time = float(end - start)

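(Aside: `time.perf_counter()` is a more robust choice than `time.time()` for measuring elapsed wall-clock time, since it is monotonic and unaffected by system clock adjustments. A minimal sketch, with a `sleep` standing in for the training call:)

```python
import time

start = time.perf_counter()
time.sleep(0.1)  # stand-in for ctm.fit(training_dataset_ctm)
elapsed = time.perf_counter() - start
print(f"elapsed: {elapsed:.2f} s")
```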
vinid commented 1 year ago

I am not able to reproduce this on one of our datasets: https://colab.research.google.com/drive/1-lv5aUWpW4ToJoU2AoODrftmOccWhjFS?usp=sharing

[Screenshot: Colab notebook timing results, 2023-07-21]

mitramir55 commented 1 year ago

My dataset consists of more than a million tweets; could the number of records have anything to do with it?

vinid commented 1 year ago

It shouldn't, since training is still done in mini-batches, and the batch size is what determines the per-step cost.

Unless something else is slowing down the process.

How long does it take to go through the entire dataset? (I see 1 in your plot, but I'm not sure whether that's hours or days.)
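For context on the batching point above: with mini-batch training, per-epoch cost scales with the number of batches, which depends on dataset size and batch size but not on `n_components`. A minimal sketch (the batch size of 64 and the 20k-document comparison set are assumptions for illustration):

```python
import math

def batches_per_epoch(n_docs: int, batch_size: int) -> int:
    """Number of mini-batches the model processes in one epoch."""
    return math.ceil(n_docs / batch_size)

# ~1M tweets vs. a smaller dataset: per-epoch work grows linearly
# with document count, independently of the number of topics.
print(batches_per_epoch(1_000_000, 64))  # 15625
print(batches_per_epoch(20_000, 64))     # 313
```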