MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

Online topic modeling vs dynamic topic modeling #1592

Open yujing-syj opened 11 months ago

yujing-syj commented 11 months ago

Hi @MaartenGr,

I am trying to apply BERTopic to a large collection of transcripts and hope to get some advice from you. Thanks!

I am working on data related to financial announcements. There are approximately 15 million sentences every year, and I want to know which topics are covered in the transcripts over time. First, I plan to use several years of data (at least 2~3 years) to build a base BERTopic model and check the overall topics. We use a long period here because we want more granular and stable topics that we can treat as a baseline. At the same time, I want to update the model every quarter as new transcripts come in. For example, if we train the base model on data from 2018 to March 2020, we hope the model will automatically add topics such as "covid" (Covid-related topics started to appear in our transcripts) when we feed in the 2020Q2 data to update it. We also want to keep the old topics so that we can track their trends over time.

I noticed that there are several tools I could use, such as dynamic topic modeling, online topic modeling, or manual topic modeling(?). If we estimate the model over time, how do we identify a rising theme without losing existing themes due to model randomness? Do you have any suggestions about which approach is more suitable in this case?

Another question is about online topic modeling vs dynamic topic modeling. If I use dynamic topic modeling here, will it only give me the change in topic representation (keywords) over time, rather than showing new topics emerging at the current time point? If I use online topic modeling, does it make sense to feed in a different amount of data in each loop? How do we make sure that topics from the base model (Tn) stay relatively stable?

Thanks again for your help!

MaartenGr commented 11 months ago

If we estimate the model over time, how do we identify a rising theme without losing existing themes due to model randomness? How do we make sure that topics from the base model (Tn) stay relatively stable?

Model randomness can be circumvented by setting a fixed random_state in UMAP, which makes the dimensionality reduction step reproducible across runs on the same data.
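A minimal sketch of fixing the seed, assuming the umap-learn and bertopic packages are installed. The UMAP hyperparameter values and the seed 42 are illustrative choices, not values given in this thread.

```python
from umap import UMAP
from bertopic import BERTopic

# Illustrative UMAP settings (assumed, not from this thread); the relevant
# part is random_state, which fixes the seed of the reduction step.
umap_model = UMAP(
    n_neighbors=15,
    n_components=5,
    min_dist=0.0,
    metric="cosine",
    random_state=42,
)

# Pass the seeded UMAP instance to BERTopic instead of the default one.
topic_model = BERTopic(umap_model=umap_model)
```

A fixed seed only makes repeated runs on the same data reproducible. It does not by itself keep topics identical once new documents are added.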

Do you have any suggestions about which way is more suitable in this case?

If you want to consistently get new topics, then online topic modeling would be more appropriate, especially if you expect new data to keep coming in. However, if you already have all of the data, then dynamic topic modeling seems more appropriate, since you can train a single model and track its topics over time.

If I use dynamic topic modeling here, will it only give me the change in topic representation (keywords) over time, rather than showing new topics emerging at the current time point?

It is meant to show the change in topic representation, but it will also show the distribution of topics over time. That distribution also indicates whether certain topics have emerged.
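To illustrate how a distribution over time can reveal an emerging topic, here is a small library-free sketch. It assumes you already have one period label and one topic ID per document; the helper names, the toy data, and the min_docs threshold are hypothetical and not part of BERTopic.

```python
from collections import Counter, defaultdict


def topic_counts_by_period(periods, topics):
    """Count documents per topic within each period."""
    counts = defaultdict(Counter)
    for period, topic in zip(periods, topics):
        counts[period][topic] += 1
    return counts


def emerging_topics(counts, baseline_periods, new_period, min_docs=2):
    """Topics with at least min_docs documents in new_period that
    never occur in any of the baseline periods."""
    seen = set()
    for period in baseline_periods:
        seen.update(counts[period])
    return sorted(
        topic
        for topic, n_docs in counts[new_period].items()
        if n_docs >= min_docs and topic not in seen
    )


# Toy data: quarter and assigned topic ID for seven documents.
periods = ["2019Q4", "2019Q4", "2020Q1", "2020Q1", "2020Q2", "2020Q2", "2020Q2"]
topics = [0, 1, 0, 1, 0, 2, 2]

counts = topic_counts_by_period(periods, topics)
print(emerging_topics(counts, ["2019Q4", "2020Q1"], "2020Q2"))  # prints [2]
```

Topic 2 is flagged because it has no documents before 2020Q2, which is the pattern a new theme such as "covid" would produce.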

If I use online topic modeling, does it make sense to feed in a different amount of data in each loop?

Not necessarily. Generally, I would advise making each batch large enough to be learned from, rather than passing documents one at a time.
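One way to follow this advice is to buffer incoming documents until a minimum batch size is reached before each update. The sketch below assumes only that the model exposes a partial_fit(documents) method, as BERTopic does for online topic modeling. The feed_in_batches helper, the RecordingModel stand-in, and the threshold are hypothetical names used for illustration.

```python
def feed_in_batches(model, batches, min_batch_size):
    """Accumulate incoming batches and call model.partial_fit only once
    the buffer holds at least min_batch_size documents.

    Returns the number of partial_fit calls made.
    """
    buffer = []
    n_updates = 0
    for batch in batches:
        buffer.extend(batch)
        if len(buffer) >= min_batch_size:
            model.partial_fit(buffer)
            n_updates += 1
            buffer = []
    if buffer:
        # Leftover documents are still passed on so none are dropped,
        # even though this last update may be smaller than the threshold.
        model.partial_fit(buffer)
        n_updates += 1
    return n_updates


class RecordingModel:
    """Stand-in for a topic model; it only records the batch sizes it sees."""

    def __init__(self):
        self.batch_sizes = []

    def partial_fit(self, documents):
        self.batch_sizes.append(len(documents))


model = RecordingModel()
incoming = [["a"] * 3, ["b"] * 2, ["c"] * 6, ["d"] * 1]
n_updates = feed_in_batches(model, incoming, min_batch_size=5)
print(n_updates, model.batch_sizes)  # prints 3 [5, 6, 1]
```

With BERTopic, the stand-in would be replaced by a model built from components that support incremental updates, and the threshold would be far larger than in this toy example.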