MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

How to use over time visualization with another topic model? #457

Closed mertozlutiras closed 2 years ago

mertozlutiras commented 2 years ago

Hi Maarten,

I'm a master's thesis student working on Dynamic Topic Modelling on a large technical corpus.

I tried BERTopic on my corpus, but it takes too long to train and I don't get good results. My corpus works faster and better with contextualized-topic-embeddings (milaNLP). Nevertheless, I really love your visualization. I couldn't find how to implement your visualization with another topic model though. Could you recommend any resources that would help me integrate the visualizations?

On a second note, I also didn't understand how to match different topics occurring at different timestamps. Let's say I generate topics at timepoint 0 and then generate them again at timepoint 1. How would we know that topics at different timepoints actually refer to the same topic? I'd also appreciate it if you could recommend any resources on this.

Sorry if GitHub is not the correct platform for these questions.

Thanks for the help, Mert

MaartenGr commented 2 years ago

I tried BERTopic on my corpus, but it takes too long to train and I don't get good results. My corpus works faster and better with contextualized-topic-embeddings (milaNLP).

Could you share what you have tried and what exactly did not work out for you? Perhaps I can help you speed up the process and improve upon the results.

I couldn't find how to implement your visualization with another topic model though. Could you recommend any resources that would help me integrate the visualizations?

The visualizations are not meant to be easily implemented with other topic modeling techniques, as that would be out of the scope of BERTopic. Also, and I might be mistaken here, the topic modeling technique that you are referring to does not support dynamic topic modeling, right? If so, then implementing the visualization would not be possible.

On a second note, I also didn't understand how to match different topics occurring at different timestamps. Let's say I generate topics at timepoint 0 and then generate them again at timepoint 1. How would we know that topics at different timepoints actually refer to the same topic?

In general, there is no single way of matching topics at different timestamps, and doing so is typically not easy after creating a global representation of topics. In other words, a topic model should typically model the dynamic topic modeling process during training and not after training. LDAseq, for example, does it quite differently from BERTopic, whilst traditional models are not capable of doing this.

This also means that it is often not advised to generate topics at timepoint 0 and then again generate them at timepoint 1. You risk generating completely different and often inaccurate topics that are quite difficult to match.
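
A minimal sketch of why fitting once sidesteps the matching problem: if topics are assigned a single time over the whole corpus, every timestamp shares the same topic ids, so tracking a topic over time reduces to counting its documents per timestamp. The function name `topic_frequency_over_time` and the toy data are hypothetical, and this is only an illustration of the idea, not BERTopic's implementation.

```python
from collections import Counter, defaultdict


def topic_frequency_over_time(topics, timestamps):
    """Count documents per topic within each timestamp.

    `topics[i]` is the globally assigned topic id of document i and
    `timestamps[i]` is that document's timestamp.
    """
    if len(topics) != len(timestamps):
        raise ValueError("topics and timestamps must have the same length")
    counts = defaultdict(Counter)
    for topic, timestamp in zip(topics, timestamps):
        counts[timestamp][topic] += 1
    # Sort by timestamp so the result reads chronologically.
    return {ts: dict(counts[ts]) for ts in sorted(counts)}


# Toy data (hypothetical): topic ids come from ONE model fitted on all documents.
topics = [0, 0, 1, 1, 1, -1, 0, 1]
timestamps = [2019, 2019, 2019, 2020, 2020, 2020, 2021, 2021]
frequencies = topic_frequency_over_time(topics, timestamps)
print(frequencies)
```

Because topic 1 in 2019 and topic 1 in 2021 come from the same fitted model, they are the same topic by construction, which is not the case when a separate model is fitted per timepoint.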

mertozlutiras commented 2 years ago

I tried BERTopic on my corpus, but it takes too long to train and I don't get good results. My corpus works faster and better with contextualized-topic-embeddings (milaNLP).

Could you share what you have tried and what exactly did not work out for you? Perhaps I can help you speed up the process and improve upon the results.

I have two multilingual corpora. One of them is quite big, around 2 GB. Training multilingual BERTopic on the large corpus took very long even on an AWS GPU machine, so I cancelled the training and tried it only on the small corpus.

I ran BERTopic with the following configuration: topic_model = BERTopic(min_topic_size=25, verbose=True, language="multilingual", embedding_model="distiluse-base-multilingual-cased")

It produces only a few topics compared to milaNLP. I have around 1300 documents. BERTopic assigns 400 of them to group -1 and spreads the rest across 4 topics. I was expecting around 20 topics.

I couldn't find how to implement your visualization with another topic model though. Could you recommend any resources that would help me integrate the visualizations?

The visualizations are not meant to be easily implemented with other topic modeling techniques, as that would be out of the scope of BERTopic. Also, and I might be mistaken here, the topic modeling technique that you are referring to does not support dynamic topic modeling, right? If so, then implementing the visualization would not be possible.

No, it doesn't support dynamic topic modeling.

On a second note, I also didn't understand how to match different topics occurring at different timestamps. Let's say I generate topics at timepoint 0 and then generate them again at timepoint 1. How would we know that topics at different timepoints actually refer to the same topic?

In general, there is no single way of matching topics at different timestamps, and doing so is typically not easy after creating a global representation of topics. In other words, a topic model should typically model the dynamic topic modeling process during training and not after training. LDAseq, for example, does it quite differently from BERTopic, whilst traditional models are not capable of doing this.

Since the topic model I wanted to use doesn't support dynamic topic modeling, I was thinking of running a static topic model at different timestamps and finding a method to match topics from different timestamps that point to the same topic. I guess this won't be possible.

This also means that it is often not advised to generate topics at timepoint 0 and then again generate them at timepoint 1. You risk generating completely different and often inaccurate topics that are quite difficult to match.

I got it now, thanks.

Thank you for your time and help.

MaartenGr commented 2 years ago

I ran BERTopic with the following configuration: topic_model = BERTopic(min_topic_size=25, verbose=True, language="multilingual", embedding_model="distiluse-base-multilingual-cased")

I would advise using paraphrase-multilingual-MiniLM-L12-v2 instead of the distiluse model, as it is a fair bit faster and likely to be more accurate.

It produces only a few topics compared to milaNLP. I have around 1300 documents. BERTopic assigns 400 of them to group -1 and spreads the rest across 4 topics. I was expecting around 20 topics.

If you set min_topic_size a bit lower, for example to 10, that will likely produce many more topics.
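
Putting the two suggestions together, a sketch of the adjusted configuration could look like the following. The helper names `build_bertopic_kwargs` and `fit_topic_model` are hypothetical; the keyword arguments mirror the call quoted above with only `min_topic_size` and `embedding_model` changed. The `bertopic` import is done lazily so the configuration can be inspected without the package installed.

```python
def build_bertopic_kwargs(min_topic_size=10):
    """Collect the constructor arguments suggested in this thread."""
    return {
        "min_topic_size": min_topic_size,  # lowered from 25 to allow smaller topics
        "verbose": True,
        "language": "multilingual",
        "embedding_model": "paraphrase-multilingual-MiniLM-L12-v2",
    }


def fit_topic_model(docs, **overrides):
    """Fit BERTopic on `docs`; requires the `bertopic` package."""
    from bertopic import BERTopic  # imported lazily: heavy dependency

    kwargs = {**build_bertopic_kwargs(), **overrides}
    topic_model = BERTopic(**kwargs)
    topics, _probs = topic_model.fit_transform(docs)
    return topic_model, topics


kwargs = build_bertopic_kwargs()
print(kwargs)
```

With a list of strings `docs`, `fit_topic_model(docs)` would return the fitted model and the per-document topic ids; how many topics come out still depends on the corpus.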

Since the topic model I wanted to use doesn't support dynamic topic modeling, I was thinking of running a static topic model at different timestamps and finding a method to match topics from different timestamps that point to the same topic. I guess this won't be possible.

There are some tricks you can try to find similar topics, but you cannot be sure that they are exactly the same. For example, by comparing the topic-word distributions of topics at different timestamps, you can try to find topics that are similar. However, I am not entirely sure you can call it dynamic topic modeling in that case.
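
One way to sketch the comparison of topic-word distributions is cosine similarity with a greedy best match. This assumes each topic is available as a word-to-weight mapping; the names `cosine_similarity` and `match_topics`, the 0.5 threshold, and the toy distributions are all hypothetical choices for illustration, and other similarity measures would work as well.

```python
import math


def cosine_similarity(dist_a, dist_b):
    """Cosine similarity between two sparse word->weight mappings."""
    shared = set(dist_a) & set(dist_b)
    dot = sum(dist_a[w] * dist_b[w] for w in shared)
    norm_a = math.sqrt(sum(v * v for v in dist_a.values()))
    norm_b = math.sqrt(sum(v * v for v in dist_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_topics(topics_t0, topics_t1, threshold=0.5):
    """For each topic at t0, find the most similar topic at t1.

    Returns {t0_id: (t1_id, similarity)}; t1_id is None when no candidate
    reaches `threshold`. Several t0 topics may map to the same t1 topic.
    """
    matches = {}
    for id0, dist0 in topics_t0.items():
        best_id, best_sim = None, 0.0
        for id1, dist1 in topics_t1.items():
            sim = cosine_similarity(dist0, dist1)
            if sim > best_sim:
                best_id, best_sim = id1, sim
        if best_sim < threshold:
            best_id = None  # treat weak matches as "no counterpart"
        matches[id0] = (best_id, best_sim)
    return matches


# Toy topic-word distributions (hypothetical) from two separately fitted models.
topics_t0 = {
    0: {"battery": 0.5, "charge": 0.3, "voltage": 0.2},
    1: {"engine": 0.6, "fuel": 0.4},
    2: {"software": 1.0},
}
topics_t1 = {
    0: {"engine": 0.5, "fuel": 0.3, "emission": 0.2},
    1: {"battery": 0.4, "charge": 0.4, "cell": 0.2},
}
matches = match_topics(topics_t0, topics_t1)
print(matches)
```

The result is only a similarity-based guess: a high score suggests two topics are related, not that they are the same topic, and a topic with no counterpart above the threshold stays unmatched.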

I got it now, thanks.

Thank you for your time and help.

No problem! Let me know if you have any other questions.