MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License
6.21k stars 767 forks

How to adjust parameters for plotting topic over time like real time distribution❓ #555

Open pariskang opened 2 years ago

pariskang commented 2 years ago

Dear MaartenGr: I think you did great work on BERTopic, especially in terms of convenience and visualization. I noticed something interesting: when plotting topics over time, the distribution over years does not match the real distribution. For example, I selected short text files ranging from 2000 to 2021. I changed the 'global_tuning' and 'evolution_tuning' parameters from True to False, but the topics-over-time distribution is always missing documents from 2021. The time distribution of the text files is nearly uniform. I would sincerely appreciate your help. My code follows.

from bertopic import BERTopic
from umap import UMAP
from sklearn.feature_extraction.text import CountVectorizer

vectorizer_model = CountVectorizer(stop_words="english")
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', random_state=42)
topic_model = BERTopic(umap_model=umap_model, vectorizer_model=vectorizer_model, top_n_words=30, min_topic_size=35, calculate_probabilities=True, verbose=True)
topics, probs = topic_model.fit_transform(tweets)

topics_over_time = topic_model.topics_over_time(docs=tweets, topics=topics, timestamps=timestamps, global_tuning=True, evolution_tuning=False, nr_bins=10)

image

MaartenGr commented 2 years ago

From your code, I think the culprit here is nr_bins=10. Your documents range from 2000 to 2021, a span of 21 years (22 distinct calendar years, counting both endpoints), so binning them into 10 bins will merge some of the years together. If you want a yearly perspective, make sure that the value of nr_bins matches the number of years you want visualized, in this case 21 (or 22 if every calendar year should get its own bin).
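To see why 10 bins hide individual years, here is a minimal sketch that assumes equal-width bins over the timestamp range. The helper `years_per_bin` is hypothetical and not part of BERTopic, and BERTopic's own binning may place the bin edges slightly differently.

```python
def years_per_bin(years, nr_bins):
    """Split the range of `years` into nr_bins equal-width bins and
    return the years that fall into each non-empty bin.

    Assumes integer years with at least two distinct values.
    """
    lo, hi = min(years), max(years)
    bins = {}
    for year in sorted(set(years)):
        # Integer arithmetic avoids floating-point edge cases; the
        # maximum year is clamped into the last bin.
        index = min((year - lo) * nr_bins // (hi - lo), nr_bins - 1)
        bins.setdefault(index, []).append(year)
    return [bins[i] for i in sorted(bins)]


years = list(range(2000, 2022))  # 2000..2021 inclusive, 22 distinct years
for nr_bins in (10, 22):
    print(nr_bins, years_per_bin(years, nr_bins))
```

Under this assumption, nr_bins=10 puts 2021 in the same bin as 2019 and 2020, so it never appears as its own point on the time axis, while 22 bins give every calendar year its own bin.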

Another way to approach this is to forgo setting nr_bins altogether and instead make sure that set(timestamps) contains only years. Your timestamps would then look a bit like this:

timestamps = [2009, 2009, 2010, 2010, 2011, 2012, ..., 2021]

That way, timestamps contains only the years, so nr_bins does not have to be defined.
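If the original timestamps are full dates, they can be reduced to years before being passed along. A minimal sketch, assuming the dates are strings in `YYYY-MM-DD` format; the helper `to_years` and the sample values are hypothetical:

```python
from datetime import datetime


def to_years(raw_timestamps, fmt="%Y-%m-%d"):
    """Parse date strings and keep only the year of each one."""
    return [datetime.strptime(ts, fmt).year for ts in raw_timestamps]


# Hypothetical sample dates, one per document.
raw_timestamps = ["2009-03-14", "2009-11-02", "2010-06-30", "2021-01-05"]
timestamps = to_years(raw_timestamps)
print(sorted(set(timestamps)))
```

The resulting list has one year per document and can be used as the timestamps argument in place of the full dates, with nr_bins left unset.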

pariskang commented 2 years ago

Thank you for your kind reply. I set nr_bins=21 and that solved it. May I suggest using an area map to visualize the topics over time (DTM), because the uncertain trends may not be easy to understand in the current plot? The following was made with Tableau to illustrate it.

image

MaartenGr commented 2 years ago

@pariskang Thank you for the suggestion. BERTopic often creates and visualizes quite a number of topics with different kinds of distributions. I have tested the area map you describe before but felt that the plot would become too busy. Having said that, it would definitely not hurt to experiment with it a little to see if we can improve it. I will put it on the list!
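One way to keep such a plot from getting too busy is to stack only a handful of selected topics. A minimal sketch of the data preparation, assuming the per-topic frequencies are available as a list of (timestamp, topic, frequency) rows; the helper `stacked_area_layers` and the sample rows are hypothetical and not part of BERTopic:

```python
from collections import defaultdict


def stacked_area_layers(rows, topics):
    """Compute the upper boundary of each topic's band in a stacked area chart.

    rows:   list of (timestamp, topic, frequency) tuples
    topics: the topics to include, in bottom-to-top stacking order
    """
    timestamps = sorted({ts for ts, _, _ in rows})
    freq = defaultdict(int)
    for ts, topic, count in rows:
        freq[(ts, topic)] += count

    running = [0] * len(timestamps)
    layers = {}
    for topic in topics:
        # Each band sits on top of the cumulative total of the bands below it.
        running = [base + freq[(ts, topic)] for base, ts in zip(running, timestamps)]
        layers[topic] = running
    return timestamps, layers


# Hypothetical frequencies for two topics over two years.
rows = [(2020, 0, 5), (2020, 1, 3), (2021, 0, 2), (2021, 1, 4)]
timestamps, layers = stacked_area_layers(rows, topics=[0, 1])
print(timestamps, layers)
```

Each list in `layers` can be drawn as a filled line with any plotting library; limiting `topics` to a few entries keeps the chart readable.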

pariskang commented 2 years ago

Sorry for the late reply. I very much hope an area map for showing specific topics will be added. Thank you! If there is anything I can help with, please message me.