pariskang opened this issue 2 years ago
From your code, I think the culprit here is nr_bins=10. Since you have documents that range from 2000 to 2021, which are 21 years, binning them into 10 bins will merge some of the years together. If you want a perspective on a yearly basis, it is advised to make sure that the value of nr_bins matches the number of years that you want visualized, in this case 21.
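For illustration, a minimal sketch of that first option, reusing the objects from the code below and only changing the bin count (the value of 21 is an assumption based on the year range you described):

topics_over_time = topic_model.topics_over_time(docs=tweets,
                                                topics=topics,
                                                timestamps=timestamps,
                                                global_tuning=True,
                                                evolution_tuning=False,
                                                nr_bins=21)  # roughly one bin per year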
Another way to approach this is to forgo setting nr_bins at all and instead make sure that set(timestamps) contains only years. Your timestamps would then look a bit like this:
timestamps = [2009, 2009, 2010, 2010, 2011, 2012, ..., 2021]
That way, we make sure that only years are given in timestamps, so no nr_bins would have to be defined.
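As a rough sketch of this second option, assuming your original timestamps are datetime objects or date strings such as "2021-05-03" (adjust to your actual data):

import pandas as pd

# Keep only the year of each timestamp so every time bin corresponds to one year.
timestamps = pd.to_datetime(timestamps).year.tolist()

topics_over_time = topic_model.topics_over_time(docs=tweets,
                                                topics=topics,
                                                timestamps=timestamps)  # no nr_bins needed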
Thank you for your kind reply. I set nr_bins=21 and that solved it. But may I suggest choosing an area map to visualize the topics over time (DTM)? The uncertain trends may otherwise not be easy to understand. The following was made with Tableau software to display it.
@pariskang Thank you for the suggestion. BERTopic creates, and visualizes, often quite a number of topics with different kinds of distributions. I have tested it before with the area map that you have described but felt like the plot would become too busy. Having said that, it would definitely not hurt to experiment with it a little to see if we can improve it. I would put it on the list!
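In the meantime, here is a rough sketch of the area-map idea, assuming the DataFrame returned by topics_over_time contains Topic, Frequency, and Timestamp columns, and using plotly.express instead of the built-in plot:

import plotly.express as px

# Stacked area chart of topic frequencies per time bin.
fig = px.area(topics_over_time.sort_values("Timestamp"),
              x="Timestamp", y="Frequency", color="Topic",
              title="Topic frequency over time")
fig.show()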
Sorry for the late reply; I am very much hoping for an area map to show specific topics. Thank you! If there is anything I can help with, please let me know.
Dear MaartenGr: I think you did great work with BERTopic, especially regarding its convenience and visualization. I noticed something interesting: when plotting topics over time, the distribution over the years does not look like the real one. For example, I selected short text files ranging from 2000 to 2021. I changed the global_tuning and evolution_tuning parameters from True to False, but the topics-over-time distribution is always missing members in 2021, even though the files are spread nearly evenly over time. I sincerely need your help. The following is my code.
from bertopic import BERTopic
from umap import UMAP
from sklearn.feature_extraction.text import CountVectorizer

vectorizer_model = CountVectorizer(stop_words="english")
umap_model = UMAP(n_neighbors=15, n_components=5,
                  min_dist=0.0, metric='cosine', random_state=42)

topic_model = BERTopic(umap_model=umap_model,
                       vectorizer_model=vectorizer_model,
                       top_n_words=30,
                       min_topic_size=35,
                       calculate_probabilities=True,
                       verbose=True)
topics, probs = topic_model.fit_transform(tweets)

topics_over_time = topic_model.topics_over_time(docs=tweets,
                                                topics=topics,
                                                timestamps=timestamps,
                                                global_tuning=True,
                                                evolution_tuning=False,
                                                nr_bins=10)
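The plotting step itself is not included above; presumably it was something like the following sketch using BERTopic's built-in visualization (top_n_topics is an illustrative choice, not from the original post):

fig = topic_model.visualize_topics_over_time(topics_over_time, top_n_topics=10)
fig.show()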