Closed: lukasmackin closed this issue 1 year ago
I am not entirely sure, but this might be a side effect of using online topic modeling here. The clustering algorithm might have incorrectly assigned some documents to a topic during the early rounds of partial fitting, which explains why the assignment suddenly changes. It might be worthwhile to use .transform to reassign the documents.
Also, it might be worthwhile to use .fit instead of .partial_fit. The results generally improve with .fit and seeing as you do not have that much data, I do not think there is necessarily much benefit to using .partial_fit over .fit.
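For reference, a minimal sketch (not code from this thread) of what fitting once on the full dataset could look like, where docs is assumed to stand for the complete list of documents:

from bertopic import BERTopic

# Fit a single model on all documents at once instead of calling .partial_fit per batch
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)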
I am not entirely sure, but this might be a side effect of using online topic modeling here. The clustering algorithm might have incorrectly assigned some documents to a topic during the early rounds of partial fitting, which explains why the assignment suddenly changes. It might be worthwhile to use .transform to reassign the documents.
I tried following this suggestion. I amended my original code (shown below) to incorporate .transform, but it seems that the issue still persists. Did I incorporate your guidance incorrectly here?
Note: I updated my River class with a predict function as discussed in this issue: #1297
import pandas as pd
from transformers import BertForSequenceClassification
from umap import UMAP
from river import cluster
from bertopic import BERTopic
from bertopic.vectorizers import ClassTfidfTransformer, OnlineCountVectorizer
# River below refers to the custom wrapper class (with a predict method) from issue #1297

data = pd.read_csv("/tr/proj15/txt_factors/Topic Linkage Experiments/Pull Reuters Data/Output/2020_textual_data.csv")
#Separate text data to generate topics over time
whole_text_data = data["body"]
whole_text_data = whole_text_data.replace('\n', ' ')
whole_text_data.reset_index(inplace = True, drop = True)
date_col = data["month_date"].to_list()
unique_dates_df = data.drop_duplicates(subset=['month_date'])
timestamps = unique_dates_df["month_date"].to_list()
#Set up parameters for Bertopic model
model = BertForSequenceClassification.from_pretrained('ProsusAI/finbert')
cluster_model = River(cluster.DBSTREAM())
vectorizer_model = OnlineCountVectorizer(stop_words="english", ngram_range=(1,4))
ctfidf_model = ClassTfidfTransformer(reduce_frequent_words=True, bm25_weighting=True)
umap_model = UMAP(n_neighbors=25,
                  n_components=10,
                  metric='cosine')
topic_model = BERTopic(
    umap_model = umap_model,
    hdbscan_model=cluster_model,
    vectorizer_model=vectorizer_model,
    ctfidf_model=ctfidf_model,
    nr_topics = "auto"
)
#Incrementally learn
topics = []
for month in timestamps:
    month_df = data.loc[data['month_date'] == month]
    text_data = month_df["body"]
    text_data = text_data.replace('\n', ' ')
    text_data.reset_index(inplace = True, drop = True)
    topic_model.partial_fit(text_data)
    topics.extend(topic_model.topics_)
topic_model.topics_ = topics
topic_model.transform(whole_text_data)
topics_over_time = topic_model.topics_over_time(whole_text_data, date_col,
                                                datetime_format="%Y-%m",
                                                global_tuning = True,
                                                evolution_tuning = True)
topics_over_time.to_csv('2020_topics_over_time.csv', index = False)
Also, it might be worthwhile to use .fit instead of .partial_fit. The results generally improve with .fit and seeing as you do not have that much data, I do not think there is necessarily much benefit to using .partial_fit over .fit.
I completely agree that using .fit makes much more sense with this smaller dataset. However, I'm just experimenting with this smaller set of data before I run this procedure on a much larger dataset. I think it will eventually be necessary to use some form of online topic modeling. In light of this, I may experiment with using the new .merge_models function in BERTopic as another method of online topic modeling, especially if results are better using .fit rather than .partial_fit. Is this something you would recommend?
topic_model.transform(whole_text_data)
When you use .transform it gives you output, namely topics and probs. You would still have to assign them to the internal topics:
topics, probs = topic_model.transform(whole_text_data)
topic_model.topics_ = topics
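In terms of the loop posted above, that would mean replacing the bare topic_model.transform(whole_text_data) call with this assignment before computing topics over time, roughly (a sketch based on the code earlier in this thread):

topics, probs = topic_model.transform(whole_text_data)
topic_model.topics_ = topics
topics_over_time = topic_model.topics_over_time(whole_text_data, date_col,
                                                datetime_format="%Y-%m",
                                                global_tuning = True,
                                                evolution_tuning = True)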
I completely agree that using .fit makes much more sense with this smaller dataset. However, I'm just experimenting with this smaller set of data before I run this procedure on a much larger dataset. I think it will eventually be necessary to use some form of online topic modeling. In light of this, I may experiment with using the new .merge_models function in BERTopic as another method of online topic modeling, especially if results are better using .fit rather than .partial_fit. Is this something you would recommend?
I would definitely recommend .fit over .partial_fit and using the .merge_models function here. .partial_fit is generally for cases where data is too large to train on and not really for dynamic topic modeling as you did in your example.
Also, note that you are using UMAP here, which does not support .partial_fit. This means that UMAP will be fitted during your very first .partial_fit and will not learn any more in subsequent .partial_fit calls. It is generally not advised to use UMAP with relatively small increments unless it was already trained on a larger dataset.
Lastly, it might also be worthwhile to select a model that is optimized for generating embeddings, which I believe is not the case with finbert.
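For illustration, a rough sketch of the .fit plus .merge_models approach, reusing the data and timestamps variables from the code earlier in this thread; the sentence-transformers model name is only an example of an embedding-optimized model, not a specific recommendation:

from bertopic import BERTopic
from sentence_transformers import SentenceTransformer

# Example of a model optimized for generating embeddings (swap in whatever suits your data)
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")

# Train a separate model per month with .fit, then merge them afterwards
monthly_models = []
for month in timestamps:
    docs = data.loc[data["month_date"] == month, "body"].tolist()
    monthly_models.append(BERTopic(embedding_model=embedding_model).fit(docs))

# .merge_models combines the per-month models: similar topics are merged,
# sufficiently different ones are added as new topics
merged_model = BERTopic.merge_models(monthly_models)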
When you use .transform it gives you output, namely topics and probs. You would still have to assign them to the internal topics.
Ah right, this is my mistake. Thanks! This suggestion did seem to correct the issue where topic labels represented different topics over time.
I would definitely recommend .fit over .partial_fit and using the .merge_models function here. .partial_fit is generally for cases where data is too large to train on and not really for dynamic topic modeling as you did in your example. Also, note that you are using UMAP here, which does not support .partial_fit. This means that UMAP will be fitted during your very first .partial_fit and will not learn any more in subsequent .partial_fit calls. It is generally not advised to use UMAP with relatively small increments unless it was already trained on a larger dataset. Lastly, it might also be worthwhile to select a model that is optimized for generating embeddings, which I believe is not the case with finbert.
Thanks so much for these suggestions! I will certainly attempt to use .merge_models and incorporate the other guidance you laid out above.
I really appreciate all the thoughtful responses you've given on this thread. You've done an amazing job in building this package and providing support for it!
Hello @MaartenGr, I've been loving this package so far! It's been extremely useful.
I have an inquiry regarding unexpected behavior in the output from topics_over_time(). I've included code and output below but I will briefly contextualize the problem in words. I am using textual data from the Reuters Newswire from the year 2020. I use online topic modeling and monthly batches of the data to update my topic model. After this, I run topics_over_time() on the entire sample and use the months as my timestamps. All this works well. However, some of the same labels in topics_over_time() seem to represent vastly different topics at different points in time (the images below focus on label 18 as an example). It was my understanding that the label should represent the same overall topic over time, with the keywords changing based on how the corpus discusses the topic. However, the topic entirely shifts from the Iran nuclear deal to COVID-19.
Is there a way to prevent this from happening? It seems likely I've made some error in logic in my code (which I've included below).
Thanks so much in advance!