MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/

Possibility of acquiring topics per class over time #463

Closed: pearlyquek closed this issue 2 years ago

pearlyquek commented 2 years ago

Hi Maarten,

Thank you for the fantastic work - I’m a huge fan!

I’m currently working on a corpus with documents from 2018 to 2021, created by a few authors. Would it be possible to generate topics by class (i.e. the authors), and then over time (i.e. annually) in the same model, or would I have to run different models for each class? Thank you!

Cheers, Pearly

MaartenGr commented 2 years ago

Thank you for the kind words!

That is currently not supported in BERTopic, as slicing on both class and time would leave too little data per slice to create accurate topic representations. With a sufficiently large dataset, however, it should be possible.

For now, if your dataset is large enough, I would indeed advise splitting the data up into either classes or years and then running either topics_per_class or topics_over_time.
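
For illustration, a rough sketch of that split-and-fit approach could look as follows (assuming a dataframe df with hypothetical text, author, and date columns; each author gets its own model):

import pandas as pd
from bertopic import BERTopic

# Hypothetical master dataframe with one row per document
df = pd.read_csv("my_corpus.csv")

for author, group in df.groupby("author"):
    docs = group.text.to_list()
    timestamps = group.date.to_list()

    # Fit a separate model on this author's documents only
    topic_model = BERTopic(verbose=True)
    topics, probs = topic_model.fit_transform(docs)

    # Dynamic topic modeling within this author's slice
    topics_over_time = topic_model.topics_over_time(docs, topics, timestamps, nr_bins=20)
    topic_model.visualize_topics_over_time(topics_over_time, top_n_topics=10)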

pearlyquek commented 2 years ago

Thank you for getting back so promptly, Maarten! Appreciate it :)

pearlyquek commented 2 years ago

Hi Maarten!

Coming back to this topic, I was wondering if the following is possible?

Suppose my preference is to run topics_over_time.

I have a master corpus of about 100k documents. Author A authored ~8k, Author B ~10k, so on and so forth. I have already tuned and fitted the model on the master corpus, and have derived the topic for each doc quite satisfactorily.

Would it be possible to save the fitted model and deploy it on a smaller dataframe with only Author A’s documents? There, I can simply run the visualisations, e.g. topics_over_time. I’d like to avoid fitting separate models for Author A, Author B, etc., because they technically belong to the master corpus that my model was fitted on.

Or would the following alternative work? After saving the topics to my master dataframe, I simply filter by Author and plot the topics over time using matplotlib. The visualisations won’t be as awesome, but it’ll work for my current use case.
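
Roughly, I am imagining something like this (just a sketch, assuming my master dataframe df has hypothetical text, author, date, and topic columns, with date parsed as a datetime):

import matplotlib.pyplot as plt
import pandas as pd

# Keep only Author A's documents from the master dataframe
author_a = df.loc[df.author == "Author A", :]

# Count documents per topic per year and plot the trends
counts = (author_a.groupby([author_a.date.dt.year, "topic"])
                  .size()
                  .unstack(fill_value=0))
counts.plot(kind="line")
plt.xlabel("Year")
plt.ylabel("Number of documents")
plt.title("Topics over time for Author A")
plt.show()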

Thanks again!

MaartenGr commented 2 years ago

I have a master corpus of about 100k documents. Author A authored ~8k, Author B ~10k, so on and so forth. I have already tuned and fitted the model on the master corpus, and have derived the topic for each doc quite satisfactorily.

Interesting use case! With that many documents, from a theoretical perspective, it should be possible to slice on both year and author and still get accurate representations.

I’d like to avoid fitting separate models for Author A, Author B, etc., because they technically belong to the master corpus that my model was fitted on.

I agree. Typically, we want to fit on the entire corpus (perhaps additionally using semi-supervised topic modeling) to get the best representation of the general topics before delving into specific classes.
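
As a side note, the semi-supervised variant would roughly look like this (a minimal sketch; docs is the master corpus from before, and author_labels is a hypothetical list with one integer label per document, using -1 for documents without a label):

from bertopic import BERTopic

# The labels nudge the clustering towards the known author structure
# without fully supervising the topic model
topic_model = BERTopic(verbose=True)
topics, probs = topic_model.fit_transform(docs, y=author_labels)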

Would it be possible to save the fitted model and deploy it on a smaller dataframe with only Author A’s documents?

Yes and no, sort of... There are two scenarios I can think of that might work for you.

Scenario 1

After fitting your topic model on all the data, as you did in your example, we run topics_over_time on only a selection of the data, for example Author A's documents. However, this does not allow you to use visualizations like .visualize_barchart() for only Author A, as the model was trained on all the data.

To do this, let's take the Trump example:

import re
import pandas as pd
from bertopic import BERTopic

# Prepare data
trump = pd.read_csv('https://drive.google.com/uc?export=download&id=1xRKHaP-QwACMydlDnyFPEaFdtskJuBa6')
trump.text = trump.apply(lambda row: re.sub(r"http\S+", "", row.text).lower(), 1)
trump.text = trump.apply(lambda row: " ".join(filter(lambda x:x[0]!="@", row.text.split())), 1)
trump.text = trump.apply(lambda row: " ".join(re.sub("[^a-zA-Z]+", " ", row.text).split()), 1)
trump = trump.loc[(trump.isRetweet == "f") & (trump.text != ""), :]
timestamps = trump.date.to_list()
tweets = trump.text.to_list()

# Train model
topic_model = BERTopic(verbose=True)
topics, probs = topic_model.fit_transform(tweets)

After having trained your model on the entire dataset, we select only a portion of the dataset when computing and visualizing the topics over time. Here, it is the first 5,000 documents, but in your case it would be all documents, topics, and timestamps related to Author A:

topics_over_time = topic_model.topics_over_time(tweets[:5_000], topics[:5_000], timestamps[:5_000], nr_bins=20)
topic_model.visualize_topics_over_time(topics_over_time, top_n_topics=20)

This allows you to use .visualize_topics_over_time() for any individual author, but it does not allow you to use the other visualization methods for each author separately.
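
Translated to your use case, the slicing could look something like this (a sketch assuming authors is a hypothetical list with one author per document, aligned with the docs, topics, and timestamps of the master corpus):

# Keep only the entries belonging to Author A
mask = [author == "Author A" for author in authors]
docs_a = [doc for doc, keep in zip(docs, mask) if keep]
topics_a = [topic for topic, keep in zip(topics, mask) if keep]
timestamps_a = [ts for ts, keep in zip(timestamps, mask) if keep]

# Dynamic topic modeling restricted to Author A, using the globally fitted model
topics_over_time_a = topic_model.topics_over_time(docs_a, topics_a, timestamps_a, nr_bins=20)
topic_model.visualize_topics_over_time(topics_over_time_a, top_n_topics=20)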

Scenario 2

In this scenario, we update BERTopic so that, although it was trained on the entire dataset, its topic representations reflect only a specific subset of the data. In other words, after training the model on all authors, we update the model to contain the topic representations for Author A specifically.

To show how it works, we take the 20 NewsGroups dataset:

from sklearn.datasets import fetch_20newsgroups
from bertopic import BERTopic

# Prepare data
data = fetch_20newsgroups(subset='all',  remove=('headers', 'footers', 'quotes'))
docs = data["data"]
categories = [data["target_names"][category] for category in data["target"]]

# Train model on all data
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)

After training our model on all the data, we are going to select only the documents that belong to the category "sci.electronics". To do so, we first need to prepare our data in a certain format, otherwise it will not be accepted by BERTopic:

import pandas as pd
documents = pd.DataFrame({"Document": docs,
                          "ID": range(len(docs)),
                          "Topic": topics,
                          "Category": categories})

# We slice the data and extract the topics from that subset
# In other words, instead of "sci.electronics", you would select "Author A"
subset = documents.loc[documents.Category == "sci.electronics", :]
subset_labels = sorted(list(subset.Topic.unique()))

Then, we are going to update the internal topic representations to match the subset that we just created:

# First, we group the documents per topic
documents_per_topic = subset.groupby(['Topic'], as_index=False).agg({'Document': ' '.join})

# Then we calculate the c-TF-IDF representation but we do not fit this method 
# as it was already fitted on the entire dataset
topic_model.c_tf_idf, words = topic_model._c_tf_idf(documents_per_topic, fit=False)

# Lastly, we extract the words per topic based on the subset_labels,
# and we update the topic size to correspond with the subset
topic_model.topics = topic_model._extract_words_per_topic(words, labels=subset_labels)
topic_model._create_topic_vectors()
topic_model.topic_names = {key: f"{key}_" + "_".join([word[0] for word in values[:4]])
                           for key, values in topic_model.topics.items()}
topic_model._update_topic_size(subset)

After doing the above, you can use .visualize_topics(), .visualize_barchart(), etc. If I am not mistaken, .get_topic_info() should now also contain the frequencies of the subset. Moreover, in theory, it should also allow you to use .topics_over_time(); just make sure you input only the docs, topics, and timestamps related to the subset (Author A), as we have removed all information about the other authors.
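
For completeness, those follow-up calls would roughly look like this; subset_docs, subset_topics, and subset_timestamps are hypothetical names that should hold only Author A's data (the 20 NewsGroups example itself has no timestamps):

# Visualizations now reflect the subset's topic representations
topic_model.visualize_topics()
topic_model.visualize_barchart(top_n_topics=10)
topic_model.get_topic_info()

# Dynamic topic modeling restricted to the subset
topics_over_time = topic_model.topics_over_time(subset_docs, subset_topics, subset_timestamps, nr_bins=20)
topic_model.visualize_topics_over_time(topics_over_time, top_n_topics=10)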

From what I can gather, this second scenario is closest to what you want to achieve. However, it does require playing around with the internal workings of BERTopic, which means I cannot promise everything will go flawlessly, although I do expect it to work quite well.

pearlyquek commented 2 years ago

Hi Maarten,

Thank you for your advice! I tried Scenario 2 and it worked like a charm :) Appreciate your help as always!

MaartenGr commented 2 years ago

No problem, glad I could be of help :)

austchan commented 1 year ago

Hi Maarten,

I've been using BERTopic to model topics discussed in some 50K Telegram messages across 5 channels. I'd like to also produce a dynamic model per channel, similar to the use case above. I've already trained a global model, and reloading this saved model for global dynamic topic modeling works like a charm. However, when I attempt to perform the scenarios above, it produces an error that 'topics' is undefined, likely because I'm not rerunning my model, but rather loading the saved model and going from there. Any idea of a workaround here? Thanks in advance for your help and for this great tool!

MaartenGr commented 1 year ago

@austchan The topics can also be accessed with topic_model.topics_ as they are also saved internally.
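
A minimal sketch of that, assuming the global model was saved with .save() (the path is hypothetical):

from bertopic import BERTopic

# Load the previously saved global model
topic_model = BERTopic.load("global_model")

# The topic assignments from fitting are stored on the model itself
topics = topic_model.topics_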

austchan commented 1 year ago

Thanks for the quick response (and apologies for the novice-level questions). I'm now getting an unexpected keyword argument error for the 'labels' parameter in _extract_words_per_topic(). I've tried troubleshooting a few different things, but keep getting the same result. Any ideas? Screenshot attached. Thanks again!!

[screenshot of the error]
MaartenGr commented 1 year ago

Functions that start with _ are those that do not directly impact the public API of the software and as such are often changed between versions. I believe the code above is from an older version of BERTopic that still had the labels parameter. It has been replaced with a documents parameter, which is created as follows:

documents = pd.DataFrame({"Document": documents,
                          "ID": range(len(documents)),
                          "Topic": model.topics_})

Here, however, you need to make sure documents is the subset that you are looking for.

In practice, I would advise checking what the latest version of BERTopic was when I wrote that code and using that version instead. Otherwise, you will have to change a lot of internal code to make this work.
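
For example, a quick way to check which version is currently installed before pinning an older one:

import bertopic
print(bertopic.__version__)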

austchan commented 1 year ago

Okay great, thanks for the help!