MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License
6.14k stars, 765 forks

Updating topic frequencies #1675

Open sdipti opened 11 months ago

sdipti commented 11 months ago

Thank you for creating this! I do not come from a coding background and this is the first time I have worked with a model - BERTopic made it really easy for me to do my analysis and also made it easier to understand how models work.

I got an output after fitting the model, which I then ran through topic distributions. After checking the output, I saw that the topic distributions map topics to documents better. E.g., document X was assigned to topic 1 in the initial model with a lower probability, whereas the topic distributions showed that topic 10 is a better match for document X.

Therefore, I want to assign each document the topic with the highest probability in the topic distributions array. To do this, I extracted a topic list (in the same format as 'topics') with one topic per row of the dataframe, chosen by highest probability.

Now, when I am trying to update the topic frequencies using:

documents = pd.DataFrame({"Document": docs, "Topic": topics2})
loaded_model._update_topic_size(documents)

This is the output I get:

image

I am unable to run any visualizations as well:

image

Please help me get around this issue. Also, is there a way to integrate 'topic distributions' during the process of fitting the model? I feel that may give me better results.

Thanks a lot!

MaartenGr commented 11 months ago

You should not use private functions (loaded_model._update_topic_size(documents)) in general, since no support can be given for them. They tend to evolve over time and can easily break between versions. Instead, if you reassign topics, I would advise using the .update_topics function. You can find more about that here: https://maartengr.github.io/BERTopic/getting_started/topicrepresentation/topicrepresentation.html

sdipti commented 11 months ago

I get the same error when using the .update_topics function.

MaartenGr commented 11 months ago

Could you share your full code when using .update_topics?

sdipti commented 11 months ago

Could you share your full code when using .update_topics?

! pip install bertopic
from bertopic import BERTopic

import pandas as pd

df = pd.read_csv('/content/drive/MyDrive/my_files/dataframe.csv', index_col = False)
docs = df['text']

loaded_model = BERTopic.load("/content/drive/MyDrive/BERTopic")

topic_distributions_df = pd.read_csv('/content/drive/MyDrive/my_files/topic_distributions_df.csv', index_col = False)

topics2 = topic_distributions_df.idxmax(axis=1)

topics2 = topics2.tolist()

loaded_model.update_topics(docs, topics=topics2)

MaartenGr commented 11 months ago

Thanks for sharing the code. I am missing how you trained the model and saved those distributions. Without it, I can't really reproduce the issue you are facing. Make sure that you share an end-to-end example, including the code for getting the issue/error.

sdipti commented 11 months ago

Loaded the dataset:

import pandas as pd

df = pd.read_csv('/content/drive/MyDrive/my_files/dataframe.csv', index_col = False)
docs = df['text']

I trained my model using this code:

from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import PCA
from hdbscan import HDBSCAN
from bertopic.representation import KeyBERTInspired, MaximalMarginalRelevance

sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = sentence_model.encode(docs, show_progress_bar=False)

from umap import UMAP

dim_model = UMAP(n_neighbors=7, n_components=5, min_dist=0.0, metric='cosine', random_state=42) 

cluster_model = HDBSCAN(min_cluster_size=20, min_samples=1, metric='euclidean', prediction_data=True)

vectorizer_model = CountVectorizer(stop_words="english", ngram_range=(1, 3), min_df=0.001)

representation_model = [KeyBERTInspired(top_n_words=30), MaximalMarginalRelevance(diversity=.4)]

topic_model = BERTopic(
    embedding_model=sentence_model, 
    umap_model=dim_model, 
    hdbscan_model=cluster_model, 
    vectorizer_model=vectorizer_model, 
    representation_model=representation_model,

    top_n_words=20)

topics, probs = topic_model.fit_transform(docs, embeddings)

I then went on to compute the topic distributions:

topic_distr, _ = topic_model.approximate_distribution(docs, batch_size=500, window = 40, stride = 30, use_embedding_model=True)

Saved my model:

embedding_model = "sentence-transformers/all-MiniLM-L6-v2"
topic_model.save("/content/drive/MyDrive/BERTopic", serialization="safetensors", save_ctfidf=True, save_embedding_model=embedding_model)

I realised that the topic distributions are not saved along with the model, so I saved them in a separate dataframe:

topic_distributions_df = pd.DataFrame(topic_distr)

topic_distributions_df.to_csv("topic_distributions_df.csv")
!cp topic_distributions_df.csv /content/drive/MyDrive/my_files

After this, I loaded my model in a separate session, using the code I shared in the previous comment.

sdipti commented 11 months ago

When loading the model and topic distributions separately, I can still generate the visualizations for probabilities after converting "topic_distributions_df" to an array.

topic_distributions = topic_distributions_df.to_numpy()

loaded_model.visualize_distribution(topic_distributions[15])

The above seems to be working fine, but the issue arises when I try to reassign the topics to the documents.

MaartenGr commented 11 months ago

I believe you are not correctly saving the topic distributions to file. Note that when you run the following:

import pandas as pd
topic_distributions_df = pd.DataFrame(topic_distr)
topic_distributions_df.to_csv("topic_distributions_df.csv")
topic_distributions_df

You are saving the topic distributions including the index column.

So when you run the following, the saved index is read back as an extra 'Unnamed: 0' column:

topic_distributions_df = pd.read_csv('topic_distributions_df.csv', index_col = False)
topic_distributions_df

Make sure you either skip saving the index or drop it when you load it. So something like this:

topic_distributions_df = topic_distributions_df.drop('Unnamed: 0', axis=1)
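The other option, skipping the index when saving, could look roughly like the sketch below. The small array is a made-up stand-in for topic_distr, and in-memory buffers stand in for the CSV file, so treat the names and values as illustrative only.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical stand-in for the array returned by approximate_distribution
topic_distr = np.array([[0.1, 0.7, 0.2],
                        [0.6, 0.3, 0.1]])

# Default to_csv also writes the index; on reload it shows up as "Unnamed: 0"
with_index = io.StringIO()
pd.DataFrame(topic_distr).to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index, index_col=False)

# index=False writes only the topic columns, so nothing needs dropping later
without_index = io.StringIO()
pd.DataFrame(topic_distr).to_csv(without_index, index=False)
without_index.seek(0)
reloaded_without_index = pd.read_csv(without_index)

print(list(reloaded_with_index.columns))
print(list(reloaded_without_index.columns))
```

Either way, the column labels that come back from the CSV header are strings, not integers.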

A small tip: when you share your code, make sure it is your full code. This means that it should be an end-to-end example. Even the code you shared for training the model is missing the code for updating the frequencies. As a result, I have to stitch together code from several of your messages, which is quite time-consuming.

sdipti commented 11 months ago

Thanks for your response. I had already removed the index from my df. I did it again just now. Sharing the entire code below.

! pip install bertopic
from bertopic import BERTopic

import pandas as pd

df = pd.read_csv('/content/drive/MyDrive/my_files/dataframe.csv', index_col = False)
docs = df['text']

loaded_model = BERTopic.load("/content/drive/MyDrive/BERTopic")

loaded_model.get_topic_info()

I get the following output:

image

I then import the topic distributions:

topic_distributions_df = pd.read_csv('/content/drive/MyDrive/my_files/topic_distributions_df.csv', index_col = False)

topic_distributions_df = topic_distributions_df.drop(['Unnamed: 0'], axis = 1)

topic_distributions_df

This is the output:

image

Then I find the topics which have the highest probability values.

topics2 = topic_distributions_df.idxmax(axis=1)

topics2

I get the following output:

image

Then I create a list, because topics is a list:

new_topic_freq = topics2.tolist()

new_topic_freq

I get a list like this:

image

Then I use .update_topics :

loaded_model.update_topics(docs, topics=new_topic_freq)

loaded_model.get_topic_info()

I get the following output. The topic frequencies have been updated, but the topic names and representations have vanished:

image

I try to visualise topics but that fails, probably because topic names and representations have vanished:

image

image

After this, if I load the model again and use .update_topics with the topics2 Series instead of the list, I get exactly the same output as above:

loaded_model = BERTopic.load("/content/drive/MyDrive/BERTopic")

loaded_model.update_topics(docs, topics=topics2)

I also try to use the new_topic_freq list with ._update_topic_size. To do this I reload the model and run the following code:

loaded_model = BERTopic.load("/content/drive/MyDrive/BERTopic")

documents = pd.DataFrame({"Document": docs, "Topic": new_topic_freq})

loaded_model._update_topic_size(documents)

loaded_model.get_topic_info()

I get the following output (same as above):

image

The topic frequencies got updated but names and representations have vanished.

I hope I was able to explain better this time. How can I get around this issue? Also is there a way to integrate topic_distr into the model while fitting the model? Even if that's not possible, can I just extract the highest probabilities from the topic_distr and feed it into the topic_model directly?

Thanks a lot!

MaartenGr commented 11 months ago

Thanks for your response. I had already removed the index from my df.

That's the thing, in the code you shared this was not the case. That's why it's so important to share your full and complete example as during debugging that's the first thing I stumbled upon.

Sharing the entire code below.

Thank you for sharing more of the code. Do note that the training procedure, saving the distributions, and saving the model are all missing from this code. Again, make sure to share an end-to-end example.

Having said that, the issue here is that you passed the topics as a list of strings instead of a list of integers. The following worked for me and can be used in the future as an illustration of an end-to-end example:

# Installation:
# !pip install datasets bertopic safetensors

import pandas as pd
from datasets import load_dataset
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import PCA
from umap import UMAP
from hdbscan import HDBSCAN
from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired, MaximalMarginalRelevance

# We select a subsample of 5000 abstracts from ArXiv
dataset = load_dataset("CShorten/ML-ArXiv-Papers")["train"]
docs = dataset["abstract"][:5_000]

# Embed data
sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = sentence_model.encode(docs, show_progress_bar=True)

# Sub models
dim_model = UMAP(n_neighbors=7, n_components=5, min_dist=0.0, metric='cosine', random_state=42)
cluster_model = HDBSCAN(min_cluster_size=20, min_samples=1, metric='euclidean', prediction_data=True)
vectorizer_model = CountVectorizer(stop_words="english", ngram_range=(1, 3), min_df=0.001)
representation_model = [KeyBERTInspired(top_n_words=30), MaximalMarginalRelevance(diversity=.4)]

# BERTopic
topic_model = BERTopic(
    embedding_model=sentence_model,
    umap_model=dim_model,
    hdbscan_model=cluster_model,
    vectorizer_model=vectorizer_model,
    representation_model=representation_model,
    top_n_words=20
)
topics, probs = topic_model.fit_transform(docs, embeddings)

# Save BERTopic
embedding_model = "sentence-transformers/all-MiniLM-L6-v2"
topic_model.save("my_model", serialization="safetensors", save_ctfidf=True, save_embedding_model=embedding_model)

# Calculate distributions
topic_distr, _ = topic_model.approximate_distribution(docs, batch_size=500, window = 40, stride = 30, use_embedding_model=True)
topic_distributions_df = pd.DataFrame(topic_distr)
topic_distributions_df.to_csv("topic_distributions_df.csv")

# Load model and distributions
loaded_model = BERTopic.load("my_model")
topic_distributions_df = pd.read_csv('topic_distributions_df.csv', index_col = False)
topic_distributions_df = topic_distributions_df.drop(['Unnamed: 0'], axis = 1)

# Update Topics
topics2 = topic_distributions_df.idxmax(axis=1)
new_topic_freq = topics2.tolist()
new_topic_freq = [int(x) for x in new_topic_freq]
loaded_model.update_topics(docs, topics=new_topic_freq)

# Check output
loaded_model.get_topic_info()
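To illustrate where the strings come from: idxmax(axis=1) returns column labels, and labels read from a CSV header are strings. The sketch below uses a made-up two-document distribution and an in-memory CSV in place of the real file; only the int cast mirrors the example above.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for topic_distributions_df.csv
# (index column already removed); each row is one document.
csv_text = "0,1,2\n0.1,0.7,0.2\n0.6,0.3,0.1\n"
topic_distributions_df = pd.read_csv(io.StringIO(csv_text))

# idxmax returns the column *label* of the largest value in each row,
# and labels parsed from a CSV header are strings such as "1".
labels = topic_distributions_df.idxmax(axis=1).tolist()
print(labels)

# Cast to int so the list has the same type as the original topics list.
new_topics = [int(label) for label in labels]
print(new_topics)
```

The cast is the only step that differs from the code that produced empty topic names.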

sdipti commented 11 months ago

This worked. Thanks a ton!

sdipti commented 11 months ago

One more thing - after running the above, the topic representations have also changed, which is ok. But it seems that less relevant keywords are showing up than before. Please see the before and after screenshots below.

BEFORE:

image

AFTER:

image

Can I run it through KeyBERTInspired and MMR again?

MaartenGr commented 11 months ago

Ah right, I forgot! If you want to use the same representations:

loaded_model.update_topics(docs, topics=new_topic_freq, representation_model=representation_model)

sdipti commented 11 months ago

Great! Thank you.

jjsnlee commented 10 months ago

Thanks @MaartenGr for that example, that kind of end-to-end example was really helpful! One question: why do you save the results of approximate_distribution instead of the topics output from the fit_transform call?

MaartenGr commented 10 months ago

@jjsnlee The example I showed followed the OP's original code, so I merely adjusted that to work in line with what the OP wanted to achieve.
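As a side note for anyone landing here: if the distributions are still in memory, the CSV round trip (and with it the string labels) can be skipped by taking the argmax of the array directly. The sketch below is not from the thread. It assumes that column j of topic_distr corresponds to topic j, and that a row can be all zeros when no topic matched a document; mapping such rows to -1 (the outlier topic) is my own choice, so check it against your own output.

```python
import numpy as np

# Hypothetical stand-in for topic_distr; column j is assumed to be topic j.
topic_distr = np.array([
    [0.1, 0.7, 0.2],
    [0.0, 0.0, 0.0],  # assumed case: no topic matched this document
    [0.6, 0.3, 0.1],
])

# argmax gives the integer column position of the largest value per row
new_topics = topic_distr.argmax(axis=1)

# Assumption: treat all-zero rows as outliers (-1) instead of topic 0,
# which is what argmax would otherwise report for them.
new_topics[topic_distr.sum(axis=1) == 0] = -1

# Convert NumPy integers to plain Python ints
new_topics = [int(t) for t in new_topics]
print(new_topics)
```

The resulting list can then be passed as topics in the same way as new_topic_freq in the example above.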