MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/

Visualisation after merge_models #1700

Open mjaved-nz opened 6 months ago

mjaved-nz commented 6 months ago

Hi @MaartenGr ,

I am trying to visualise the topics after merging the models, but I am getting an error. Could you please guide me on how to fix it?

I fitted a model on two different datasets:

keybert = KeyBERTInspired()
mmr = MaximalMarginalRelevance(diversity=0.3)
representation_models = [keybert, mmr]

topic_model = BERTopic(
    language="english",
    top_n_words=100,
    verbose=True,
    representation_model=representation_models,
    vectorizer_model=CountVectorizer(ngram_range=(1, 10), stop_words="english"),
)
print("Fitting model")
topic_model.fit(docs)

This resulted in two models, model1 and model2.

Loading the models to merge:

topic_model1 = BERTopic.load("model1")
topic_model2 = BERTopic.load("model2")

Merging the models:

merged_model = BERTopic.merge_models([topic_model1, topic_model2], min_similarity=0.99)

Visualisation (topics over time):

timestamps = df['Date'].to_list()
topics_over_time = merged_model.topics_over_time(docs, timestamps, nr_bins=10, global_tuning=True, evolution_tuning=True)
topic_over_time = merged_model.visualize_topics_over_time(topics_over_time, topics=[1, 2, 5, 7, 14], title='', width=800, height=400, custom_labels=True)
topic_over_time.write_html('topics_over_time_test.html')

ERROR:

topics_over_time = merged_model.topics_over_time(docs, timestamps, nr_bins=10, global_tuning=True, evolution_tuning=True)
  File ".conda\envs\bertopic2\lib\site-packages\bertopic\_bertopic.py", line 768, in topics_over_time
    global_c_tf_idf = normalize(self.c_tf_idf_, axis=1, norm='l1', copy=False)
  File ".conda\envs\bertopic2\lib\site-packages\sklearn\utils\_param_validation.py", line 204, in wrapper
    validate_parameter_constraints(
  File ".conda\envs\bertopic2\lib\site-packages\sklearn\utils\_param_validation.py", line 96, in validate_parameter_constraints
    raise InvalidParameterError(
sklearn.utils._param_validation.InvalidParameterError: The 'X' parameter of normalize must be an array-like or a sparse matrix. Got None instead.

Thanks

MaartenGr commented 6 months ago

When you merge two models, their c-TF-IDF representations are not merged since their tokenizers might be different. As a result, calculating c-TF-IDF representations is not possible with a merged model unless you merge the c-TF-IDF representations of the individual models (including the vectorizer).

mjaved-nz commented 6 months ago

Hi @MaartenGr

Thanks for your reply. I used the following fit function in the same environment for both models, so I am not sure how each model can end up with a different tokenizer. Could you please guide me to a solution?

topic_model_1 = BERTopic(min_topic_size=5).fit(docs1)
topic_model_2 = BERTopic(min_topic_size=5).fit(docs2)

# Combine all models into one
merged_model = BERTopic.merge_models([topic_model_1, topic_model_2], min_similarity=0.95)

Then, with docs_all = docs1 + docs2:

timestamps = review_data_all['Date'].to_list()
topics_over_time = merged_model.topics_over_time(docs_all, timestamps, nr_bins=10)
topic_over_time = merged_model.visualize_topics_over_time(topics_over_time, top_n_topics=200)
topic_over_time.write_html(f'{destination_path}topicovertime_Noguided.html')

Error:

topics_over_time = merged_model.topics_over_time(docs_all, timestamps, nr_bins=10)
  File ".conda\envs\bertopic2\lib\site-packages\bertopic\_bertopic.py", line 768, in topics_over_time
    global_c_tf_idf = normalize(self.c_tf_idf_, axis=1, norm='l1', copy=False)
  File ".conda\envs\bertopic2\lib\site-packages\sklearn\utils\_param_validation.py", line 204, in wrapper
    validate_parameter_constraints(
  File ".conda\envs\bertopic2\lib\site-packages\sklearn\utils\_param_validation.py", line 96, in validate_parameter_constraints
    raise InvalidParameterError(
sklearn.utils._param_validation.InvalidParameterError: The 'X' parameter of normalize must be an array-like or a sparse matrix. Got None instead.

mjaved-nz commented 6 months ago

Hi @MaartenGr,

I'm working on identifying the best solution for processing and visualizing discussion topics from various data sources. Currently, I extract data weekly and combine it with existing data to analyze the overall discussion trends. To avoid redundantly processing everything each week, I proposed merging the current week's model with the existing model for a more efficient overall analysis. However, I'm encountering an error (previously mentioned) that's preventing this approach. Could you suggest an alternative solution for my situation? Thanks

MaartenGr commented 6 months ago

Thanks for your reply. I used the following fit function in the same environment for both models, so I am not sure how each model can end up with a different tokenizer. Could you please guide me to a solution?

What I meant is that when you use .merge_models, BERTopic cannot assume that the same tokenizer is being used. As such, the CountVectorizer is not merged and will remain empty.
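
You can verify this on the merged model itself; the c-TF-IDF matrix is simply absent after merging, which is exactly why normalize receives None in your traceback:

# Quick check (sketch): no c-TF-IDF matrix survives the merge
print(merged_model.c_tf_idf_)  # None, hence the InvalidParameterError above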

For instance, these two models have different tokenization schemes:

vectorizer_model=CountVectorizer(ngram_range=(1, 2) , stop_words="english")
vectorizer_model=CountVectorizer(ngram_range=(1, 3))

Instead, you would have to either merge the fitted CountVectorizers yourself or re-fit one of them on the entire dataset.
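
For the former, a minimal sketch (assuming both models were fitted with the same tokenization settings; merged_vocabulary is a name introduced here for illustration):

from sklearn.feature_extraction.text import CountVectorizer

# Combine the vocabularies of the two fitted CountVectorizers
vocab_1 = topic_model_1.vectorizer_model.get_feature_names_out()
vocab_2 = topic_model_2.vectorizer_model.get_feature_names_out()
merged_vocabulary = sorted(set(vocab_1) | set(vocab_2))

# A vectorizer fixed to the merged vocabulary needs no further fitting
merged_model.vectorizer_model = CountVectorizer(vocabulary=merged_vocabulary)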

The latter can be done as follows:

from umap import UMAP
from bertopic import BERTopic
from datasets import load_dataset
import pandas as pd

dataset = load_dataset("CShorten/ML-ArXiv-Papers")["train"]

# Extract abstracts to train on, split into two sets
abstracts_1 = dataset["abstract"][:5_000]
abstracts_2 = dataset["abstract"][5_000:10_000]

# Create topic models
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', random_state=42)
topic_model_1 = BERTopic(umap_model=umap_model, min_topic_size=20).fit(abstracts_1)
topic_model_2 = BERTopic(umap_model=umap_model, min_topic_size=20).fit(abstracts_2)

# Merge models
merged_model = BERTopic.merge_models([topic_model_1, topic_model_2])

# Prepare all documents
documents = pd.DataFrame(
    {
        "Document": abstracts_1 + abstracts_2,
        "ID": range(len(abstracts_1)+len(abstracts_2)),
        "Topic": merged_model.topics_,
        "Image": None
    }
)
documents_per_topic = documents.groupby(['Topic'], as_index=False).agg({'Document': ' '.join})

# Assign CountVectorizer to merged model
merged_model.vectorizer_model = topic_model_1.vectorizer_model

# Re-calculate c-TF-IDF
_, _ = merged_model._c_tf_idf(documents_per_topic)

Note that you will have to choose a vectorizer model from one of the fitted topic models.

mjaved-nz commented 6 months ago

Hi @MaartenGr,

Thanks for your reply. I'm still encountering the same error when trying to visualise the topics over time after merging the models in the following script. Could you please help me identify any missing bits or potential errors in it?

docs1 = review_data1.Text.to_list()
docs2 = review_data2.Text.to_list()

umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', random_state=42)

topic_model_1 = BERTopic(umap_model=umap_model, min_topic_size=20).fit(docs1)
topic_model_2 = BERTopic(umap_model=umap_model, min_topic_size=20).fit(docs2)

# Merge models
merged_model = BERTopic.merge_models([topic_model_1, topic_model_2])

# Prepare all documents
documents = pd.DataFrame(
    {
        "Document": docs1 + docs2,
        "ID": range(len(docs1) + len(docs2)),
        "Topic": merged_model.topics_,
        "Image": None
    }
)

documents_per_topic = documents.groupby(['Topic'], as_index=False).agg({'Document': ' '.join})

# Assign CountVectorizer to merged model
merged_model.vectorizer_model = topic_model_1.vectorizer_model

# Re-calculate c-TF-IDF
_, _ = merged_model._c_tf_idf(documents_per_topic)

# Get dates
timestamps1 = review_data1['Date'].to_list()
timestamps2 = review_data2['Date'].to_list()
timestamps = timestamps1 + timestamps2

# Visualise topics over time
topics_over_time = merged_model.topics_over_time(docs1 + docs2, timestamps, nr_bins=10)
topic_over_time = merged_model.visualize_topics_over_time(topics_over_time, top_n_topics=20)
topic_over_time.write_html('topicovertime_Noguided.html')

Error:

topics_over_time = merged_model.topics_over_time(docs1 + docs2, timestamps, nr_bins=10)
  File ".conda\envs\bertopic2\lib\site-packages\bertopic\_bertopic.py", line 768, in topics_over_time
    global_c_tf_idf = normalize(self.c_tf_idf_, axis=1, norm='l1', copy=False)
  File ".conda\envs\bertopic2\lib\site-packages\sklearn\utils\_param_validation.py", line 204, in wrapper
    validate_parameter_constraints(
  File ".conda\envs\bertopic2\lib\site-packages\sklearn\utils\_param_validation.py", line 96, in validate_parameter_constraints
    raise InvalidParameterError(
sklearn.utils._param_validation.InvalidParameterError: The 'X' parameter of normalize must be an array-like or a sparse matrix. Got None instead.

MaartenGr commented 6 months ago

Ah right, instead of this:

# Re-calculate c-TF-IDF
_, _ = merged_model._c_tf_idf(documents_per_topic)

I think it should be something like this, but you'll have to check:

# Re-calculate c-TF-IDF
c_tf_idf, _ = merged_model._c_tf_idf(documents_per_topic)
merged_model.c_tf_idf_ = c_tf_idf
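
If that works, the time-based calls from your earlier script should then run, for instance:

# Sketch: with c_tf_idf_ restored, the earlier visualisation code should work
topics_over_time = merged_model.topics_over_time(docs1 + docs2, timestamps, nr_bins=10)
merged_model.visualize_topics_over_time(topics_over_time, top_n_topics=20)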

mjaved-nz commented 6 months ago

Thanks @MaartenGr,

It works for me.

Some more queries:

  1. I am getting NaN for all topics in Representative_Docs of the merged_model. How can I get Representative_Docs after merging the models?
  2. By default, there are 4 Representative_Docs for each topic. Can I increase this without using LLMs?

MaartenGr commented 6 months ago

I am getting NaN for all topics in Representative_Docs of the merged_model. How can I get Representative_Docs after merging the models?

You would have to do this manually. Since representative documents might leak personal data, for instance in a federated learning setting, they are not kept when merging the models.
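
A rough sketch of the manual route, reusing the merged model and documents from the earlier scripts and relying on a private helper whose behaviour may differ between BERTopic versions:

import pandas as pd

# Hypothetical sketch: rebuild Representative_Docs from the combined documents.
# _save_representative_docs is internal API; verify it against your installed version.
documents = pd.DataFrame({"Document": docs1 + docs2, "Topic": merged_model.topics_})
merged_model._save_representative_docs(documents)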

By default, there are 4 Representative_Docs for each topic. Can I increase this without using LLMs?

There is currently no parameter for that, but if you check some of the open/closed issues, you will find several discussing how you could increase it yourself with some of the internal functions; for example, this issue. Although you would have to check it yourself.
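
As a rough illustration of that route, reusing the documents DataFrame from the sketch above (hypothetical: _extract_representative_docs is a private helper, and its signature and return values vary between releases, so check the source of your installed version first):

# Hypothetical sketch: request more representative docs per topic via an
# internal helper. Inspect `outputs` rather than unpacking it blindly, since
# the return values differ between BERTopic versions.
outputs = merged_model._extract_representative_docs(
    merged_model.c_tf_idf_,
    documents,                             # DataFrame from the sketch above
    merged_model.topic_representations_,
    nr_repr_docs=10,                       # raise beyond the default
)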