MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

About pyLDAvis Visualization in BERTopic #196

Closed gsalfourn closed 3 years ago

gsalfourn commented 3 years ago

@MaartenGr

I am an R enthusiast who is new to Python. I have read your posts on "Interactive Topic Modeling with BERTopic" and its predecessor, "Topic Modeling with BERT". Thanks for an awesome package. I know BERTopic has a visualization similar to pyLDAvis, but I was wondering whether it is possible to extract information from BERTopic that can be used in pyLDAvis.

To visualize BERTopic using pyLDAvis, I would need the topic-term distributions, the document-topic distributions, and information about the corpus on which the model was trained.

MaartenGr commented 3 years ago

Thank you for your kind words!

I believe it should be possible to visualize BERTopic using pyLDAvis, although I have not done so myself. The main issue with doing so is that the topic-term distributions will not be entirely accurate. This mostly has to do with how BERTopic creates those representations.

There are two steps involved in creating the topic representations. First, we apply c-TF-IDF to the clusters of documents to generate candidate words for each topic. This would be the topic-term distributions that you could use for pyLDAvis. The second step leverages MMR to make sure that the topic representations are a bit more coherent and stable. However, this does not generate a topic-term distribution; it is merely a selection of terms.

In other words, the topic-term distributions generated in the first step do not perfectly match the topic representations generated in the second. I am explaining this because the visualization you will get in pyLDAvis is an unoptimized view of BERTopic. It is by no means a poor view, just not the entire picture.
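To make the "selection of terms" point concrete, here is a minimal sketch of Maximal Marginal Relevance over candidate terms. It is not BERTopic's own implementation: the function name `mmr_select`, the `diversity` parameter, and the toy 2-D vectors are assumptions made purely for illustration. Note that the result is a list of chosen term indices, not a distribution over the vocabulary.

```python
import numpy as np

def mmr_select(term_vecs, topic_vec, top_k=3, diversity=0.3):
    """Pick top_k term indices by Maximal Marginal Relevance (illustrative only)."""
    def unit(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    terms = unit(np.asarray(term_vecs, dtype=float))
    topic = unit(np.asarray(topic_vec, dtype=float))
    relevance = terms @ topic      # cosine similarity of each term to the topic
    pairwise = terms @ terms.T     # cosine similarity between terms

    # Start with the most relevant term, then trade relevance against redundancy.
    selected = [int(np.argmax(relevance))]
    while len(selected) < min(top_k, len(terms)):
        candidates = [i for i in range(len(terms)) if i not in selected]
        redundancy = pairwise[np.ix_(candidates, selected)].max(axis=1)
        scores = (1 - diversity) * relevance[candidates] - diversity * redundancy
        selected.append(candidates[int(np.argmax(scores))])
    return selected

# Hypothetical embeddings: terms 0 and 1 are near-duplicates, term 2 is different.
term_vecs = [[1.0, 0.0], [0.99, 0.1], [0.6, 0.8]]
topic_vec = [1.0, 0.2]

diverse = mmr_select(term_vecs, topic_vec, top_k=2, diversity=0.7)   # [1, 2]
relevant = mmr_select(term_vecs, topic_vec, top_k=2, diversity=0.0)  # [1, 0]
```

With a high `diversity` the near-duplicate term is skipped in favour of a less similar one, which is why the final topic words can differ from the top c-TF-IDF weights that pyLDAvis would display.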

MaartenGr commented 3 years ago

Technically, this is how you would approach using BERTopic with pyLDAvis. However, it does not seem to work as of right now due to a nasty Int64Index error which I cannot figure out:

import pyLDAvis
import numpy as np
from bertopic import BERTopic

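# Train Model (`docs` is assumed to be a list of strings defined beforehand)
topic_model = BERTopic(verbose=True, calculate_probabilities=True)
topics, probs = topic_model.fit_transform(docs)

# Prepare data for PyLDAVis
top_n = 5

# First top_n + 1 rows of the c-TF-IDF matrix; these are weights, not probabilities
topic_term_dists = topic_model.c_tf_idf.toarray()[:top_n+1, ]
new_probs = probs[:, :top_n]
# Treat the remaining probability mass as the outlier topic
outlier = np.array(1 - new_probs.sum(axis=1)).reshape(-1, 1)
doc_topic_dists = np.hstack((new_probs, outlier))
# Document lengths measured in characters
doc_lengths = [len(doc) for doc in docs]
vocab = [word for word in topic_model.vectorizer_model.vocabulary_.keys()]
# Note: vocabulary_ maps each word to its column index, so these values are indices rather than counts
term_frequency = [topic_model.vectorizer_model.vocabulary_[word] for word in vocab]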

data = {'topic_term_dists': topic_term_dists,
        'doc_topic_dists': doc_topic_dists,
        'doc_lengths': doc_lengths,
        'vocab': vocab,
        'term_frequency': term_frequency}

# Visualize using pyLDAvis
vis_data= pyLDAvis.prepare(**data, mds='mmds')
pyLDAvis.display(vis_data)

Having said that, it might work on your dataset.
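One detail worth checking in the snippet above: scikit-learn's `vocabulary_` maps each term to its column index, so reading its values does not give term counts. Below is a minimal, pure-Python sketch of how `vocab`, `term_frequency`, and token-based `doc_lengths` could be built so that they line up with the matrix columns. The helper name `vocab_and_term_frequency` and the toy data are hypothetical, and the documents are assumed to be already tokenized.

```python
from collections import Counter

def vocab_and_term_frequency(tokenized_docs, vocabulary):
    """Return (vocab, term_frequency), both ordered by column index.

    `vocabulary` maps term -> column index, like CountVectorizer.vocabulary_;
    its values are positions in the matrix, not counts.
    """
    counts = Counter(tok for doc in tokenized_docs for tok in doc)
    vocab = sorted(vocabulary, key=vocabulary.get)      # column order
    term_frequency = [counts[term] for term in vocab]   # corpus-wide counts
    return vocab, term_frequency

# Toy corpus and a hypothetical term -> column-index mapping.
tokenized_docs = [["topic", "model", "topic"], ["model", "bert"]]
vocabulary = {"topic": 2, "bert": 0, "model": 1}

vocab, term_frequency = vocab_and_term_frequency(tokenized_docs, vocabulary)
doc_lengths = [len(doc) for doc in tokenized_docs]  # tokens per document
```

Sorting by index matters because column `j` of the topic-term matrix has to correspond to `vocab[j]` and `term_frequency[j]`.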

gsalfourn commented 3 years ago

Thanks for the prompt response. I tried out the code on a small sample of data (just to see how it will work out). Besides some deprecation warnings, I also got an error message.

2021-08-09 22:31:22,038 - BERTopic - Transformed documents to Embeddings
2021-08-09 22:31:27,828 - BERTopic - Reduced dimensionality with UMAP
c:\python\python39\lib\site-packages\hdbscan\hdbscan_.py:275: DeprecationWarning: `np.bool` is a deprecated alias for the builtin `bool`. To silence this warning, use `bool` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.bool_` here.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  alg = KDTreeBoruvkaAlgorithm(tree, min_samples, metric=metric,
c:\python\python39\lib\site-packages\hdbscan\hdbscan_.py:56: DeprecationWarning: `np.int` is a deprecated alias for the builtin `int`. To silence this warning, use `int` by itself. Doing this will not modify any behavior and is safe. When replacing `np.int`, you may wish to use e.g. `np.int64` or `np.int32` to specify the precision. If you wish to review your current use, check the release note link for additional information.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  condensed_tree = condense_tree(single_linkage_tree,
c:\python\python39\lib\site-packages\hdbscan\hdbscan_.py:59: DeprecationWarning: `np.bool` is a deprecated alias for the builtin `bool`. To silence this warning, use `bool` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.bool_` here.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  labels, probabilities, stabilities = get_clusters(condensed_tree,
2021-08-09 22:31:27,852 - BERTopic - Clustered UMAP embeddings with HDBSCAN
---------------------------------------------------------------------------

The error message was related to the number of rows of topic_term_dists not matching the number of columns of doc_topic_dists:

ValidationError                           Traceback (most recent call last)
<ipython-input-2-775585b401be> in <module>
     27 
     28 # Visualize using pyLDAvis
---> 29 vis_data= pyLDAvis.prepare(**data, mds='mmds')
     30 pyLDAvis.display(vis_data)

c:\python\python39\lib\site-packages\pyLDAvis\_prepare.py in prepare(topic_term_dists, doc_topic_dists, doc_lengths, vocab, term_frequency, R, lambda_step, mds, n_jobs, plot_opts, sort_topics, start_index)
    413     doc_lengths = _series_with_name(doc_lengths, 'doc_length')
    414     vocab = _series_with_name(vocab, 'vocab')
--> 415     _input_validate(topic_term_dists, doc_topic_dists, doc_lengths, vocab, term_frequency)
    416     R = min(R, len(vocab))
    417 

c:\python\python39\lib\site-packages\pyLDAvis\_prepare.py in _input_validate(*args)
     72     res = _input_check(*args)
     73     if res:
---> 74         raise ValidationError('\n' + '\n'.join([' * ' + s for s in res]))
     75 
     76 

ValidationError: 
 * Number of rows of topic_term_dists does not match number of columns of doc_topic_dists; both should be equal to the number of topics in the model.

MaartenGr commented 3 years ago

Could you share the entire snippet of code that you tried it out on? It could also be that you simply need more data for this to work.

gsalfourn commented 3 years ago

@MaartenGr

Below is the code I used along with the text data

## import regex module
import re

## path to the data file
path = 'D:/Python/bertopic/fl_data/fl_data_excerpts.txt'

## reading the data
with open(path, 'r', encoding='utf8') as f:
    contents = f.read()
    line_tabs = re.sub('\t', ' ', contents)
    line_spaces = re.sub(' +', ' ', line_tabs)
    text_data = re.split(r"\.|\?|\!", line_spaces)

print(text_data[:5])
print(type(text_data))

## pyLDAvis implementation
import pyLDAvis
import numpy as np
from bertopic import BERTopic

# Train Model
topic_model = BERTopic(verbose=True, calculate_probabilities=True)
topics, probs = topic_model.fit_transform(text_data)

# Prepare data for PyLDAVis
top_n = 5

topic_term_dists = topic_model.c_tf_idf.toarray()[:top_n+1, ]
new_probs = probs[:, :top_n]
outlier = np.array(1 - new_probs.sum(axis=1)).reshape(-1, 1)
doc_topic_dists = np.hstack((new_probs, outlier))
doc_lengths = [len(doc) for doc in text_data]
vocab = [word for word in topic_model.vectorizer_model.vocabulary_.keys()]
term_frequency = [topic_model.vectorizer_model.vocabulary_[word] for word in vocab]

data = {'topic_term_dists': topic_term_dists,
        'doc_topic_dists': doc_topic_dists,
        'doc_lengths': doc_lengths,
        'vocab': vocab,
        'term_frequency': term_frequency}

# Visualize using pyLDAvis
vis_data= pyLDAvis.prepare(**data, mds='mmds')
pyLDAvis.display(vis_data)

I have attached the snippet of text used: fl_data_excerpts.txt

MaartenGr commented 3 years ago

Sorry it took a while to figure this out, but it seems that the text you used is simply too small to be usable in this case. Since it only generates 2 topics, one of which is the outlier topic, there are issues with slicing the data.

If you use a larger dataset that generates multiple topics, like 20 Newsgroups, it should not give that ValidationError.
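A quick pre-flight check can catch this kind of mismatch before calling pyLDAvis. The sketch below is only an illustration: `shape_problems` is a hypothetical helper, the first check mirrors the ValidationError quoted earlier in this thread, and the other two are plain sanity checks that are assumed rather than taken from pyLDAvis. It also shows why a small model is risky: slicing past the end of a sequence is silent and simply returns fewer rows.

```python
def shape_problems(topic_term_shape, doc_topic_shape, n_vocab, n_doc_lengths):
    """Return a list of human-readable shape mismatches (empty if none)."""
    n_topic_rows, n_terms = topic_term_shape
    n_docs, n_topic_cols = doc_topic_shape
    problems = []
    if n_topic_rows != n_topic_cols:
        problems.append(
            f"topic_term_dists has {n_topic_rows} rows but "
            f"doc_topic_dists has {n_topic_cols} columns"
        )
    if n_terms != n_vocab:
        problems.append(
            f"topic_term_dists has {n_terms} columns but vocab has {n_vocab} entries"
        )
    if n_docs != n_doc_lengths:
        problems.append(
            f"doc_topic_dists has {n_docs} rows but doc_lengths has {n_doc_lengths} entries"
        )
    return problems

# Slicing past the end does not raise: a 2-row table sliced with [:6] keeps 2 rows.
top_n = 5
rows = [[0.5, 0.5], [0.2, 0.8]]
sliced = rows[:top_n + 1]

# Hypothetical shapes: 2 topic rows versus 6 document-topic columns.
problems = shape_problems((len(sliced), 2), (10, top_n + 1), 2, 10)
```

With NumPy arrays the same check could be called as `shape_problems(topic_term_dists.shape, doc_topic_dists.shape, len(vocab), len(doc_lengths))`.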

gsalfourn commented 3 years ago

Hi Maarten,

Thanks for your assistance. I ran the model on a larger dataset, but now it appears that some of the topic_term distributions have very small probabilities, so their sum does not equal 1. Below is the message I get. Any suggestions on how to resolve this issue?

---------------------------------------------------------------------------
ValidationError                           Traceback (most recent call last)
<ipython-input-12-f75802444872> in <module>
     29 
     30 # Visualize using pyLDAvis
---> 31 vis_data= pyLDAvis.prepare(**data, mds='mmds')
     32 pyLDAvis.display(vis_data)

c:\python\python39\lib\site-packages\pyLDAvis\_prepare.py in prepare(topic_term_dists, doc_topic_dists, doc_lengths, vocab, term_frequency, R, lambda_step, mds, n_jobs, plot_opts, sort_topics, start_index)
    413     doc_lengths = _series_with_name(doc_lengths, 'doc_length')
    414     vocab = _series_with_name(vocab, 'vocab')
--> 415     _input_validate(topic_term_dists, doc_topic_dists, doc_lengths, vocab, term_frequency)
    416     R = min(R, len(vocab))
    417 

c:\python\python39\lib\site-packages\pyLDAvis\_prepare.py in _input_validate(*args)
     72     res = _input_check(*args)
     73     if res:
---> 74         raise ValidationError('\n' + '\n'.join([' * ' + s for s in res]))
     75 
     76 

ValidationError: 
 * Not all rows (distributions) in topic_term_dists sum to 1.

MaartenGr commented 3 years ago

That is strange, since the topic_term_dists never summed to 1: those values do not represent probabilities at all. Have you changed any of the code? You could try to normalize the c-TF-IDF matrix so that each row sums to 1.
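A minimal sketch of that normalization with NumPy, assuming the matrix has already been converted to a dense, non-negative array (for example with `.toarray()`). The helper name `normalize_rows` is hypothetical, and whether the rescaled c-TF-IDF weights are meaningful as probabilities is a separate question.

```python
import numpy as np

def normalize_rows(matrix):
    """Scale each row of a non-negative matrix so that it sums to 1."""
    dense = np.asarray(matrix, dtype=float)
    row_sums = dense.sum(axis=1, keepdims=True)
    # Leave all-zero rows as zeros instead of dividing by zero.
    return np.divide(dense, row_sums, out=np.zeros_like(dense), where=row_sums > 0)

# Toy weights standing in for a slice of the c-TF-IDF matrix.
weights = [[1.0, 3.0], [0.0, 0.0], [2.0, 2.0]]
normalized = normalize_rows(weights)
```

Applied to the earlier snippet, this would wrap the slice of the c-TF-IDF array before it is passed on as `topic_term_dists`. A row that is entirely zero stays zero here, so it would still fail a "rows sum to 1" check and would need to be dropped or handled separately.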

MaartenGr commented 3 years ago

Due to inactivity, I will be closing this for now. However, if you run into issues please let me know and I'll re-open the issue!

rafaelvalero commented 2 years ago

Thanks for the comments. I used the code above to visualise the topics and it looks fine. In case it is useful: https://github.com/rafaelvalero/different_notebooks/blob/master/bertopics_pyldavis.ipynb

bala1802 commented 2 years ago

Hi Maarten,

outlier = np.array(1 - new_probs.sum(axis=1)).reshape(-1, 1)

This code snippet will only work when the outlier topic is at the first index. Sometimes BERTopic places the outlier topic -1 at a different location.

image

As you can see above, Topic -1 is at index 4.
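One way to avoid depending on where -1 sits is to look the rows up by topic label rather than by position. The sketch below is pure Python and makes several assumptions: `row_topics` is a hypothetical list of topic labels in the same order as the rows of the topic-term matrix (however you obtain it for your BERTopic version), and the target order is topics 0 to top_n - 1 followed by the outlier, which matches how `doc_topic_dists` is stacked in the snippet earlier in this thread.

```python
def rows_in_column_order(row_topics, top_n):
    """Return row indices ordered as topics 0..top_n-1 followed by the outlier (-1)."""
    position = {topic: row for row, topic in enumerate(row_topics)}
    wanted = list(range(top_n)) + [-1]
    missing = [topic for topic in wanted if topic not in position]
    if missing:
        raise ValueError(f"topics not found in model: {missing}")
    return [position[topic] for topic in wanted]

# Hypothetical row order with the outlier at index 4, as in the screenshot.
row_topics = [0, 1, 2, 3, -1, 4]
order = rows_in_column_order(row_topics, top_n=3)  # [0, 1, 2, 4]
```

With a NumPy array the rows could then be picked with fancy indexing, for example `matrix[order]`, so that row `i` of the topic-term matrix describes the same topic as column `i` of the document-topic matrix.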

spookyuser commented 2 years ago

Just FYI, it would be c_tf_idf_ now :)

allanckw commented 1 year ago

Hi, I think there was a version change and probs is no longer a 2D array? I am unable to use rafaelvalero's code on a model trained with 0.14.0.

MaartenGr commented 1 year ago

@allanckw The probabilities are either 1D or 2D depending on whether you have set calculate_probabilities=True.
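A small guard makes this failure mode explicit before any column slicing such as `probs[:, :top_n]` is attempted. This is only a sketch: `require_2d_probs` is a hypothetical helper, and the toy arrays stand in for whatever `fit_transform` returned.

```python
import numpy as np

def require_2d_probs(probs):
    """Return probs as a 2-D array, or raise a descriptive error if it is not 2-D."""
    probs = np.asarray(probs)
    if probs.ndim != 2:
        raise ValueError(
            f"expected a 2-D (documents x topics) array but got {probs.ndim}-D; "
            "was the model fitted with calculate_probabilities=True?"
        )
    return probs

# Toy stand-ins: one probability per document versus one row per document.
probs_1d = np.array([0.9, 0.4, 0.7])
probs_2d = np.array([[0.9, 0.1], [0.4, 0.6], [0.7, 0.3]])

checked = require_2d_probs(probs_2d)
```

Calling `require_2d_probs(probs_1d)` raises the ValueError, which points at the model setting instead of failing later with an indexing error.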

allanckw commented 1 year ago

Hi @MaartenGr

Thanks for the tip! 👍

Off to retrain my models