MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

Is there a way to retrieve the words used to generate the tf-idf? #331

Closed · sgdantas closed this issue 2 years ago

sgdantas commented 2 years ago

Hey, I saw issue #144 (https://github.com/MaartenGr/BERTopic/issues/144) and I wanted to get P(word|topic).

You suggested accessing it using model.c_tf_idf, but I still need the words that were used to generate the sparse matrix. By looking at the source code, I saw where that's defined, but it doesn't seem easy to access.

Is there a "standard" way to get it?

Thanks!

MaartenGr commented 2 years ago

The model.c_tf_idf matrix is generated from the CountVectorizer used to model the frequencies of words. You can use model.vectorizer_model to access that vectorizer and extract the words through model.vectorizer_model.get_feature_names().
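For reference, a minimal sketch of lining the vocabulary up with the c-TF-IDF matrix, assuming `docs` is your list of documents. (On more recent versions, the attribute is `c_tf_idf_` and scikit-learn has renamed `get_feature_names` to `get_feature_names_out`.)

```python
from bertopic import BERTopic

model = BERTopic()
topics, probs = model.fit_transform(docs)  # docs: a list of strings (assumed)

# Columns of the c-TF-IDF matrix correspond to the vectorizer's vocabulary
words = model.vectorizer_model.get_feature_names()
c_tf_idf = model.c_tf_idf  # sparse matrix of shape (num_topics + 1, num_words)

assert len(words) == c_tf_idf.shape[1]
```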

Hopefully, this helps!

sgdantas commented 2 years ago

That works, thanks Maarten! I just have two last questions:

1. The sparse matrix is (num_topics + 1) x (num_words); is the last row related to topic -1?
2. In this case, do you think it is worth it to clean the text a bit?

MaartenGr commented 2 years ago

No problem, glad to hear that it works!

> The sparse matrix is (num_topics + 1) x (num_words); is the last row related to topic -1?

No, the first row is related to topic -1, then 0, then 1, etc. So if you want to access the c-TF-IDF representation for topic 23, you will have to access model.c_tf_idf[23 + 1].
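As a sketch of that indexing, here is how you might reconstruct the top words for topic 23 by hand; the result should mirror what `model.get_topic(23)` returns:

```python
import numpy as np

# Row 0 holds the outlier topic -1, so topic t lives at row t + 1
topic = 23
row = model.c_tf_idf[topic + 1].toarray().ravel()  # 1 x num_words sparse row -> dense vector

# Highest-scoring words for this topic
words = model.vectorizer_model.get_feature_names()
top_idx = np.argsort(row)[::-1][:10]
for i in top_idx:
    print(words[i], row[i])
```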

> In this case, do you think it is worth it to clean the text a bit?

In general, it is not necessary to clean the text. However, as in your case, that does not mean it is never helpful. It seems that stopwords are finding their way into your topic representations, and I can definitely imagine not wanting them there.

There are two ways of approaching this. First, you can indeed clean up the text a bit. It might negatively influence the clustering quality, but I would not be too worried about that. This way, you are changing the text directly, which influences both the clustering and the topic representations. Second, you can change only the topic representation by passing a custom CountVectorizer when instantiating BERTopic. If you set stop_words in that vectorizer, stopwords are removed only when creating the topic representation, not when creating the embeddings. I believe the second option is the quickest to implement and will most likely result in the biggest improvement; see the sketch below.
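A minimal sketch of the second option (whether you use scikit-learn's built-in English stopword list or pass your own is up to you):

```python
from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer

# Stopwords are removed only when building the topic representations;
# the embeddings used for clustering still see the full, uncleaned text.
vectorizer_model = CountVectorizer(stop_words="english")
model = BERTopic(vectorizer_model=vectorizer_model)
topics, probs = model.fit_transform(docs)
```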

sgdantas commented 2 years ago

Awesome, thanks for the suggestions!