MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

Is there a way to retrieve the words used to generate the tf-idf? #331

Closed · sgdantas closed this issue 2 years ago

sgdantas commented 2 years ago

Hey, I saw issue #144 (https://github.com/MaartenGr/BERTopic/issues/144) and I wanted to get P(word|topic).

You suggested accessing it using model.c_tf_idf, but I still need the words that were used to generate the sparse matrix. By looking at the source code, I saw where that's defined, but it doesn't seem easy to access.

Is there a "standard" way to get it?

Thanks!

MaartenGr commented 2 years ago

The model.c_tf_idf matrix is generated from the CountVectorizer used to model the frequencies of words. You can use model.vectorizer_model to access that vectorizer and extract the words through model.vectorizer_model.get_feature_names().
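For reference, a minimal sketch of lining the vocabulary up with the c-TF-IDF matrix, assuming `docs` is your list of documents. (On more recent versions, the attribute is `c_tf_idf_` and scikit-learn has renamed `get_feature_names` to `get_feature_names_out`.)

```python
from bertopic import BERTopic

model = BERTopic()
topics, probs = model.fit_transform(docs)  # docs: a list of strings (assumed)

# Columns of the c-TF-IDF matrix correspond to the vectorizer's vocabulary
words = model.vectorizer_model.get_feature_names()
c_tf_idf = model.c_tf_idf  # sparse matrix of shape (num_topics + 1, num_words)

assert len(words) == c_tf_idf.shape[1]
```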

Hopefully, this helps!

sgdantas commented 2 years ago

That works, thanks Maarten! I just have two last questions:

1. The sparse matrix is (num_topics + 1) x (num_words); is the last row related to topic -1?
2. In this case, do you think it is worth it to clean the text a bit?

MaartenGr commented 2 years ago

No problem, glad to hear that it works!

> The sparse matrix is (num_topics + 1) x (num_words); is the last row related to topic -1?

No, the first row is related to topic -1, then 0, then 1, etc. So if you want to access the c-TF-IDF representation for topic 23, you will have to access model.c_tf_idf[23 + 1].
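As a sketch of that indexing, here is how you might reconstruct the top words for topic 23 by hand; the result should mirror what `model.get_topic(23)` returns:

```python
import numpy as np

# Row 0 holds the outlier topic -1, so topic t lives at row t + 1
topic = 23
row = model.c_tf_idf[topic + 1].toarray().ravel()  # 1 x num_words sparse row -> dense vector

# Highest-scoring words for this topic
words = model.vectorizer_model.get_feature_names()
top_idx = np.argsort(row)[::-1][:10]
for i in top_idx:
    print(words[i], row[i])
```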

> In this case, do you think it is worth it to clean the text a bit?

In general, it is not necessary to clean the text. However, as in your case, that does not mean it is never helpful. It seems that stopwords are finding their way into your topic representations, and I can definitely imagine not wanting them there.

There are two ways of approaching this. First, you can indeed clean up the text a bit. It might negatively influence the clustering quality, but I would not be too worried about that. This way, you are changing the text directly, which influences both the clustering and the topic representations. Second, you can change only the topic representation by passing a custom CountVectorizer when instantiating BERTopic. If you set stop_words in that vectorizer, stopwords are removed only when creating the topic representation, not when creating the embeddings. I believe the second option is the quickest to implement and will most likely result in the biggest improvement; see the sketch below.
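A minimal sketch of the second option (whether you use scikit-learn's built-in English stopword list or pass your own is up to you):

```python
from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer

# Stopwords are removed only when building the topic representations;
# the embeddings used for clustering still see the full, uncleaned text.
vectorizer_model = CountVectorizer(stop_words="english")
model = BERTopic(vectorizer_model=vectorizer_model)
topics, probs = model.fit_transform(docs)
```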

sgdantas commented 2 years ago

Awesome, thanks for the suggestions!