Closed · sgdantas closed this 2 years ago
The `model.c_tf_idf` matrix is generated from the `CountVectorizer` used to model the word frequencies. You can use `model.vectorizer_model` to access that vectorizer and extract the words through `model.vectorizer_model.get_feature_names()`.
Hopefully, this helps!
That works, thanks Maarten! I just have two last questions:

1. Some stopwords seem to end up in `top_n_words`. In this case, do you think it is worth it to clean the text a bit? The overall topic quality is OK, but I wonder if I can improve it.
2. The sparse matrix is (num_topics + 1) x (num_words); is the last row related to topic -1?

Thanks a lot for all your hard work, and for making it available for everyone :)

No problem, glad to hear that it works!

> the sparse matrix is (num_topics + 1) x (num_words), is the last row related to topic -1?
No, the first row is related to topic -1, then topic 0, then topic 1, etc. So if you want to access the c-TF-IDF representation for topic 23, you will have to access `topic_model.c_tf_idf[23 + 1]`.
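Since the rows are offset by one, a tiny helper (hypothetical, not part of BERTopic) makes the mapping explicit; `topic_model` in the trailing comment is assumed to be an already fitted BERTopic model:

```python
# Row layout of c_tf_idf: row 0 -> topic -1 (outliers),
# row 1 -> topic 0, row 2 -> topic 1, and so on.
def c_tf_idf_row(topic_id: int) -> int:
    """Map a BERTopic topic id to its row index in the c-TF-IDF matrix."""
    return topic_id + 1

print(c_tf_idf_row(-1))  # 0  (outlier topic)
print(c_tf_idf_row(23))  # 24

# With a fitted model (assumption: `topic_model` is a fitted BERTopic):
# scores_for_topic_23 = topic_model.c_tf_idf[c_tf_idf_row(23)]
```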
> In this case, do you think it is worth it to clean the text a bit?
In general, it is not necessary to clean the text. However, as in your case, that does not mean it will never be helpful. In your use case, it seems that stopwords are finding their way into the topic representations, and I can definitely imagine not wanting them there.

There are two ways of approaching this. First, you can indeed clean the text up a bit. It might negatively influence the clustering quality, but I would not be too worried about that. This way, you are working on the text directly, which influences both the clustering and the topic representations. Second, you can change only the topic representation by passing a custom `CountVectorizer` when instantiating BERTopic. In that vectorizer, you can set `stop_words`, which will remove the stopwords only when creating the topic representation, not when creating the embeddings. I believe the second option is the quickest to implement and will most likely result in the biggest improvement.
Awesome, thanks for the suggestions!
Hey, I saw this issue and I wanted to get the P(word|topic): https://github.com/MaartenGr/BERTopic/issues/144

You suggested accessing it using `model.c_tf_idf`, but I still need the words that were used to generate the sparse matrix. Looking at the source code, I saw where it is defined, but it does not seem easy to access. Is there a "standard" way to get it?

Thanks!