MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

HELP! CountVectorizer and ClassTfidfTransformer changing clustering results #1193

Closed jdweaver14 closed 1 year ago

jdweaver14 commented 1 year ago

Discussed in https://github.com/MaartenGr/BERTopic/discussions/1192

Originally posted by **jdweaver14** April 17, 2023

Hi everyone, I apologize for not posting images here, but I am not at liberty to share the data I am working with.

I am new to BERTopic, but I was under the impression that the ClassTfidfTransformer and CountVectorizer steps are only used after the embedding step (which involves no preprocessing) and the clustering step, to improve the topic representation. However, when I run the same dataset with a fixed random_state through the model with and without either or both of these steps, I get both a different number of clusters and different counts per cluster. This should not happen if these steps only operate on the already-computed cluster information. Please help me understand what I am missing!
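
The comparison described above can be made concrete with a small check. This is a minimal sketch, assuming the per-document topic labels from each run are available as plain lists (in BERTopic these would typically be the first value returned by `fit_transform`). The helper name `compare_assignments` and the label values are hypothetical and are not taken from the original data.

```python
from collections import Counter


def compare_assignments(labels_a, labels_b):
    """Summarize how two per-document topic assignments differ.

    Cluster sizes are compared as sorted lists, so a pure renumbering of
    topic IDs between runs does not count as a size difference.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both runs must label the same documents")

    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    sizes_a = sorted(counts_a.values(), reverse=True)
    sizes_b = sorted(counts_b.values(), reverse=True)

    # Indices whose label ID differs; this part IS sensitive to renumbering.
    changed = [i for i, (a, b) in enumerate(zip(labels_a, labels_b)) if a != b]

    return {
        "n_clusters": (len(counts_a), len(counts_b)),
        "same_sizes": sizes_a == sizes_b,
        "changed_docs": changed,
    }


# Hypothetical labels from two runs over the same six documents
# (-1 is used here as an outlier label and is counted as its own group).
baseline = [0, 0, 1, 1, -1, 2]
with_vectorizer = [0, 0, 1, 1, 1, 1]

report = compare_assignments(baseline, with_vectorizer)
print(report)
```

If `same_sizes` is true and `changed_docs` is empty, the two runs produced the same clustering and only the topic representation can differ. Otherwise, the indices in `changed_docs` show which documents to inspect first.
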
MaartenGr commented 1 year ago

Seeing as this is discussed in #1192, I am going to close this in favor of that discussion.