Update: Although I used transformers 3.5.1, I still encountered problems; the resulting topics turned out to be very poorly separated. Please do not rely on this method.
Glad to hear the project could be of help.
The issue here is that Flair's dependencies conflict with those of sentence-transformers: they require different, incompatible versions of the same packages, so pip has trouble deciding which version to install. I can look into supporting the newest version of Flair (0.8) and explicitly pinning the transformers version in the pip install bertopic[flair] extra to see if that helps. It will take some experimentation, so it may be a few weeks before a fix is released.
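In the meantime, one way to see which versions pip actually resolved is to read the installed package metadata. This is only a small diagnostic sketch (requires Python 3.8+), not something BERTopic provides itself:

# Diagnostic sketch: print the installed versions of the packages whose
# dependency pins can conflict, to see what pip actually resolved.
from importlib.metadata import version, PackageNotFoundError

for pkg in ("bertopic", "flair", "sentence-transformers", "transformers"):
    try:
        print(f"{pkg}=={version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg} is not installed")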
I am curious, though, why you encountered badly separated topics. If I use the 'crawl' embeddings in Flair with transformers 3.5.1, the topics seem to be nicely separated. Could you share the code you used to create the topic model? Perhaps the transformers package version is not the culprit.
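For reference, a minimal sketch of what using 'crawl' in Flair looks like here (the fastText Common Crawl word embeddings pooled into document embeddings); the exact setup is illustrative, not copied from my experiments:

from flair.embeddings import WordEmbeddings, DocumentPoolEmbeddings
from bertopic import BERTopic

# 'crawl' loads fastText embeddings trained on Common Crawl; DocumentPoolEmbeddings
# pools the word vectors into a single vector per document.
crawl_embeddings = DocumentPoolEmbeddings([WordEmbeddings("crawl")])
topic_model = BERTopic(embedding_model=crawl_embeddings, verbose=True)
topics, probabilities = topic_model.fit_transform(docs)  # docs: your list of documents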
Thank you for your response. Here is my code. I used the same bert-base-chinese model (from Google), but I got different clustering performance with different versions of transformers.
import json
import jieba
from sklearn.feature_extraction.text import CountVectorizer
from flair.embeddings import TransformerDocumentEmbeddings
from bertopic import BERTopic

def tokenize_zh(text):
    # Segment Chinese text into words with jieba
    return jieba.lcut(text)

model_name = 'bert-base-chinese'

# Load a custom Chinese stop word list
with open("chinese_stop_words.json", "r") as f:
    stop_words = json.load(f)

vectorizer = CountVectorizer(tokenizer=tokenize_zh, stop_words=stop_words)
bert_zh = TransformerDocumentEmbeddings(model_name)
topic_model = BERTopic(embedding_model=bert_zh, top_n_words=15, min_topic_size=20,
                       verbose=True, vectorizer_model=vectorizer)
topics, probabilities = topic_model.fit_transform(docs)  # docs: my list of documents
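To judge how well the topics are separated, the fitted model can then be inspected, for example like this (a sketch; the method names assume a BERTopic release that exposes get_topic_freq() and get_topic()):

freq = topic_model.get_topic_freq()   # topic sizes; topic -1 is the outlier class
print(freq.head(10))
print(topic_model.get_topic(0))        # top words and scores for topic 0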
At least one thing I can note is that BERTopic, due to the inclusion of UMAP, is stochastic. This means that each time you will run BERTopic, regardless of whether you changed parameters, the resulting topics will differ. Therefore, it is difficult to state that the clustering performance is different with different versions of transformers since multiple runs will differ by definition.
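For example, a minimal sketch of fixing the UMAP random seed so that repeated runs become comparable (this assumes a BERTopic version that accepts a custom umap_model; the parameter values are illustrative):

from umap import UMAP
from bertopic import BERTopic

# Fixing random_state makes the UMAP reduction deterministic, so runs on the
# same data and parameters can be compared directly.
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0,
                  metric='cosine', random_state=42)

# bert_zh and vectorizer refer to the objects defined in the code above
topic_model = BERTopic(embedding_model=bert_zh, vectorizer_model=vectorizer,
                       umap_model=umap_model, verbose=True)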
Thank you for your reply. I will try more experiments.
Many thanks for this project, which is a great help for text analysis, especially for social science researchers who are less familiar with code.
However, I have encountered a problem with BERTopic that has bothered me for a long time. With the latest version of transformers, loading models from Hugging Face very often fails with various errors, including tokenizer errors.
Later, I downgraded transformers to 3.5.1 to match Flair's version (0.7), and everything worked smoothly.