MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

Suggest reminding users to install transformers 3.5.1 when using flair #134

Closed fireindark707 closed 3 years ago

fireindark707 commented 3 years ago

Many thanks to this project, which provides much help in text analysis, especially for social science researchers less familiar with code.

However, I have encountered a problem with BERTopic that has bothered me for a long time. If I use the latest version of transformers, loading models from Hugging Face very often fails with various errors, including tokenizer errors.

Later, I downgraded transformers to 3.5.1 to match flair's version (0.7), and everything ran smoothly.

fireindark707 commented 3 years ago

Update: Although I used version 3.5.1, I still encountered problems. The resulting topics became poorly separated. Please do not try this method.

MaartenGr commented 3 years ago

Glad to hear the project could be of help.

The issue here with Flair is that its dependencies conflict with those of Sentence-Transformers: they require different, incompatible versions of the same packages, so pip has difficulty choosing which version to install. However, I can look into using the newest version of Flair (0.8) and pinning the transformers version in the pip install bertopic[flair] procedure to see if that helps. It will take some experimentation, so it may be a few weeks before a fix is released.
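
In the meantime, a possible manual workaround, based only on the versions reported in this thread (not something the package pins for you), is to install the Flair extra and then explicitly pin the conflicting packages:

pip install bertopic[flair]
pip install flair==0.7 transformers==3.5.1

Note that pinning like this may break other packages in the same environment that expect a newer transformers, so a separate virtual environment is advisable.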

I am curious, though, why you encountered poorly separated topics. If I use 'crawl' in Flair with transformers 3.5.1, the topics seem to be nicely separated. Could you share the code for creating the topic model? Perhaps the transformers package version was not the culprit.
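
For reference, a minimal sketch of the kind of setup I mean with 'crawl' (Flair's fastText embeddings trained on Common Crawl), assuming English documents in a list called docs:

from flair.embeddings import WordEmbeddings, DocumentPoolEmbeddings
from bertopic import BERTopic

# Pool fastText 'crawl' word embeddings into document embeddings
crawl = DocumentPoolEmbeddings([WordEmbeddings("crawl")])
topic_model = BERTopic(embedding_model=crawl, verbose=True)
topics, probabilities = topic_model.fit_transform(docs)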

fireindark707 commented 3 years ago

Thank you for your response. Here is my code. I used the same bert-base-chinese model (Google), but I got different clustering performance with different versions of transformers.

import json

import jieba
from sklearn.feature_extraction.text import CountVectorizer
from flair.embeddings import TransformerDocumentEmbeddings
from bertopic import BERTopic

def tokenize_zh(text):
    # Segment Chinese text into words with jieba before counting terms
    return jieba.lcut(text)

model_name = 'bert-base-chinese'

# Chinese stop words stored as a JSON list
with open("chinese_stop_words.json", "r") as f:
    stop_words = json.load(f)

vectorizer = CountVectorizer(tokenizer=tokenize_zh, stop_words=stop_words)
bert_zh = TransformerDocumentEmbeddings(model_name)
topic_model = BERTopic(embedding_model=bert_zh, top_n_words=15, min_topic_size=20,
                       verbose=True, vectorizer_model=vectorizer)

# docs is a list of (Chinese) documents prepared earlier
topics, probabilities = topic_model.fit_transform(docs)

MaartenGr commented 3 years ago

At least one thing I can note is that BERTopic, due to the inclusion of UMAP, is stochastic. This means that each time you run BERTopic, regardless of whether you changed any parameters, the resulting topics will differ. Therefore, it is difficult to state that the clustering performance differs between transformers versions, since multiple runs will differ by definition.
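
If you want reproducible runs to compare versions fairly, one option, assuming a BERTopic version that exposes the umap_model parameter, is to pass in a UMAP model with a fixed random_state (the other parameter values below mirror BERTopic's defaults):

from umap import UMAP
from bertopic import BERTopic

# Fixing random_state makes UMAP (and thus BERTopic) reproducible across runs,
# at the cost of disabling UMAP's parallelism
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0,
                  metric="cosine", random_state=42)
topic_model = BERTopic(umap_model=umap_model)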

fireindark707 commented 3 years ago

Thank you for your reply. I will try more experiments.