MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

Is it possible to use Flair for Sentence Embeddings? #145

Closed pankratz-l closed 3 years ago

pankratz-l commented 3 years ago

Hi Maarten,

Thank you for this great work! I am currently working with BERTopic and a German language model from Hugging Face. As far as I understand it, the topics are clustered at the document level, so each document is assigned to a single topic/cluster. Would it be possible to use sentence embeddings with this language model and thus cluster topics within documents (similar to LDA, where a document consists of multiple topics)? Sorry if this is a beginner question.
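One way to get sentence-level topics is to split each document into sentences and treat every sentence as its own document when fitting. A minimal sketch, assuming nltk and its "punkt" tokenizer data are available (not from this thread):

import nltk
from nltk.tokenize import sent_tokenize
from bertopic import BERTopic

nltk.download("punkt")

# Split each document into sentences so topics are assigned per sentence
sentences = [sent for doc in docs for sent in sent_tokenize(doc, language="german")]
topic_model = BERTopic(language="multilingual")
topics, probs = topic_model.fit_transform(sentences)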

Currently I provide BERTopic with a list of tokens for each document, without punctuation. This approach has worked best so far. I think that is because I use document embeddings, so the sentences don't matter?

Here is my code:

from flair.embeddings import TransformerDocumentEmbeddings
from bertopic import BERTopic

german = TransformerDocumentEmbeddings('german-nlp-group/electra-base-german-uncased')
topic_model = BERTopic(embedding_model=german, n_gram_range=(1, 2), top_n_words=15, min_topic_size=4).fit(docs)

MaartenGr commented 3 years ago

Thank you for your kind words!

Actually, BERTopic works best if you do not provide it with a list of tokens. Instead, I would highly advise providing it with full sentences, as BERT-based models are quite good at creating contextual embeddings.

Make sure that docs is a list of documents and not a list of tokens. Then, use the multilingual model that is provided within BERTopic by setting language="multilingual", or use embedding_model="paraphrase-multilingual-mpnet-base-v2". The latter is the best-performing model, whereas the former strikes a nice balance between speed and performance.

from bertopic import BERTopic

topic_model = BERTopic(language="multilingual")  # Either this
# topic_model = BERTopic(embedding_model="paraphrase-multilingual-mpnet-base-v2")  # or this
topics, probs = topic_model.fit_transform(docs)

I am a big fan of these models, which show exceptional performance even in multilingual cases. I suggest trying out the embeddings above and seeing what happens!
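If you want more control over the embedding model, BERTopic also accepts a sentence-transformers model instance directly. A minimal sketch, assuming the sentence-transformers package is installed:

from bertopic import BERTopic
from sentence_transformers import SentenceTransformer

# Load the multilingual model explicitly and pass the instance to BERTopic
embedding_model = SentenceTransformer("paraphrase-multilingual-mpnet-base-v2")
topic_model = BERTopic(embedding_model=embedding_model)
topics, probs = topic_model.fit_transform(docs)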

pankratz-l commented 3 years ago

Thank you so much, I am getting much better results now :). Now I just have to deal with stopwords.

MaartenGr commented 3 years ago

Glad to hear it works better now! There are two things you can do to prevent stopwords from dominating the topic representations. The first is simply adding more documents: the c-TF-IDF calculation, like the classic TF-IDF calculation, will most likely suppress stopwords if you have large amounts of data available.
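For intuition, the class-based TF-IDF weight that BERTopic uses is, per its documentation, roughly

W_{x,c} = tf_{x,c} \cdot \log\left(1 + \frac{A}{f_x}\right)

where tf_{x,c} is the frequency of word x in topic/class c, f_x is its frequency across all classes, and A is the average number of words per class. Stopwords occur frequently in every class, so f_x is large and the logarithmic term pushes their weight toward zero; with more documents this effect becomes more reliable.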

The second is more immediate: simply create your own CountVectorizer that removes stopwords of your choice:

from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer

# Replace this placeholder with your own stopword list (e.g. German stopwords)
my_list_of_stopwords = ["der", "die", "das", "und", "ist"]

vectorizer_model = CountVectorizer(stop_words=my_list_of_stopwords)
topic_model = BERTopic(vectorizer_model=vectorizer_model)
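If your model is already fitted, recent BERTopic versions also let you swap in the vectorizer without re-running the embedding and clustering steps. A sketch, assuming a version that supports update_topics with a vectorizer_model argument:

# Recompute only the topic representations using the new vectorizer
topic_model.update_topics(docs, vectorizer_model=vectorizer_model)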