MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

Using BERTopic on Chinese and Japanese Texts #1157

Open Damen-C opened 1 year ago

Damen-C commented 1 year ago

Hello Maarten, there is one thing I would like to mention when using BERTopic to analyze Chinese and Japanese texts. If we run the following code to analyze Chinese or Japanese:

from bertopic import BERTopic

topic_model_multi = BERTopic(language="multilingual", calculate_probabilities=True, verbose=True)
topics_multi, probs_multi = topic_model_multi.fit_transform(texts)

We will get the results that look like the following (in case of analyzing Japanese texts):

9_おめでとう_おはようございます_おはありです_ありがとうございます
10_10552100本祭は雨天時は29日へ順延タイムスケジュール等は後日流れるのでお待ちく...
11_野党がひたすら揚げ足取りをしているというのはどういうことでしょうか私は今通常国会開会か...

However, the topics here are more like sentences than words.

After some research, I found online posts indicating that for languages that do not separate words with spaces the way English or French do, we first need to transform the texts into a space-separated format, and only then run BERTopic on the transformed texts. Hence, we cannot use BERTopic on Chinese or Japanese (texts that are not separated by spaces) right out of the box without any preprocessing; the sketch below illustrates the problem.
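To make the failure mode concrete, here is a minimal illustration of my own (the example sentence is made up, not from the thread): whitespace splitting treats an unsegmented Japanese sentence as a single token, while a pre-segmented version splits into usable words.

# Illustration only: whitespace tokenization fails on unsegmented Japanese
text = "私は日本語を勉強しています"
print(text.split())        # ['私は日本語を勉強しています'] -- the whole sentence is one "token"

segmented = "私 は 日本語 を 勉強 し て い ます"
print(segmented.split())   # ['私', 'は', '日本語', ...] -- words a whitespace tokenizer can count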

I would like to ask your opinion on this issue. If you agree with what is discussed here, could you add a note to the documentation so that more people will know about this?

Thanks a lot!

Damen-C commented 1 year ago

I found this post in your Q and A section, so I think it answers my question about Chinese texts. Does it also apply to Japanese texts?

How can I use BERTopic with Chinese documents?

Currently, CountVectorizer tokenizes text by splitting on whitespace, which does not work for Chinese. To get it to work, you will have to create a custom CountVectorizer with jieba:
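For reference, the snippet in that FAQ entry is along these lines (a sketch from memory, assuming the jieba package is installed; the exact code in the docs may differ slightly). The custom vectorizer is then passed to BERTopic through its vectorizer_model parameter.

from sklearn.feature_extraction.text import CountVectorizer
import jieba

def tokenize_zh(text):
    # jieba.lcut segments a Chinese string into a list of words
    return jieba.lcut(text)

vectorizer = CountVectorizer(tokenizer=tokenize_zh)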

MaartenGr commented 1 year ago

I found this post in your Q and A section, so I think it answers my question about Chinese texts. Does it also apply to Japanese texts?

Yes, the same general principle applies to Japanese texts: you would have to give the CountVectorizer a tokenizer that can tokenize Japanese, and then it should work. If you have an example of a well-known Japanese tokenizer, I can add it to the documentation.

Damen-C commented 1 year ago

Thanks for your quick response. One example is to use MeCab. You can find its documentation here: https://github.com/SamuraiT/mecab-python3

Following your convention, I modified the example you provided for Chinese texts:

from sklearn.feature_extraction.text import CountVectorizer
from bertopic import BERTopic
import MeCab

def tokenize_jp(text):
    # "-Owakati" makes MeCab return the sentence as space-separated words
    words = MeCab.Tagger("-Owakati").parse(text).split()
    return words

vectorizer = CountVectorizer(tokenizer=tokenize_jp)
topic_model = BERTopic(language="japanese", calculate_probabilities=True, verbose=True, vectorizer_model=vectorizer)
topics, probs = topic_model.fit_transform(texts)

Then the results look much more like topics. Example:

4_ああ_ドン_な_県
5_質問_匿名_募集_中
6_コンテンツ_論_応用_2020
7_first_sixtonesann_you_i

MaartenGr commented 1 year ago

Glad to hear that the tokenizer works. I will have to do a bit more research on tokenizers that work well on Chinese and Japanese texts and add those to the documentation.
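For anyone landing here before the documentation is updated: one frequently mentioned alternative MeCab wrapper is fugashi. The sketch below is mine, not a recommendation from this thread; it assumes fugashi is installed with a dictionary (e.g. pip install 'fugashi[unidic-lite]').

from sklearn.feature_extraction.text import CountVectorizer
import fugashi

# Create the tagger once rather than per call; it loads the dictionary on construction
tagger = fugashi.Tagger()

def tokenize_jp(text):
    # word.surface is the token text exactly as it appears in the sentence
    return [word.surface for word in tagger(text)]

vectorizer = CountVectorizer(tokenizer=tokenize_jp)

As with the MeCab example above, this vectorizer would be passed to BERTopic via vectorizer_model.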

Damen-C commented 1 year ago

Thanks for your efforts. If you have any more questions about analyzing Japanese or Chinese texts with your tool, feel free to reach out. I am currently doing NLP research in Japan, so I would be happy to contribute! I look forward to the updated documentation.