My assumption was I could use BERTopic().fit on this large corpus and then infer it on a dump of a smaller channel (let's call it target data) with BERTopic().transform.
This will not work since .fit updates the internal model but .transform does not. It will not generate new topics or update them.
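In code, the flow looks roughly like this (just a sketch; docs_large and docs_target are placeholder names for your 200k-post corpus and the target channel dump):

```python
from bertopic import BERTopic

docs_large = ["..."]   # placeholder: the large 200k-post training corpus
docs_target = ["..."]  # placeholder: the ~900-post target channel dump

topic_model = BERTopic()
topic_model.fit(docs_large)  # topics are learned here, from docs_large only

# .transform only assigns the already learned topics to new documents;
# it does not create new topics and does not change the existing ones.
topics, probs = topic_model.transform(docs_target)
```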
I found that the resulting topics contained buzzwords that are not part of the target data. Is there a way to restrict the model trained on the large corpus to only use the words from the target data when transforming it?
All words in the topics are derived from the data that you train on. This means that it is not possible to get words that are not in the data if you train BERTopic without any additional representation models.
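If you want the topic descriptions themselves restricted to words that actually occur in the target data, one thing you could experiment with (just a sketch, not necessarily the best route for your case) is recomputing the topic representations over the target documents with update_topics, since the keywords are then extracted only from the documents you pass in:

```python
# Sketch: topic_model is the model fitted on the large corpus above and
# docs_target is the smaller channel dump (placeholders from the earlier sketch).
topics, probs = topic_model.transform(docs_target)

# Recompute the topic keywords from the target documents only, so the
# representations can no longer contain words that never occur in them.
topic_model.update_topics(docs_target, topics=topics)
```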
Also the result is a lot of topics, like 500+ topics for a corpus of 900 posts. I suppose this is also related to having it trained on the large corpus and not restricted to the words of the target data.
The number of topics that are generated is a direct result of the min_topic_size parameter. If you increase that value, you will get fewer topics.
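For example (50 is only an illustrative value, and docs_large is the placeholder corpus from the sketch above):

```python
from bertopic import BERTopic

# A larger min_topic_size means a cluster needs more documents to become a
# topic, which directly lowers the number of topics you end up with.
topic_model = BERTopic(min_topic_size=50)
topic_model.fit(docs_large)

# Number of topics found, including the -1 outlier topic.
print(len(topic_model.get_topic_info()))
```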
I have since used 'BERTopic().reduce_topics(large_corpus, nr_topics=50)' to reduce the topics.
I would advise not using .reduce_topics but instead working with min_topic_size first to reduce the number of topics created. If you run into too many outliers, then use .reduce_outliers.
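Something along these lines (a sketch; the update_topics call at the end is optional but keeps the topic keywords in sync with the reassigned documents):

```python
from bertopic import BERTopic

# Control the number of topics through min_topic_size at fit time ...
topic_model = BERTopic(min_topic_size=50)
topics, probs = topic_model.fit_transform(docs_large)

# ... and, if many documents land in the -1 outlier topic, reassign them to
# their closest topics instead of merging topics with reduce_topics.
new_topics = topic_model.reduce_outliers(docs_large, topics)
topic_model.update_topics(docs_large, topics=new_topics)
```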
Sadly the resulting topics now only contain (German) stop words.
You can remove stopwords by passing them as a list to the CountVectorizer.
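For instance (the stop word list below is only a stand-in; use a full German list such as the one from NLTK or one you maintain yourself):

```python
from sklearn.feature_extraction.text import CountVectorizer
from bertopic import BERTopic

# Illustrative, incomplete list of German stop words; swap in a complete list,
# e.g. nltk.corpus.stopwords.words("german").
german_stopwords = ["und", "der", "die", "das", "ist", "nicht", "ein", "eine"]

vectorizer_model = CountVectorizer(stop_words=german_stopwords)
topic_model = BERTopic(vectorizer_model=vectorizer_model)
```

The same vectorizer can also be applied to a model that is already trained via topic_model.update_topics(docs, vectorizer_model=vectorizer_model), so you do not have to refit from scratch.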
I have used SentenceTransformer('distilbert-base-nli-mean-tokens').
You are using an English model for German texts. You should use either a German model or a multilingual one. The following is a good multilingual model: paraphrase-multilingual-mpnet-base-v2.
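Swapping it in is a one-line change when constructing the model:

```python
from sentence_transformers import SentenceTransformer
from bertopic import BERTopic

# Multilingual sentence-transformer model, so the German posts are embedded properly.
embedding_model = SentenceTransformer("paraphrase-multilingual-mpnet-base-v2")
topic_model = BERTopic(embedding_model=embedding_model)
```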
Based on your questions, I would advise two things. First, reading through the underlying algorithmic steps of BERTopic to get a sense of how it trains. Second, there is a page about best practices in BERTopic with a number of steps that should give you better results. Some of the questions you asked are answered in both and I think that there are a number of interesting tricks that you can use for your use case.
Thank you for your swift reply. Sorry I didn't read the docs thoroughly enough previously. I will now.
This will not work since .fit updates the internal model but .transform does not. It will not generate new topics or update them.
Thanks for making the differences clear. Maybe the docs could also say ", use transform to predict to which topics new instances belong. No new topics are computed." to make it clearer for noobs like me.
What I want to do: I want to find out what topics the authors of certain Telegram channels post about. With the standard BERTopic the resulting topics are not very informative and are overlapping. For example (in German):
{0: 'uhr menschen bürger protest freien grimma oschatz uta kretschmer hesse ', 1: 'uhr polizei grimma antifa menschen bürger straße kretschmer demonstration protest ', 2: 'uhr grimma menschen protest polizei kretschmer geht bürger oschatz straße ', 3: 'uhr bürger leipziger protest menschen oschatz straße spaziergang grimma 19 ', 4: 'uhr leipziger bürger menschen antifa uta straße oschatz hesse freien ', 5: 'antifa polizei szene leipziger freien grimma bürger straße linksextremisten stadt ', 6: 'polizei demonstration demo eben demonstrationen november szene etwa mittel linke '}
Each channel dump is only around 900 posts of max 100 words. I think the corpora are just too small for BERT to learn the importance of buzzwords like antifa. So I crawled a lot more channels and now have a corpus of 200k posts. My assumption was I could use BERTopic().fit on this large corpus and then infer it on a dump of a smaller channel (let's call it target data) with BERTopic().transform.
Does this general approach make sense or did I do something completely nonsensical? I found that the resulting topics contained buzzwords that are not part of the target data. Is there a way to restrict the model trained on the large corpus to only use the words from the target data when transforming it?

Also the result is a lot of topics, like 500+ topics for a corpus of 900 posts. I suppose this is also related to having it trained on the large corpus and not restricted to the words of the target data. I have since used 'BERTopic().reduce_topics(large_corpus, nr_topics=50)' to reduce the topics. Sadly the resulting topics now only contain (German) stop words.

I have used SentenceTransformer('distilbert-base-nli-mean-tokens'). No special config of UMAP or HDBSCAN was used. What are your recommendations?