bjpietrzak · closed 1 year ago
I want to create a BERTopic model architecture that will be able to extract topics from any list of documents and still give reasonable results when fitted to said documents. Is it even possible? My modifications to the BERTopic pipeline didn't yield satisfying results.
Could you go a bit more in-depth into what the issue is? It is not clear from your description what the main problem is that you are facing. Are you not happy with the output? If so, why? Is it an issue with the keywords? Have you tried following the best practices guide?
Make sure to be as complete as possible.
> Have you tried following the best practices guide?
Yes, I have.
> Could you go a bit more in-depth into what the issue is?
Certainly. The main problem is with the clustering of embeddings. For some app reviews, the clusters seem too broad to me.
For example, these are the topics I was able to extract after fitting the model to YouTube app reviews. Everything is fine in this example:
- '1_chromecast_casting_google_app'
- '2_youtube_tv_app_time'
- '3_youtube_tv_cable_channels'
- '4_commercials_ads_commercial_minutes'
- '5_yttv_yt_tv_price'
- '6_app_guide_shows_channel'
- '7_youtube_app_tv_phone'
- '8_charged_cancel_trial_free'
- '9_price_channels_month_4k'
- '10_phone_app_service_picture'
...
But for the Instagram app reviews, BERTopic's fit_transform can only produce three topics, which are really general:
- '0_app_open_video_videos'
- '1_tiktok_shorts_video_watch'
- '2_app_videos_different_popular'
In both cases, I was using the same model architecture and the same number of documents (2,500). I read around 100 random samples from both the YouTube and Instagram reviews, and I could perceive more than 12 topics/themes/recurring problems in each of them.
To sum up the above: The main problem is with the quality of clusters that HDBSCAN can produce; they are too general/broad. (I have tried different clustering models like k-Means, but with worse results).
The difficulty with the clustering algorithm is that a one-size-fits-all solution is not possible, since the quality of the clusters (for example, the number of clusters) is quite subjective. To illustrate, the number of topics found in your Instagram reviews might be too few for you but the perfect amount for someone else. The depth/broadness/specificity of topics also depends, to a large extent, on the use case and the perspective of the user.
This also relates to the input data. Just because the number of documents is the same between the two runs does not mean the distributions of the resulting embeddings are similar. Therefore, additional tuning of the clustering algorithm with respect to the input data is necessary.
If your problem is purely the number of topics created, then it is also possible to approach this a bit differently. By setting a very small min_topic_size, you are likely to generate many micro-clusters. For many, this can be a disadvantage. However, it would allow you to start with a large number of topics that you can automatically aggregate to a specified number of topics afterward.
You can do this within BERTopic or you can create a custom technique yourself. For example, merge topics until you have fewer than 50 topics, after that only merge topics if they are very similar to one another.
All in all, I think it is also a matter of diving into the clustering algorithm and what you can expect from these kinds of algorithms. For example, the number of clusters in the second example might increase if you lower min_topic_size. Just because you have the same number of documents does not mean they contain the same number of topics, and that is something for you to account for.
> All in all, I think it is also a matter of diving into the clustering algorithm and what you can expect from these kinds of algorithms.
Okay, I'll look into it. Thank you
RATIONALE FOR THE QUESTION:
I'm planning to build a GUI for users to allow them to choose any application from the Google Play Store and display the main issues (topics extracted from app reviews) that users identify in a given app.
That's why I want to create a one-size-fits-all BERTopic model that enables seamless transitions between different apps. Naturally, the model will still need to be fitted to each app's reviews, and the fitting speed will depend solely on the hardware the model runs on.