MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

semi-supervised topic modelling with multiple labels per document #1816

Open justin-boldsen opened 7 months ago

justin-boldsen commented 7 months ago

Hi there,

I have a dataset with 2000 participants who reported their most negative daily event for 90 days. Each day they answered three questions about that event: (1) an open-ended written question, "What was your most negative event today?"; (2) which categories the event belongs to (select all that apply; e.g., mental health, physical health, relationship with family); and (3) how negative the event was (a 7-point Likert scale from 1, not at all negative, to 7, very negative).

The known topic categories are rather coarse, so the aim of using BERTopic is to gain a more fine-grained understanding of the topics participants wrote about. From https://github.com/MaartenGr/BERTopic/issues/826#issuecomment-1306746031 I understand that the semi-supervised labelling does not support multiple labels.

I have also read https://github.com/MaartenGr/BERTopic/issues/1725#issuecomment-1879975094. However, I'm not sure whether there are alternatives to the suggestions provided there.

In a nutshell, I'm wondering what would be the best way to combine the above features (written response, known categories, and participant ratings) to improve the performance of BERTopic in finding topics. Any help in how to proceed would be greatly appreciated!

Thanks, Justin

MaartenGr commented 7 months ago

That's an interesting use case! Let's see if I can be of help here.

It is definitely true that at the moment BERTopic does not directly support multiple labels for semi-supervised topic modeling. You might, however, be able to approximate those. Depending on the number of unique combinations of categories that you have, you could create a unique label for them. For instance, a document labeled with mental health and physical health would get label 0 whereas a document labeled with mental health and relationship with family would get label 1.
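A minimal sketch of this label-combination idea, assuming each document comes with a list of selected categories. The helper name `encode_label_combinations` and the example data are hypothetical, and using `-1` for documents without any category follows the convention BERTopic's semi-supervised mode uses for unlabeled documents:

```python
from typing import Dict, FrozenSet, List, Sequence, Tuple


def encode_label_combinations(
    doc_categories: Sequence[Sequence[str]],
) -> Tuple[List[int], Dict[FrozenSet[str], int]]:
    """Map each unique set of categories to one integer label.

    Documents with no categories get -1 (treated as unlabeled).
    """
    combo_to_id: Dict[FrozenSet[str], int] = {}
    labels: List[int] = []
    for cats in doc_categories:
        combo = frozenset(cats)  # order of selection should not matter
        if not combo:
            labels.append(-1)
            continue
        if combo not in combo_to_id:
            combo_to_id[combo] = len(combo_to_id)
        labels.append(combo_to_id[combo])
    return labels, combo_to_id


# Hypothetical example data
doc_categories = [
    ["mental health", "physical health"],
    ["mental health", "relationship with family"],
    ["physical health", "mental health"],  # same combination, different order
    [],                                     # no category selected
]
labels, combo_to_id = encode_label_combinations(doc_categories)
print(labels)  # [0, 1, 0, -1]
```

The resulting list could then be passed as `y`, for example `topic_model.fit_transform(docs, y=labels)`. With many categories the number of unique combinations can grow quickly, so it may be worth checking `len(combo_to_id)` and how many documents fall under each combination before relying on this.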

Another "trick" is to simply add those labels to the documents themselves. For instance, add the string "Categories: mental health, physical health" as a prefix. It does require some understanding of how the underlying embedding model handles such text, and it might not work in all cases, but it could be an interesting trick to try.
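A rough sketch of the prefix trick, assuming the documents and their category lists are aligned by position. The helper name `add_category_prefix`, the exact prefix format, and the example text are illustrative assumptions:

```python
from typing import List, Sequence


def add_category_prefix(
    docs: Sequence[str],
    doc_categories: Sequence[Sequence[str]],
    prefix: str = "Categories: ",
) -> List[str]:
    """Prepend each document's categories to its text.

    Documents without categories are returned unchanged.
    """
    if len(docs) != len(doc_categories):
        raise ValueError("docs and doc_categories must have the same length")
    prefixed: List[str] = []
    for text, cats in zip(docs, doc_categories):
        if cats:
            prefixed.append(f"{prefix}{', '.join(cats)}. {text}")
        else:
            prefixed.append(text)
    return prefixed


# Hypothetical example data
docs = ["Argued with my sister about money.", "Slept badly again."]
doc_categories = [["mental health", "relationship with family"], []]
prefixed_docs = add_category_prefix(docs, doc_categories)
print(prefixed_docs[0])
# Categories: mental health, relationship with family. Argued with my sister about money.
```

One option is to compute embeddings from the prefixed texts and pass them via the `embeddings` argument of `fit_transform` while still passing the original documents, so the category words influence clustering without appearing in the topic representations. Whether that helps depends on the embedding model and data.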

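Other than that, you could also think about some covariate analysis: train on the whole corpus without labels and then analyze how the metadata (for example, the known categories and the negativity ratings) is distributed across the resulting topics.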
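As a sketch of what such a post-hoc analysis could look like, assuming `topics` is the per-document topic assignment returned by a fitted model and that ratings and categories are aligned with it. The function `summarize_topics` and the example values are hypothetical and use only the standard library:

```python
from collections import Counter, defaultdict
from statistics import mean
from typing import Any, Dict, List, Sequence


def summarize_topics(
    topics: Sequence[int],
    ratings: Sequence[int],
    doc_categories: Sequence[Sequence[str]],
) -> Dict[int, Dict[str, Any]]:
    """Per topic: document count, mean rating, and category counts."""
    ratings_by_topic: Dict[int, List[int]] = defaultdict(list)
    cats_by_topic: Dict[int, Counter] = defaultdict(Counter)
    for topic, rating, cats in zip(topics, ratings, doc_categories):
        ratings_by_topic[topic].append(rating)
        cats_by_topic[topic].update(cats)
    return {
        topic: {
            "n_docs": len(topic_ratings),
            "mean_rating": mean(topic_ratings),
            "category_counts": dict(cats_by_topic[topic]),
        }
        for topic, topic_ratings in ratings_by_topic.items()
    }


# Hypothetical example: three documents assigned to two topics
topics = [0, 0, 1]
ratings = [6, 4, 2]  # 7-point negativity ratings
doc_categories = [
    ["mental health"],
    ["mental health", "physical health"],
    ["relationship with family"],
]
summary = summarize_topics(topics, ratings, doc_categories)
print(summary[0]["mean_rating"])  # 5
```

BERTopic also provides `topics_per_class(docs, classes=...)`, which could be used to see how topic representations differ per category, although it expects a single class per document.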

justin-boldsen commented 7 months ago

Oh, wonderful! Thank you so much for your response. I'll definitely try out these methods and circle back with my feedback here. Much appreciated!