MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

Multilabel Supervised Learning #826

Closed · roelvanderburg closed 1 year ago

roelvanderburg commented 1 year ago

I'm training a supervised model where I have (for example) 100 documents and 20 topics. Some of the documents can have multiple topics assigned to them; moreover, the documents cannot be split into smaller sections containing a single topic (they are already at paragraph level).

  1. What would be the best way to feed this data into the supervised algorithm? Should I make two separate entries, feeding the same document once with topic A and again with topic B?
  2. There is a min_topic_size parameter, which I would set to 20. However, the model can actually find more topics than that. If the model finds 40 topics, can I be certain that the first 20 topics it predicts in the list of 40 are the ones I fed into the supervised algorithm?
  3. Lastly, since I want to use my model for document classification, the fact that it finds more topics than I originally trained it with is not an issue, as long as I can "project" the 40 topics onto the original 20-topic set in some way. Is going through the representative docs per topic a good way to find the original 20 topics within the new set of 40?
  4. Alternatively, I could use the topic-reduction function, but that way I cannot be certain I get back my original set.
MaartenGr commented 1 year ago

> I'm training a supervised model where I have (for example) 100 documents and 20 topics. Some of the documents can have multiple topics assigned to them; moreover, the documents cannot be split into smaller sections containing a single topic (they are already at paragraph level).

Before going into your specific questions, let me first ask: why are you not using a classifier for this? BERTopic is mainly an unsupervised model with some tricks here and there for nudging topics, but it is not fully supervised. Based on your description, I wonder whether BERTopic is actually the right model for you, especially since you already have labels for all of the documents.
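For what it's worth, a multi-label classifier on top of document embeddings would be a more direct fit for labeled data. A minimal sketch, assuming an off-the-shelf sentence-transformer and toy labels (none of this is BERTopic-specific):

```python
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# Toy paragraphs and label sets; note a document can carry several topics
docs = ["paragraph one ...", "paragraph two ...", "paragraph three ..."]
labels = [["topic_a"], ["topic_b"], ["topic_a", "topic_b"]]

# Embed the paragraphs with a pretrained sentence transformer
embedder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = embedder.encode(docs)

# Binarize the label sets and train one binary classifier per topic
mlb = MultiLabelBinarizer()
y = mlb.fit_transform(labels)
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(embeddings, y)

# Predict a (possibly multi-topic) label set for a new paragraph
new_embedding = embedder.encode(["a new paragraph ..."])
print(mlb.inverse_transform(clf.predict(new_embedding)))
```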

> What would be the best way to feed this data into the supervised algorithm? Should I make two separate entries, feeding the same document once with topic A and again with topic B?

The semi-supervised approach in BERTopic nudges topics towards their labels but nothing more, so there is no guarantee it will actually find the exact topics that you mentioned. Moreover, it does not support multi-label assignments the way you describe. However, you can set calculate_probabilities=True to create a document-topic probability matrix, which allows you to assign more than a single topic to a document.
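A minimal sketch of that route, using a public dataset as a stand-in for your paragraphs (the 0.1 threshold is an arbitrary choice for illustration, not a BERTopic default):

```python
import numpy as np
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

# Stand-in data; replace with your paragraphs and labels (-1 = unlabeled)
data = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes"))
docs, y = data["data"], data["target"]

# Passing y makes the fit semi-supervised; calculate_probabilities=True
# produces a full document-topic probability matrix
topic_model = BERTopic(calculate_probabilities=True)
topics, probs = topic_model.fit_transform(docs, y=y)

# probs has shape (n_docs, n_topics); assign every topic whose
# probability clears the threshold to get multi-label assignments
multi_labels = [np.where(p >= 0.1)[0] for p in probs]
```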

> There is a min_topic_size parameter, which I would set to 20. However, the model can actually find more topics than that. If the model finds 40 topics, can I be certain that the first 20 topics it predicts in the list of 40 are the ones I fed into the supervised algorithm?

Instead of HDBSCAN, you can use a clustering algorithm that lets you set the number of topics directly. I would advise starting with k-Means and setting the number of clusters to 20.
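That swap is a two-liner, since a cluster model exposing .fit and .predict can be passed in through the hdbscan_model parameter:

```python
from bertopic import BERTopic
from sklearn.cluster import KMeans

# Fix the number of topics at 20 by clustering with k-Means
cluster_model = KMeans(n_clusters=20, random_state=42)
topic_model = BERTopic(hdbscan_model=cluster_model)
```

Note that the probability matrix mentioned above relies on HDBSCAN's soft clustering, so it would not be available in this setup.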

> Lastly, since I want to use my model for document classification, the fact that it finds more topics than I originally trained it with is not an issue, as long as I can "project" the 40 topics onto the original 20-topic set in some way. Is going through the representative docs per topic a good way to find the original 20 topics within the new set of 40?

You can use hierarchical topic modeling to see how topics can best be merged together. Then you can manually merge topics following this guide.
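Continuing the earlier sketch, that workflow could look like this (the topic ids in topics_to_merge are placeholders for whatever the hierarchy suggests):

```python
# Explore how the discovered topics relate to one another
hierarchical_topics = topic_model.hierarchical_topics(docs)
topic_model.visualize_hierarchy(hierarchical_topics=hierarchical_topics)

# Manually merge topics that belong together, e.g. topics 1 and 3
# into one topic and topics 4 and 7 into another
topic_model.merge_topics(docs, topics_to_merge=[[1, 3], [4, 7]])
```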

> Alternatively, I could use the topic-reduction function, but that way I cannot be certain I get back my original set.

You can indeed reduce the number of topics automatically, but as you mentioned, there is no guarantee that you get back your original set, since the reduction does not take that into account. I think doing it manually would be preferred.
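For completeness, the automatic route is a one-liner on a fitted model:

```python
# Shrink a fitted model to 20 topics; which topics survive is driven
# by topic similarity, not by your original label set
topic_model.reduce_topics(docs, nr_topics=20)
```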