MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License
5.99k stars, 752 forks

Add multi-label support for topics_per_class #872

Open lorsanta opened 1 year ago

lorsanta commented 1 year ago

Hi! I'm working with a multi-label dataset, and I'm trying to use the topics_per_class function. However, I noticed that the function only supports single labels. It would be great if the function could support multi-label datasets as well.

One option would be to add an optional argument called problem_type, which could be set to either "multi-label" or "single-label". Another would be to check whether classes[0] is a list and change the behavior of the function based on that.
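To illustrate the second idea, here is a minimal sketch of type-based detection. The helper names is_multi_label and normalize_classes are hypothetical and not part of BERTopic; the sketch assumes every entry in classes has the same shape as the first one.

```python
def is_multi_label(classes):
    """Return True if the first entry is a collection of labels.

    Hypothetical helper: it only inspects the first entry, so it assumes
    all entries share the same shape. Strings count as single labels.
    """
    for entry in classes:
        return isinstance(entry, (list, tuple, set, frozenset))
    return False  # empty input: treat as single-label


def normalize_classes(classes):
    """Wrap single labels in one-element lists so later code sees one shape."""
    if is_multi_label(classes):
        return [list(entry) for entry in classes]
    return [[entry] for entry in classes]
```

With something like this, single-label input such as ["sports", "politics"] becomes [["sports"], ["politics"]], so the same membership-based selection could serve both cases without an extra problem_type argument.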

Personally, to make it work, I replaced the lines:

https://github.com/MaartenGr/BERTopic/blob/845d423bdef44a4a68fc0b1c9362f97237035d3c/bertopic/_bertopic.py#L769-L772

with

  labels_list = {label for labels_article in classes for label in labels_article}
  for _, class_ in tqdm(enumerate(labels_list), disable=not topic_model.verbose):

      # Calculate c-TF-IDF representation for a specific class
      selection = documents[documents.Class.apply(lambda c: class_ in c)]
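The effect of this membership-based selection can be shown without pandas or BERTopic. The following is only a sketch with made-up toy data; group_by_label is a hypothetical name, not a library function.

```python
from collections import defaultdict

# Toy data (illustrative only): each document carries a list of labels.
docs = ["doc a", "doc b", "doc c"]
classes = [["sports", "politics"], ["politics"], ["sports"]]


def group_by_label(docs, classes):
    """Collect, for each label, every document whose label list contains it."""
    groups = defaultdict(list)
    for doc, labels in zip(docs, classes):
        for label in set(labels):  # set() guards against repeated labels
            groups[label].append(doc)
    return dict(groups)


groups = group_by_label(docs, classes)
```

Note that "doc a" ends up in both the "sports" and the "politics" group, so per-class document counts no longer sum to the total number of documents.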

MaartenGr commented 1 year ago

Thanks for sharing this! I will have to check whether additional sub-selection is necessary, or whether some users might prefer a combination of classes. The visualization might also need additional updates.
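One possible reading of "a combination of classes" is to treat each distinct set of labels as its own class. The sketch below assumes that interpretation; group_by_combination is a hypothetical name and does not reflect anything in BERTopic.

```python
from collections import defaultdict


def group_by_combination(docs, classes):
    """Group documents by their full label combination, ignoring label order."""
    groups = defaultdict(list)
    for doc, labels in zip(docs, classes):
        key = tuple(sorted(set(labels)))  # order-independent, hashable key
        groups[key].append(doc)
    return dict(groups)
```

Under this scheme every document belongs to exactly one group, at the cost of potentially many small groups when label combinations are diverse.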