Request: Zeroshot option to assign unassigned documents to outliers rather than reclustering

MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.

https://maartengr.github.io/BERTopic/

MIT License


Request: Zeroshot option to assign unassigned documents to outliers rather than reclustering #1958

Open zilch42 opened 6 months ago

zilch42 commented 6 months ago

Hi Maarten,

I have a use case at the moment where I'm using zero shot topic modeling to assign documents to a list of known clusters. I'm not really interested in finding other unknown clusters in the data, but I do know that there will be some documents that don't match anything and I would just like them to be outliers.

At the moment, the zero-shot workflow sends any documents that don't match a zero-shot topic above a certain threshold into a pool that is run through the standard BERTopic pipeline. That's useful in some cases, but not in others. One issue I have encountered is that if only a few documents don't fit into a topic (e.g. 4), UMAP can't handle it and raises an error (the same error as in #1900, when I was trying to visualise only 4 topics).

Could we have an option in zeroshot to determine where to direct documents that fall below zeroshot_min_similarity? Either to outliers or to reclustering?

Cheers

MaartenGr commented 6 months ago

Thanks for the suggestion!

I want to prevent adding any parameter to the init of BERTopic as that would further complicate using the model. Having said that, I think you can already do what you suggested as follows:

1. Train a zero-shot topic model and set zeroshot_min_similarity=0 to make sure that all documents are assigned to the zero-shot topics. This will prevent clustering.
2. Use the resulting probabilities (.probabilities_) to select only the documents that exceed your specified threshold, i.e. the threshold you would normally use in zero-shot topic modeling. Retain the topic label of documents that exceed the threshold and set the label of documents that do not to -1. In essence, you are creating .topics_.
3. Finally, use manual BERTopic to model your newly created topics.
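The relabeling in the second step could look roughly like the sketch below. It uses only NumPy and assumes you already have one topic label and one similarity score per document (for example a 1-D array of per-document scores). The helper name `relabel_below_threshold` and the sample values are hypothetical and not part of BERTopic.

```python
import numpy as np

def relabel_below_threshold(topics, similarities, min_similarity):
    """Keep a document's topic if its score meets the threshold, else label it -1.

    Hypothetical helper for illustration; not a BERTopic function.
    """
    topics = np.asarray(topics)
    similarities = np.asarray(similarities, dtype=float)
    # NaN scores compare as False, so documents without a score become outliers.
    keep = similarities >= min_similarity
    return np.where(keep, topics, -1).tolist()

# Made-up labels and scores for four documents.
topics = [0, 2, 1, 0]
similarities = [0.91, 0.42, 0.77, 0.55]

new_topics = relabel_below_threshold(topics, similarities, min_similarity=0.6)
print(new_topics)
```

The resulting list can then serve as the labels for the manual topic modeling mentioned in the last step.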

zilch42 commented 6 months ago

Thanks Maarten,

That's more or less what I'm doing at the moment, except that zeroshot doesn't actually assign the probabilities, so topic_model.probabilities_ is nan and I'm recalculating the zeroshot topic embeddings and the cosine similarities myself. That's not a big deal as it doesn't take long, but it would make sense for the max cosine similarity to be saved in probabilities_, as that is basically what they are. It's probably a one-liner to add if you'd like a PR.
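For reference, the recalculation is roughly the following sketch. It assumes the document embeddings and the zero-shot topic embeddings are already available as 2-D arrays with non-zero rows. The function name `max_cosine_similarity` and the toy vectors are purely illustrative and not part of BERTopic.

```python
import numpy as np

def max_cosine_similarity(doc_embeddings, topic_embeddings):
    """Return the best-matching topic index and its cosine similarity per document.

    Illustrative helper; assumes no embedding is an all-zero vector.
    """
    docs = np.asarray(doc_embeddings, dtype=float)
    tops = np.asarray(topic_embeddings, dtype=float)
    # Normalise rows to unit length so the dot product equals cosine similarity.
    docs = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    tops = tops / np.linalg.norm(tops, axis=1, keepdims=True)
    sims = docs @ tops.T  # shape: (n_docs, n_topics)
    return sims.argmax(axis=1), sims.max(axis=1)

# Toy 2-D embeddings: three documents, two zero-shot topics.
doc_embeddings = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 1.0]])
topic_embeddings = np.array([[2.0, 0.0], [0.0, 1.0]])

best_topic, best_sim = max_cosine_similarity(doc_embeddings, topic_embeddings)
print(best_topic.tolist(), best_sim.round(3).tolist())
```

The second array holds the values I would expect to find in probabilities_, and comparing it against the threshold decides which documents become outliers.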

MaartenGr commented 6 months ago

> That's more or less what I'm doing at the moment, except that zeroshot doesn't actually assign the probabilities so topic_model.probabilities_ is nan so I'm recalculating the zeroshot topic embeddings and the cosine similarities myself.

Are you using .transform for that? That way, you wouldn't have to do anything outside of BERTopic.

> That's not a big deal as it doesn't take long, but it would make sense for the max cosine similarity to be saved in probabilities_ as that is basically what they are. It's probably a one liner to add if you'd like a PR.

Not sure if I understand what you mean. Do you mean calculating the probabilities already during zero-shot topic modeling? That should indeed be straightforward.