Open zilch42 opened 6 months ago
Thanks for the suggestion!
I want to prevent adding any parameter to the init of BERTopic as that would further complicate using the model. Having said that, I think you can already do what you suggested as follows:
1. Set `zeroshot_min_similarity=0` to make sure that all documents are assigned to the zero-shot topics. This will prevent clustering.
2. Use the probabilities (`.probabilities_`) to select only the documents that exceed your specified threshold, i.e. the threshold you would normally use in zero-shot topic modeling.
3. Retain the topic label of documents that exceed the threshold, and set the label of documents that do not exceed it to -1.

In essence, you are creating `.topics_`.
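A minimal sketch of the relabeling step, assuming a per-document similarity score is already available. The function name `relabel_below_threshold` and the example values are hypothetical and not part of BERTopic:

```python
def relabel_below_threshold(topics, scores, threshold):
    """Keep a document's topic when its score meets the threshold, else use -1."""
    return [
        topic if score >= threshold else -1
        for topic, score in zip(topics, scores)
    ]


# Hypothetical example values, not taken from a real model.
topics = [0, 2, 1, 0]
scores = [0.91, 0.42, 0.77, 0.30]

new_topics = relabel_below_threshold(topics, scores, threshold=0.7)
print(new_topics)  # [0, -1, 1, -1]
```

A score of `nan` compares as false against any threshold, so such documents would also end up labeled -1 in this sketch.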
Thanks Maarten,
That's more or less what I'm doing at the moment, except that zeroshot doesn't actually assign the probabilities (`topic_model.probabilities_` is `nan`), so I'm recalculating the zeroshot topic embeddings and the cosine similarities myself. That's not a big deal as it doesn't take long, but it would make sense for the max cosine similarity to be saved in `probabilities_`, as that is basically what they are. It's probably a one-liner to add if you'd like a PR.
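For illustration, the recalculation could look roughly like the sketch below. It assumes the document embeddings and the zero-shot topic embeddings are already available as non-zero vectors. The function name `max_cosine_similarity` and the toy embeddings are hypothetical, and this is not how BERTopic computes anything internally:

```python
import numpy as np


def max_cosine_similarity(doc_embeddings, topic_embeddings):
    """Return, per document, the index of the closest topic and its cosine similarity."""
    docs = np.asarray(doc_embeddings, dtype=float)
    tops = np.asarray(topic_embeddings, dtype=float)
    # Normalize rows to unit length so the dot product equals cosine similarity.
    docs = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    tops = tops / np.linalg.norm(tops, axis=1, keepdims=True)
    sims = docs @ tops.T  # shape: (n_docs, n_topics)
    return sims.argmax(axis=1), sims.max(axis=1)


# Toy 2-dimensional embeddings, purely for demonstration.
doc_embeddings = [[1.0, 0.0], [0.0, 2.0], [2.0, 1.0]]
topic_embeddings = [[1.0, 0.0], [0.0, 1.0]]

best_topic, best_score = max_cosine_similarity(doc_embeddings, topic_embeddings)
```

The returned scores are the kind of per-document value that could then be compared against a similarity threshold.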
> That's more or less what I'm doing at the moment, except that zeroshot doesn't actually assign the probabilities (`topic_model.probabilities_` is `nan`), so I'm recalculating the zeroshot topic embeddings and the cosine similarities myself.
Are you using `.transform` for that? That way, you wouldn't have to do anything outside of BERTopic.
> That's not a big deal as it doesn't take long, but it would make sense for the max cosine similarity to be saved in `probabilities_`, as that is basically what they are. It's probably a one-liner to add if you'd like a PR.
I'm not sure I understand what you mean. Do you mean calculating the probabilities already during zero-shot topic modeling? That should indeed be straightforward.
Hi Maarten,
I have a use case at the moment where I'm using zero shot topic modeling to assign documents to a list of known clusters. I'm not really interested in finding other unknown clusters in the data, but I do know that there will be some documents that don't match anything and I would just like them to be outliers.
At the moment, the workflow for zeroshot is that any documents that don't match a zeroshot topic above a certain threshold go into a pool to be run through the standard BERTopic pipeline. That's useful in some cases, but not others. One issue that I have encountered is that if only a few docs don't fit into a topic (e.g. 4), UMAP can't handle it and produces an error (the same error as in #1900, when I was trying to visualise only 4 topics).
Could we have an option in zeroshot to determine where to direct documents that fall below `zeroshot_min_similarity`? Either to outliers or to reclustering?

Cheers