MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

Zero shot topic model with pre embedded zero shot topics #2014

Open 1jamesthompson1 opened 1 month ago

1jamesthompson1 commented 1 month ago

As a preface: I have tried to read through the current issues, and I don't think any of them raises what I am after. Issues like https://github.com/MaartenGr/BERTopic/issues/2011 sound promising but are about something different. I apologise if this has already been discussed!

I would like to try out BERTopic's zero-shot modelling while using a proprietary embedding model (voyageai). Therefore I need to give BERTopic the embeddings for both the documents and the zero-shot topics.

An example would be something like this:

import numpy as np
from bertopic import BERTopic
from datasets import load_dataset

dataset = load_dataset("CShorten/ML-ArXiv-Papers")["train"]
docs = dataset["abstract"][:5_000]

zeroshot_topic_list = ["Clustering", "Topic Modeling", "Large Language Models"]

# Random placeholders standing in for embeddings computed by the external model
zeroshot_topic_list_embeddings = np.random.rand(len(zeroshot_topic_list), 1024).astype(np.float32)
document_embeddings = np.random.rand(len(docs), 1024).astype(np.float32)

topic_model = BERTopic(
    embedding_model=None,
    min_topic_size=5,
    zeroshot_topic_list=zeroshot_topic_list,
    # Proposed new argument; it does not exist in BERTopic today
    embedded_zeroshot_topic_list=zeroshot_topic_list_embeddings,
    zeroshot_min_similarity=0.85,
)

topics, _ = topic_model.fit_transform(docs, document_embeddings)

topic_model.get_topic_info()

Am I missing something about how BERTopic and zero-shot models should work? If not, I am happy to make a PR with what seem to be the small changes that are needed.

Potential solution: I have had a look through _bertopic.py and it seems to be a relatively straightforward change. It seems that here it could just be passed the given zero-shot topic embeddings, which would come from another init argument. Beyond that, a few other changes would be needed, for example to the _is_zeroshot() method.
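To make the idea concrete, here is a rough sketch of the matching step as I understand it, not BERTopic's actual code. It assumes zero-shot assignment boils down to cosine similarity between document and topic embeddings, thresholded by zeroshot_min_similarity. The function name assign_zeroshot and the use of -1 for "no match" are my own placeholders.

```python
import numpy as np


def assign_zeroshot(doc_embeddings, topic_embeddings, min_similarity=0.85):
    """Return the index of the best-matching zero-shot topic per document.

    Documents whose best cosine similarity is below `min_similarity` get -1,
    meaning they would be left for the regular clustering step.
    """
    # Normalise rows so that the dot product equals cosine similarity
    docs = doc_embeddings / np.linalg.norm(doc_embeddings, axis=1, keepdims=True)
    topics = topic_embeddings / np.linalg.norm(topic_embeddings, axis=1, keepdims=True)

    similarity = docs @ topics.T          # shape: (n_docs, n_topics)
    best_topic = similarity.argmax(axis=1)
    best_score = similarity.max(axis=1)

    return np.where(best_score >= min_similarity, best_topic, -1)


# Tiny example with 2-dimensional embeddings
doc_embeddings = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
topic_embeddings = np.array([[1.0, 0.0], [0.0, 1.0]])
assignments = assign_zeroshot(doc_embeddings, topic_embeddings)
```

Nothing in this step needs the embedding model itself, only the two embedding matrices, which is why passing pre-embedded topics looks feasible to me.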

MaartenGr commented 1 month ago

Hmmm, this is a bit tricky from a maintainer/user experience perspective because I want to keep the scope of the parameters as small as possible in order to create an easy experience. This does mean that I would like to prevent adding the embedded_zeroshot_topic_list as that would further increase the parameter space. The difficulty for me here is that it is a rather small and niche use case that does not affect most users.

Don't get me wrong, having the functionality would definitely be nice... Although not ideal, you could create a custom backend yourself and use that instead.
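A minimal sketch of such a custom backend could look like the following. It assumes the vectors were computed elsewhere (for example with voyageai) and are simply looked up by text. BaseEmbedder and its embed(documents, verbose) method are part of BERTopic; the class name PrecomputedEmbedder and the lookup dictionary are illustrative only. The try/except is there only so the sketch also runs where BERTopic is not installed.

```python
import numpy as np

try:
    from bertopic.backend import BaseEmbedder
except ImportError:
    class BaseEmbedder:  # stand-in so the sketch runs without BERTopic installed
        pass


class PrecomputedEmbedder(BaseEmbedder):
    """Hypothetical backend that serves embeddings computed outside BERTopic."""

    def __init__(self, lookup):
        super().__init__()
        # Maps each text (e.g. a zero-shot topic label) to its 1-D vector
        self.lookup = lookup

    def embed(self, documents, verbose=False):
        missing = [doc for doc in documents if doc not in self.lookup]
        if missing:
            raise KeyError(f"No precomputed embedding for: {missing[:3]}")
        # Stack the 1-D vectors into an (n_documents, dim) array
        return np.vstack([self.lookup[doc] for doc in documents])


zeroshot_topic_list = ["Clustering", "Topic Modeling", "Large Language Models"]

# Random placeholders; in practice these come from the external embedding model
topic_vectors = np.random.rand(len(zeroshot_topic_list), 1024).astype(np.float32)
embedder = PrecomputedEmbedder(dict(zip(zeroshot_topic_list, topic_vectors)))

# Intended use (not executed here):
# topic_model = BERTopic(
#     embedding_model=embedder,
#     zeroshot_topic_list=zeroshot_topic_list,
#     zeroshot_min_similarity=0.85,
# )
# topics, _ = topic_model.fit_transform(docs, document_embeddings)
```

The document embeddings are still passed to fit_transform directly, so the backend only has to cover the zero-shot topic labels. Be aware that other components, such as representation models that embed words, may call embed with text that is not in the lookup, which raises a KeyError in this sketch.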