MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License
6.03k stars 756 forks

Guided topic model with pre embedded `seed_topic_list` #2016

Open 1jamesthompson1 opened 4 months ago

1jamesthompson1 commented 4 months ago

This issue is about a problem similar to the one addressed in #2014. I can update and merge

I would like to run a guided topic model with an embedding model that is not supported by BERTopic. I would also like to test some hyperparameters without having to recompute the embeddings each time. To support this, I would like to be able to pass a pre-embedded `seed_topic_list`.

What I want to be able to do is something like this:

import numpy as np
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))["data"]

seed_topic_list = [["drug", "cancer", "drugs", "doctor"],
                   ["windows", "drive", "dos", "file"],
                   ["space", "launch", "orbit", "lunar"]]

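# Placeholder for precomputed seed-topic embeddings: one 1024-dimensional row per seed topic
embedded_seed_topic_list = np.random.rand(len(seed_topic_list), 1024)

# `embedded_seed_topic_list` is the proposed new argument, not part of the current BERTopic API
topic_model = BERTopic(
    seed_topic_list=seed_topic_list,
    embedded_seed_topic_list=embedded_seed_topic_list,
    verbose=True)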

topics, probs = topic_model.fit_transform(docs)

As with #2014, I am happy to write up the simple change: adding another argument so that BERTopic can check whether the embeddings are already present before trying to embed the `seed_topic_list`.
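
To make the proposed check concrete, here is a rough sketch of the logic I have in mind. It is not BERTopic code: `resolve_seed_embeddings`, `embed_fn`, and `toy_embed` are hypothetical names used only for illustration, and joining each seed topic's words into one string before embedding is an assumption about how the seeds would be embedded.

```python
import numpy as np


def resolve_seed_embeddings(seed_topic_list, embed_fn, precomputed=None):
    """Return one embedding row per seed topic, embedding only when needed.

    Hypothetical helper: `embed_fn` stands in for whatever embedding model
    is in use and must map a list of strings to a 2-D array.
    """
    if precomputed is not None:
        precomputed = np.asarray(precomputed)
        # Precomputed embeddings must have exactly one row per seed topic.
        if precomputed.ndim != 2 or precomputed.shape[0] != len(seed_topic_list):
            raise ValueError(
                "Expected precomputed embeddings of shape "
                f"({len(seed_topic_list)}, dim), got {precomputed.shape}"
            )
        return precomputed

    # Assumption: each seed topic is embedded as a single space-joined string.
    seed_docs = [" ".join(words) for words in seed_topic_list]
    return np.asarray(embed_fn(seed_docs))


def toy_embed(texts, dim=8):
    """Stand-in embedder that returns deterministic vectors (not a real model)."""
    return np.array([np.full(dim, float(len(text))) for text in texts])


seed_topic_list = [["drug", "cancer", "drugs", "doctor"],
                   ["windows", "drive", "dos", "file"],
                   ["space", "launch", "orbit", "lunar"]]

# First call computes the embeddings; second call reuses them without embedding.
computed = resolve_seed_embeddings(seed_topic_list, toy_embed)
reused = resolve_seed_embeddings(seed_topic_list, toy_embed, precomputed=computed)
```

With something like this in place, the seed embeddings could be computed once with any external model, saved, and passed back in on later runs while other hyperparameters change.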

MaartenGr commented 4 months ago

A good request for which I have the same answer as #2014 since for me they touch upon the same underlying issue. I'm okay with keeping this issue open for others and continuing the discussion in the other issue.