MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

Base LLM topic representation on results from other representation model (e.g. KeyBERT or POS) instead of the default representation #2184

Closed by Hveemos 1 month ago

Hveemos commented 1 month ago

Feature request

Possibility to change the default representation on which OpenAI bases its response.

Motivation

As of now, when I create a topic representation with OpenAI, it bases the prompt on a couple of representative documents and the default representation keywords. In my case, those keywords are often not useful. However, I found both KeyBERTInspired and PartOfSpeech quite useful, so I would like to base the prompt on either of those instead.

Your contribution

I think this feature should be easy to implement if you know your way around this library (which I don't). So, I'm afraid I won't be of much help...

MaartenGr commented 1 month ago

Thank you for sharing this feature request! Note that it already does this if you set the representations to be the main ones. So doing representation_model=KeyBERT should already use the KeyBERTInspired keywords. Admittedly, it does not use the additional aspects, which you could otherwise choose and play around with. That would certainly be nice to use!

Hveemos commented 1 month ago

OK, so I was just reading the Multiple Representations instructions and realized that I could set my default representation by just giving the model the correct name, "Main", i.e.:

from bertopic import BERTopic
from bertopic.representation import (
    KeyBERTInspired,
    MaximalMarginalRelevance,
    OpenAI,
    PartOfSpeech,
)

# Assumed to be defined earlier: client, prompt, embedding_model,
# vectorizer_model, hdbscan_model, umap_model_5D

# KeyBERT
keybert_model = KeyBERTInspired()

# Part-of-Speech
pos_model = PartOfSpeech("sv_core_news_sm")

# MMR
mmr_model = MaximalMarginalRelevance(diversity=0.3)

# GPT-4o
openai_model = OpenAI(client, model="gpt-4o", exponential_backoff=True, chat=True, prompt=prompt)

# All representation models; the "Main" entry replaces the default representation
representation_model = {
    "Main": keybert_model,
    "OpenAI": openai_model,
    "MMR": mmr_model,
    "POS": pos_model,
}

topic_model = BERTopic(
    # Pipeline models
    embedding_model=embedding_model,
    representation_model=representation_model,
    vectorizer_model=vectorizer_model,
    hdbscan_model=hdbscan_model,
    umap_model=umap_model_5D,

    # Hyperparameters
    top_n_words=10,
    verbose=True,
    language="swedish",
    n_gram_range=(1, 2),
)
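The prompt passed to OpenAI(...) above is not shown in this thread. As a rough sketch only: BERTopic's OpenAI representation accepts a prompt containing the [KEYWORDS] and [DOCUMENTS] tags, which are filled in per topic. The prompt wording, the fill_prompt helper, and the sample keywords and document below are all made up for illustration; the helper is a simplified stand-in for the substitution step, not the library's own code.

```python
# Hypothetical prompt text; only the [KEYWORDS] and [DOCUMENTS] tags
# are part of BERTopic's prompt convention.
prompt = """
I have a topic that is described by the following keywords: [KEYWORDS]
The topic contains the following documents:
[DOCUMENTS]

Based on the information above, give a short label for the topic.
"""


def fill_prompt(template: str, keywords: list[str], documents: list[str]) -> str:
    """Illustrative stand-in for the per-topic substitution (not BERTopic's code)."""
    doc_block = "\n".join(f"- {doc}" for doc in documents)
    return (
        template
        .replace("[KEYWORDS]", ", ".join(keywords))
        .replace("[DOCUMENTS]", doc_block)
    )


# Made-up keywords standing in for what a "Main" KeyBERTInspired model might return
example = fill_prompt(
    prompt,
    keywords=["elbil", "laddning"],
    documents=["Elbilar behöver fler laddstationer."],
)
print(example)
```

Printing the filled prompt like this is a cheap way to see which keywords would reach the LLM before spending API calls.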

So, thank you. That'll be all!