MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

MMR doesn't work / doesn't make a change #1654

Open daianacric95 opened 7 months ago

daianacric95 commented 7 months ago

Hi Maarten,

Thank you once again for this amazing package. I used it for my master's thesis and several projects for my job at the university, and it's a lifesaver compared to other topic modeling techniques I tried.

That being said, I ran the model on about 200k tweets and many topics contain a lot of repeating words. I used the following code to add a representation model with different diversity values (from 0.3 up to 0.9), but the results are still the same.

from umap import UMAP
from bertopic import BERTopic
from bertopic.representation import MaximalMarginalRelevance

# hdbscan_model, vectorizer_model, and embedding_model are defined earlier
mmr = MaximalMarginalRelevance(diversity=.9)
representation_model = {"mmr": mmr}

topic_model = BERTopic(
    umap_model=UMAP(),
    hdbscan_model=hdbscan_model,
    vectorizer_model=vectorizer_model,
    embedding_model=embedding_model,
    top_n_words=10,
    language='english',
    verbose=True,
    representation_model=representation_model,
)

Here are a couple of examples:

Representation: australia,australias,australian,auspol,renewable,nsw,energy,government,queensland,coal
MMR: australia,australias,australian,auspol,renewable,nsw,energy,government,queensland,coal

Not only are the results identical, but near-duplicates such as australia, australias, and australian are not merged. Could you please guide me on how to proceed?

MaartenGr commented 7 months ago

That is a result of how MMR works! In practice, it diversifies a larger set of candidate keywords, for instance 30, down to a smaller set, for instance 10. When you set top_n_words to 10, BERTopic hands only 10 keywords to MMR, and diversifying 10 keywords into 10 keywords does nothing. In other words, set top_n_words in BERTopic to a higher value, like 30, so that MMR can actually diversify a larger set of words into a smaller one.
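
A minimal sketch of that change, assuming the same sub-models as in your snippet (MMR's own top_n_words, 10 by default, is the size of the final diversified set):

from umap import UMAP
from bertopic import BERTopic
from bertopic.representation import MaximalMarginalRelevance

# Diversify 30 candidate keywords per topic down to MMR's top_n_words (10)
mmr = MaximalMarginalRelevance(diversity=.9, top_n_words=10)

topic_model = BERTopic(
    umap_model=UMAP(),
    hdbscan_model=hdbscan_model,      # defined as in your snippet
    vectorizer_model=vectorizer_model,
    embedding_model=embedding_model,
    top_n_words=30,  # hand 30 candidates to MMR instead of 10
    language='english',
    verbose=True,
    representation_model={"mmr": mmr},
)

Note that with the dictionary form, the diversified keywords appear as the "mmr" aspect in get_topic_info() while the main representation stays c-TF-IDF; pass representation_model=mmr directly if you want MMR to replace the main representation.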