MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

MMR doesn't work / doesn't make a change #1654

Open · daianacric95 opened this issue 11 months ago

daianacric95 commented 11 months ago

Hi Maarten,

Thank you once again for this amazing package. I used it for my master's thesis and several projects for my job at the university, and it's a lifesaver compared to other topic modeling techniques I tried.

That being said, I ran the model on about 200k tweets, and many topics contain a lot of repeating words. I used the following code to add a representation model and tried different diversity values (from 0.3 to 0.9), but the results are still the same.

from bertopic import BERTopic
from bertopic.representation import MaximalMarginalRelevance
from umap import UMAP

mmr = MaximalMarginalRelevance(diversity=0.9)
representation_model = {
    "mmr": mmr
}

topic_model = BERTopic(
    umap_model=UMAP(),
    hdbscan_model=hdbscan_model,        # defined earlier
    vectorizer_model=vectorizer_model,  # defined earlier
    embedding_model=embedding_model,    # defined earlier
    top_n_words=10,
    language='english',
    verbose=True,
    representation_model=representation_model,
)

Here are a couple of examples:

Representation: australia,australias,australian,auspol,renewable,nsw,energy,government,queensland,coal

MMR: australia,australias,australian,auspol,renewable,nsw,energy,government,queensland,coal

Not only are the results the same, but plurals such as australia and australias are not merged. Could you please guide me on how to proceed?

MaartenGr commented 11 months ago

That is a result of how MMR works! In practice, it diversifies a larger set of candidate words, for instance 30, down to a smaller set, for instance 10. This means that when you set top_n_words to 10, only 10 keywords are passed to MMR, and diversifying 10 keywords into 10 keywords does nothing. In other words, set top_n_words in BERTopic to a higher value, like 30, so that MMR can actually diversify that candidate set into a smaller set of words.
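
A minimal sketch of that fix, assuming the top_n_words parameter on MaximalMarginalRelevance (which defaults to 10) controls how many keywords MMR selects: BERTopic extracts 30 candidate keywords per topic, and MMR diversifies them down to 10.

from bertopic import BERTopic
from bertopic.representation import MaximalMarginalRelevance

# Hand MMR a larger candidate pool so it has something to diversify;
# it then selects a diverse subset (10 by default) from those candidates.
mmr = MaximalMarginalRelevance(diversity=0.9, top_n_words=10)

topic_model = BERTopic(
    top_n_words=30,  # candidates passed to the representation model
    representation_model={"mmr": mmr},
)

With 30 candidates and 10 slots, MMR can trade relevance against redundancy, so near-duplicates like australia/australias/australian should be less likely to all survive into the final representation.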