MaartenGr / KeyBERT

Minimal keyword extraction with BERT
https://MaartenGr.github.io/KeyBERT/
MIT License

Bug: possible mistake in MMR calculation #192

Open schackartk opened 6 months ago

schackartk commented 6 months ago

Hello,

If I am reading the code correctly, there is a mistake in the implementation of maximal marginal relevance (MMR) calculation.

Referring to the original publication https://doi.org/10.1145/290941.291025, the calculation is:

[image: MMR equation from the original publication]

and the code as currently implemented:

mmr = (
    1 - diversity
) * candidate_similarities - diversity * target_similarities.reshape(-1, 1)
mmr_idx = candidates_idx[np.argmax(mmr)]

assuming:

λ = 1 - diversity
Sim1(Di, Q) = candidate_similarities
max Sim2(Di, Dj) = target_similarities

and I am assuming the last point because of the code:

target_similarities = np.max(
        word_similarity[candidates_idx][:, keywords_idx], axis=1
    )

the code should be:

mmr = (1 - diversity) * (
    candidate_similarities - diversity * target_similarities.reshape(-1, 1)
)
mmr_idx = candidates_idx[np.argmax(mmr)]

So it appears to me that the diversity term (λ) is not distributed over both similarity terms as in the original equation; parentheses are needed around the difference between the similarity terms.

I would note that I have seen a similar lack of parentheses, which distribute the diversity term (λ), in other works, for example http://www.cs.bilkent.edu.tr/~canf/CS533/hwSpring14/eightMinPresentations/handoutMMR.pdf
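
For readers comparing the two parenthesizations discussed in this comment, the difference can be written out term by term. This is only a sketch: the symbols d (for diversity), R_i (for candidate_similarities) and P_i (for target_similarities) are shorthand introduced here, not notation from the paper or from KeyBERT.

```latex
% Assumed shorthand: d = diversity, R_i = candidate_similarities[i],
% P_i = target_similarities[i] (max similarity to already selected keywords).
\begin{align*}
\text{current code:}\quad  & s_i  = (1-d)\,R_i - d\,P_i \\
\text{proposed code:}\quad & s_i' = (1-d)\,(R_i - d\,P_i)
                                  = (1-d)\,R_i - d\,(1-d)\,P_i
\end{align*}
```

Under these assumptions, for 0 < d < 1 the proposed form weights the redundancy penalty by d(1 - d) rather than d, so it penalizes similarity to already selected keywords less. Because (1 - d) is then a positive constant factor, the argmax of s_i' is the same as the argmax of R_i - d P_i.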

MaartenGr commented 6 months ago

Thanks for sharing this! Coincidentally, I indeed used the following as the main source for calculating the diversity:

I would note that I have seen a similar lack of parentheses, which distribute the diversity term (λ), in other works, for example http://www.cs.bilkent.edu.tr/~canf/CS533/hwSpring14/eightMinPresentations/handoutMMR.pdf

Having said that, it might be worthwhile to test out the effect of changing the parentheses. I am quite curious to see how that would affect representation. Moreover, there are quite a number of other libraries that have MMR implemented, such as LangChain and vector database applications. I could check out what their preferred method of doing so is.
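
One way to test the effect of the parentheses is a small side-by-side comparison. The sketch below is not KeyBERT's implementation: `mmr_select`, its `distribute` flag, and the toy similarity values are all hypothetical, introduced only to show that the two scoring variants from this thread can select different keywords.

```python
import numpy as np


def mmr_select(doc_sim, word_sim, top_n, diversity, distribute=False):
    """Greedy MMR selection (hypothetical helper, not KeyBERT's API).

    doc_sim:  shape (n,), similarity of each candidate to the document.
    word_sim: shape (n, n), pairwise similarity between candidates.
    distribute=False uses the scoring quoted as "currently implemented";
    distribute=True uses the parenthesized scoring proposed in the issue.
    """
    doc_sim = np.asarray(doc_sim, dtype=float)
    word_sim = np.asarray(word_sim, dtype=float)

    # Start from the candidate most similar to the document.
    selected = [int(np.argmax(doc_sim))]
    candidates = [i for i in range(len(doc_sim)) if i != selected[0]]

    while candidates and len(selected) < top_n:
        relevance = doc_sim[candidates]
        # Highest similarity of each candidate to any selected keyword.
        redundancy = np.max(word_sim[np.ix_(candidates, selected)], axis=1)
        if distribute:
            scores = (1 - diversity) * (relevance - diversity * redundancy)
        else:
            scores = (1 - diversity) * relevance - diversity * redundancy
        best = candidates[int(np.argmax(scores))]
        selected.append(best)
        candidates.remove(best)
    return selected


# Toy data (made up): candidate 1 is relevant but close to candidate 0,
# candidate 2 is less relevant but dissimilar to both.
doc_sim = [0.9, 0.8, 0.5]
word_sim = [
    [1.0, 0.6, 0.1],
    [0.6, 1.0, 0.1],
    [0.1, 0.1, 1.0],
]

current = mmr_select(doc_sim, word_sim, top_n=2, diversity=0.5)
proposed = mmr_select(doc_sim, word_sim, top_n=2, diversity=0.5, distribute=True)
print(current, proposed)
```

On this toy input the current scoring picks the dissimilar candidate second, while the parenthesized scoring picks the more relevant but more redundant one, which matches the weaker redundancy penalty of the proposed form. Whether that difference matters on real documents would still need to be checked on actual embeddings.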