dorianbrown / rank_bm25

A Collection of BM25 Algorithms in Python
Apache License 2.0

Made changes to BM25T #9

Closed: raoashish10 closed this 2 years ago

raoashish10 commented 4 years ago

I have made some changes to the BM25T class based on the paper mentioned in the README. However, I am not completely sure these changes are correct, because the paper is ambiguous in places. I saw that you hadn't implemented certain things, for example: [image]

One thing that bugged me was the argmin function: I couldn't work out from the paper what its output should signify or where it should be used. So I replaced argmin with a plain min, assuming that argmin was the index into the list of k' values.

Let me know if I have gone wrong anywhere.

dorianbrown commented 4 years ago

That was quick, thanks!

I don't have time to review it at the moment, but I should be able to look at it either this week or next week.

raoashish10 commented 4 years ago

Sure, no problem! Let me know if I have made any mistakes.


dorianbrown commented 4 years ago

Sorry for the slight delay, but I finally managed to take a look at it and compare it to the description in the paper.

The strategy of this method is to calculate the parameter k1 instead of choosing one. This is done by finding the k1 that minimizes equation (16). That minimization problem has to be solved numerically (the paper uses Newton-Raphson) for each term t in the query q, so we end up with a k1(t) to use in equation (5).
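To make the per-term minimization concrete, here is a rough sketch of Newton-Raphson applied to a one-dimensional objective. Note that `toy_objective` is a hypothetical stand-in and not equation (16) from the paper, the `targets` values are made up, and `newton_minimize` is an illustrative helper rather than anything in rank_bm25. The derivatives are approximated with finite differences, which is just one possible choice.

```python
import math


def newton_minimize(f, x0, tol=1e-8, max_iter=50, h=1e-4):
    """Minimize a smooth 1-D function with Newton-Raphson on f'(x) = 0.

    First and second derivatives are approximated by central differences.
    """
    x = x0
    for _ in range(max_iter):
        d1 = (f(x + h) - f(x - h)) / (2 * h)            # approx. f'(x)
        d2 = (f(x + h) - 2 * f(x) + f(x - h)) / (h * h)  # approx. f''(x)
        if d2 <= 0:
            # Not locally convex: a plain Newton step is not trustworthy here.
            break
        step = d1 / d2
        x -= step
        if abs(step) < tol:
            break
    return x


def toy_objective(k1, target):
    # Hypothetical stand-in for the paper's objective (NOT equation (16)).
    return (math.log(k1) - target) ** 2


# Made-up per-term statistics; in BM25T these would come from the corpus.
targets = {"cat": 0.5, "dog": 1.0}

# One minimization per query term, giving a term-specific k1(t).
k1_per_term = {
    term: newton_minimize(lambda k, c=c: toy_objective(k, c), x0=1.0)
    for term, c in targets.items()
}
```

The point is the shape of the computation: one numerical solve per query term, and the result that gets carried forward is the minimizing k1, not the objective's value.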

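So replacing the argmin with min really changes the algorithm: min returns the smallest value of the expression, whereas argmin returns the k1 at which that smallest value is reached. Does that help make things a little clearer?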
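As a small illustration of the difference, here is a sketch over a grid of candidate values. The `objective` function and the candidate list are hypothetical and chosen only to make the numbers easy to read; they are not taken from the paper.

```python
def objective(k1, target=1.5):
    # Hypothetical objective, used only to contrast min with argmin.
    return (k1 - target) ** 2 + 0.25


candidates = [0.5, 1.0, 1.5, 2.0, 2.5]

# min: the smallest objective value (a score, not a usable k1).
min_value = min(objective(k) for k in candidates)

# argmin: the candidate k1 at which that smallest value occurs.
argmin_k1 = min(candidates, key=objective)

print(min_value)   # 0.25
print(argmin_k1)   # 1.5
```

Here `min_value` is 0.25 while `argmin_k1` is 1.5, so plugging the former in as k1 would give a different (and unintended) parameter.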