Same corpus, the other score is 0

rcy1122 commented 5 months ago

from rank_bm25 import BM25Okapi

corpus = [
    "Hello there good man!",
    "It is quite windy in London"
    "How is the weather today?"
]

tokenized_corpus = [doc.split(" ") for doc in corpus]

bm25 = BM25Okapi(tokenized_corpus)

query = "windy London"
tokenized_query = query.split(" ")

doc_scores = bm25.get_scores(tokenized_query)

print(doc_scores)

[0. 0.93729472 0. ]

But

from rank_bm25 import BM25Okapi

corpus = [
    "Hello there good man!",
    "It is quite windy in London",
    # "How is the weather today?"
]

tokenized_corpus = [doc.split(" ") for doc in corpus]

bm25 = BM25Okapi(tokenized_corpus)

query = "windy London"
tokenized_query = query.split(" ")

doc_scores = bm25.get_scores(tokenized_query)

print(doc_scores)

[0. 0.]

The difference lies in the number of corpus elements. It should be incorrect, but I don't know why？

rcy1122 commented 5 months ago

    def _calc_idf(self, nd):
        """
        Calculates frequencies of terms in documents and in corpus.
        This algorithm sets a floor on the idf values to eps * average_idf
        """
        # collect idf sum to calculate an average idf for epsilon value
        idf_sum = 0
        # collect words with negative idf to set them a special epsilon value.
        # idf can be negative if word is contained in more than half of documents
        negative_idfs = []
        for word, freq in nd.items():
            --> idf = math.log(self.corpus_size - freq + 0.5) - math.log(freq + 0.5)
            self.idf[word] = idf
            idf_sum += idf
            if idf < 0:
                negative_idfs.append(word)
        self.average_idf = idf_sum / len(self.idf)

        eps = self.epsilon * self.average_idf
        for word in negative_idfs:
            self.idf[word] = eps

Here
idf = math.log(self.corpus_size - freq + 0.5) - math.log(freq + 0.5)

If self.corpus_size = 2, freq=1, then idf will be equal to 0.

When using get_scores to calculate scores:

    def get_scores(self, query):
        """
        The ATIRE BM25 variant uses an idf function which uses a log(idf) score. To prevent negative idf scores,
        this algorithm also adds a floor to the idf value of epsilon.
        See [Trotman, A., X. Jia, M. Crane, Towards an Efficient and Effective Search Engine] for more info
        :param query:
        :return:
        """
        score = np.zeros(self.corpus_size)
        doc_len = np.array(self.doc_len)
        for q in query:
            q_freq = np.array([(doc.get(q) or 0) for doc in self.doc_freqs])
            print("self.idf.get(q) or 0 = ", self.idf.get(q) or 0)

           --> score += (self.idf.get(q) or 0) * (q_freq * (self.k1 + 1) /
                                               (q_freq + self.k1 * (1 - self.b + self.b * doc_len / self.avgdl)))
        return score

==> When calculating idf (self.idf.get(q) or 0) all will be equal to 0.

Is this correct?

jameswu1991 commented 3 weeks ago

it seems like the trigger condition is when freq * 2 = self.corpus_size. we're missing the +1 term in calculating idf, which would also prevent negative idf values

there seems to be an attempt to fix this missing +1 term here: https://github.com/dorianbrown/rank_bm25/pull/40 but the proposed code change mysteriously removes the freq term from calculation

shouldn't

- idf = math.log(self.corpus_size - freq + 0.5) - math.log(freq + 0.5)
+ idf = math.log((self.corpus_size - freq + 0.5) / freq + 0.5) + 1)

be the right expression for calcuating idf?

rcy1122 commented 3 weeks ago

@jameswu1991 Hi, Do you want to edit this issue? #40 has not been merged yet. I don't understand bm25very well, but is it possible to calculate it using : idf = math.log((self.corpus_size + 0.5) / (freq + 0.5))?

Thanks.

dorianbrown / rank_bm25

Same corpus, the other score is 0 #43