Open rcy1122 opened 5 months ago
def _calc_idf(self, nd):
"""
Calculates frequencies of terms in documents and in corpus.
This algorithm sets a floor on the idf values to eps * average_idf
"""
# collect idf sum to calculate an average idf for epsilon value
idf_sum = 0
# collect words with negative idf to set them a special epsilon value.
# idf can be negative if word is contained in more than half of documents
negative_idfs = []
for word, freq in nd.items():
--> idf = math.log(self.corpus_size - freq + 0.5) - math.log(freq + 0.5)
self.idf[word] = idf
idf_sum += idf
if idf < 0:
negative_idfs.append(word)
self.average_idf = idf_sum / len(self.idf)
eps = self.epsilon * self.average_idf
for word in negative_idfs:
self.idf[word] = eps
Here
idf = math.log(self.corpus_size - freq + 0.5) - math.log(freq + 0.5)
If self.corpus_size = 2
, freq=1
, then idf
will be equal to 0.
When using get_scores
to calculate scores:
def get_scores(self, query):
"""
The ATIRE BM25 variant uses an idf function which uses a log(idf) score. To prevent negative idf scores,
this algorithm also adds a floor to the idf value of epsilon.
See [Trotman, A., X. Jia, M. Crane, Towards an Efficient and Effective Search Engine] for more info
:param query:
:return:
"""
score = np.zeros(self.corpus_size)
doc_len = np.array(self.doc_len)
for q in query:
q_freq = np.array([(doc.get(q) or 0) for doc in self.doc_freqs])
print("self.idf.get(q) or 0 = ", self.idf.get(q) or 0)
--> score += (self.idf.get(q) or 0) * (q_freq * (self.k1 + 1) /
(q_freq + self.k1 * (1 - self.b + self.b * doc_len / self.avgdl)))
return score
==> When calculating idf (self.idf.get(q) or 0)
all will be equal to 0.
Is this correct?
it seems like the trigger condition is when freq * 2 = self.corpus_size
. we're missing the +1 term in calculating idf, which would also prevent negative idf values
there seems to be an attempt to fix this missing +1 term here: https://github.com/dorianbrown/rank_bm25/pull/40
but the proposed code change mysteriously removes the freq
term from calculation
shouldn't
- idf = math.log(self.corpus_size - freq + 0.5) - math.log(freq + 0.5)
+ idf = math.log((self.corpus_size - freq + 0.5) / freq + 0.5) + 1)
be the right expression for calcuating idf?
But
The difference lies in the number of corpus elements. It should be incorrect, but I don't know why?