Currently this line computes the inverse document frequency of a document (a string) inside a corpus.
That line reads idf = numpy.log1p(N / len(docs)), which computes log(1 + x) with x = N / len(docs). However, this formula does not appear among the idf variants listed on Wikipedia. Why this choice?
If we want to use the smooth idf, the formula should be idf = 1 + numpy.log(N / (1 + len(docs))).
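To make the difference concrete, here is a minimal sketch that evaluates both formulas side by side. The function names log1p_idf and smooth_idf, the parameter n_docs (standing in for len(docs)), and the corpus size N = 1000 are illustrative assumptions, not names or values from the dedupe code base.

```python
import numpy


def log1p_idf(N, n_docs):
    # Variant quoted above: log(1 + N / n_docs)
    return numpy.log1p(N / n_docs)


def smooth_idf(N, n_docs):
    # Smooth variant proposed above: 1 + log(N / (1 + n_docs))
    return 1 + numpy.log(N / (1 + n_docs))


if __name__ == "__main__":
    N = 1000  # hypothetical corpus size
    for n_docs in (1, 10, 100, 1000):
        print(n_docs, log1p_idf(N, n_docs), smooth_idf(N, n_docs))
```

Both variants decrease as n_docs grows, so the two mainly differ in scale; whether that changes any downstream result would have to be measured.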
It's a form of smoothing, and there are many smoothing strategies. If you'd like to show that your proposed strategy makes a difference in dedupe performance, I'd be happy to look at the PR.