dedupeio / dedupe

A Python library for accurate and scalable fuzzy matching, record deduplication and entity-resolution.
https://docs.dedupe.io
MIT License
4.16k stars 551 forks

About Inverse Document Frequency implementation #1126

lmores closed this issue 1 year ago

lmores commented 1 year ago

Currently this line computes the inverse document frequency of a document (a string) inside a corpus.

That line reads idf = numpy.log1p(N / len(docs)), which computes log(1 + x) with x = N / len(docs). However, this formula does not appear among the idf variants listed on Wikipedia. Why was it chosen?

If we want to use the smooth idf, the formula should be idf = 1 + numpy.log(N / (1 + len(docs))).
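To make the difference concrete, here is a minimal sketch that evaluates both formulas side by side. It is not dedupe's code: the function names, the corpus size N = 1000 and the document counts are hypothetical, and it uses the standard-library math module, whose log1p and log match numpy.log1p and numpy.log for scalar inputs.

```python
import math


def idf_log1p(n_total, n_matching):
    # Formula quoted above: log(1 + N / len(docs)).
    return math.log1p(n_total / n_matching)


def idf_smooth(n_total, n_matching):
    # Formula proposed above: 1 + log(N / (1 + len(docs))).
    return 1 + math.log(n_total / (1 + n_matching))


N = 1000  # hypothetical corpus size, chosen only for illustration

# Print both values for a few hypothetical document counts.
for n_matching in (1, 10, 100, 1000):
    print(n_matching, idf_log1p(N, n_matching), idf_smooth(N, n_matching))
```

Both functions decrease as the number of matching documents grows, so they rank rare and common terms in the same order. They differ in scale: when every document matches (len(docs) = N), the log1p form gives log(2), while the proposed form gives a value just below 1.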

fgregg commented 1 year ago

It's a form of smoothing, of which there are many strategies. If you'd like to show that your proposed smoothing strategy makes a difference in dedupe performance, I'd be happy to look at the PR.