dedupeio / dedupe

A Python library for accurate and scalable fuzzy matching, record deduplication and entity-resolution.
https://docs.dedupe.io
MIT License

Normalized Affine Gap algorithm is not really normalized #289

Closed gabriel-almeida closed 10 years ago

gabriel-almeida commented 10 years ago

The normalization factor in the normalizedAffineGapDistance() function does not put the return value between 0 and 1. Since most machine learning algorithms work better when all features fall within the same range, I'm flagging it.

Considering that matchWeight is the lowest weight and mismatchWeight is the greatest, a possible correction would be something like:

    maxLength = max(len(string1), len(string2))
    minDistance = matchWeight * maxLength
    maxDistance = mismatchWeight * maxLength
    return (distance - minDistance) / (maxDistance - minDistance)

Edit: Note that this still needs some modification, as it greatly overestimates the greatest possible distance when the strings have very different lengths.
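For reference, the min-max rescaling proposed in this comment could be sketched as a standalone function. This is only an illustration: `min_max_normalize` and its parameter names are hypothetical and not part of dedupe's API, the weights are passed in explicitly rather than taken from the library, and the zero-length guard and final clamp are additions that the snippet in the comment does not include.

```python
def min_max_normalize(distance, string1, string2, match_weight, mismatch_weight):
    """Rescale a raw affine gap distance into [0, 1] (illustrative sketch).

    Assumes match_weight is the lowest per-character cost and
    mismatch_weight is the highest, as stated in the comment.
    """
    max_length = max(len(string1), len(string2))

    # Guard (an addition): two empty strings leave nothing to scale by.
    if max_length == 0:
        return 0.0

    min_distance = match_weight * max_length
    max_distance = mismatch_weight * max_length

    # Guard (an addition): equal weights would make the range zero.
    if max_distance == min_distance:
        return 0.0

    scaled = (distance - min_distance) / (max_distance - min_distance)

    # Clamp (an addition) in case the raw distance falls outside the
    # assumed bounds, e.g. because gap costs exceed the mismatch cost.
    return min(1.0, max(0.0, scaled))
```

Because `max_distance` is based on the longer string, a short string compared against a much longer one can never come close to 1.0 here, which is the overestimation issue mentioned in the edit note.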

fgregg commented 10 years ago

The current settings are the product of a lot of fiddling. I think the only principled way to normalize this distance is to use a learnable edit distance (#14). See http://lingpipe-blog.com/2010/04/25/sequence-alignment-with-conditional-random-fields/

fgregg commented 10 years ago

Duplicate of #14