gabriel-almeida closed 10 years ago
The current settings are the product of a lot of fiddling. I think the only principled way to normalize this distance is to use a learnable edit distance (#14). See http://lingpipe-blog.com/2010/04/25/sequence-alignment-with-conditional-random-fields/
Duplicate of #14
The normalization factor in the normalizedAffineGapDistance() function does not put the return value between 0 and 1. Since most machine learning algorithms work better when all features fall within the same range, I'm flagging it.
Assuming that matchWeight is the lowest weight and mismatchWeight is the greatest, a possible correction would be something like:
    maxLength = max(len(string1), len(string2))
    minDistance = matchWeight * maxLength
    maxDistance = mismatchWeight * maxLength
    return (distance - minDistance) / (maxDistance - minDistance)
Edit: Note that this still needs some modifications, as it greatly overestimates the largest possible distance when the strings have very different lengths.
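For anyone who wants to try the proposal, here is a minimal standalone sketch of the min-max rescaling described in this issue. The function name rescale_distance, its signature, and the weights in the usage lines are hypothetical and not part of the library; it assumes matchWeight is the smallest per-character cost and mismatchWeight the largest, and it adds a guard for a zero-width range that the original suggestion does not have.

```python
def rescale_distance(distance, string1, string2, match_weight, mismatch_weight):
    """Min-max rescale a raw alignment distance (hypothetical helper).

    Assumes match_weight is the lowest per-character cost and
    mismatch_weight the highest, as proposed in this issue.
    """
    max_length = max(len(string1), len(string2))

    # Bounds if every position of the longer string were a match / a mismatch.
    min_distance = match_weight * max_length
    max_distance = mismatch_weight * max_length

    span = max_distance - min_distance
    if span == 0:
        # Both strings empty, or both weights equal: nothing to rescale.
        return 0.0

    return (distance - min_distance) / span


# Usage with made-up weights: three matches at cost 0 and one mismatch at cost 2.
example = rescale_distance(2.0, "abcd", "abcf", match_weight=0.0, mismatch_weight=2.0)
print(example)
```

The result is only guaranteed to fall in [0, 1] when the raw distance itself lies between the two bounds, and the caveat about strings of very different lengths still applies, since the upper bound is computed from the longer string alone.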