KyleAMathews / Fuzzymatcher.js

Fuzzymatching library optimized for powering auto-complete widgets
http://kyleamathews.github.com/Fuzzymatcher.js/
MIT License
18 stars 4 forks source link

Issues while computing the score. #5

Closed jeancroy closed 9 years ago

jeancroy commented 9 years ago

If I understand properly you are running bitap algorithm to find position, then use levenshtein to find edit distance.


Fuzzymatch.prototype._query = function(listName, query) {
//..
var position = self._match_main(data[i].name.toLowerCase(), query, 0);
//...
 var distance = self.levenshtein(datum.name, query);
 var score = distance + position;
//...
}

However bitap by itself is a levenshtein distance algorithm, so you are computing the score twice. A wiser option would be to modify _match_bitap_ to output it's score (and if you wish, position) :

//d is levenshtein distance, j is position
var score = match_bitapScore_(d, j - 1);

(You could however use the levenshtein function to compute match on string >32 char, because that is a limitation of bitap.)

Finaly that line:

 var score = distance + position;

It means that an offset on 1 in position is as costly as an error in edit distance. It's more severe than usual, and if it's indeed what you need, then there's no need to compute position at all, just compute distance from position 0 (error in distance and in position have same cost)

A typical usage is like this (score=distance+position/10 with your settings)

    function match_bitapScore_(e, x) {
       //...
       return accuracy + (proximity / dmp.options.Match_Distance);
   }

Lastly: Performant to at least ~10,000 items. My own experience with bitap suggest it's has trouble handling that many data in interactive time. have you tested that many item without if (num_matchs > 100) { break; } ? Supose you search for a string that is so uncommon that all of 10000 record need to be searched ?

KyleAMathews commented 9 years ago

I haven't touched this library in years... are you interested in maintaining it and fixing up the problems you see in the code?