garfieldnate / p5-Tree-BK

Efficient fuzzy matching in Perl
Other
1 stars 2 forks source link

Text::Levenshtein::XS can take a max distance threshold #2

Closed ugexe closed 10 years ago

ugexe commented 10 years ago

https://github.com/garfieldnate/p5-Tree-BK/blob/master/lib/Tree/BK.pm#L129

If you pass Text::Levenshtein::XS $threshold for its 3rd argument it will stop calculations and return undef if the threshold is exceeded. This saves time with longer strings, although for most words its probably negligible.

You would need to add something to your new's $metric assignment in case someone supplies a module that does not have this max distance functionality. Something like: { eval { supplied_func("aa","bb",1) }; $@ ? (use 2 argument with no max distance) : (use 3 argument with max distance) }

All the distance metric modules i've worked on support this. Text::LevenshteinXS does not, but no one should be using that module as it does not work correctly.

garfieldnate commented 10 years ago

I'm sorry, I don't quite understand the issue. Line 129 calls Text::Levenshtein::XS, or whatever metric subroutine was supplied, without the max_distance parameter, so it should never return undef, right?

ugexe commented 10 years ago

Correct. However if you 'did' pass a max_distance parameter it could be much faster depending on the input. If the user is not interested in strings beyond a certain max distance then the algorithm can bail out and not finish its calculations. With how you currently have it those results get filtered 'but' the algorithm is still calculating the entire thing. For most things this is probably fine, but for larger strings it might be wise to use. If you have 'a' x 2100 and 'b' x 2100 and are only interested in items that match 90% there is no reason to finish the last 89% of the algorithm iterations. Something like a plagiarism checker comes to mind.

garfieldnate commented 10 years ago

Oh, oh, you mean I should pass $threshold into the distance sub so it can die early and save time if possible. I'll get on that tonight. Thanks!

garfieldnate commented 10 years ago

Since you say it is more common for metrics to accept the third argument, I think I might just pass a third by default. If someone wants to use a method that will die with a third argument, then they can pass in an anonymous sub that ignores it:

sub {
    my ($a, $b) = @_;
    return two_arg_only_method($a, $b);
}
garfieldnate commented 10 years ago

Wait, I don't think I can apply this optimization. The distance between the current node and the input has to be calculated in order to continue searching the tree. The algorithm does use a threshold to skip the parts of the tree containing objects with too large a distance from the query, but this doesn't allow you to skip calculating the distance at any specific node.

ugexe commented 10 years ago

I see. I'm not familiar with BK Trees, I just figured you could just do something like

# _find
if(defined $distance && $distance <= $threshold){
    push @$current_list, $node->{object};
}

# and possibly in insert
if(defined $dist && $dist == 0){
    return;
}
garfieldnate commented 10 years ago

I can't put a threshold on the distance in insert because the distance is needed to insert the item in the tree. The idea for _find works for selecting whether or not to add the item to the search results, but it does not work for continuing the search down the rest of the tree. Since you've written a lot of related modules, I think you might finding the short articles on BK trees in the documentation to be interesting.