Closed ugexe closed 10 years ago
I'm sorry, I don't quite understand the issue. Line 129 calls Text::Levenshtein::XS, or whatever metric subroutine was supplied, without the max_distance
parameter, so it should never return undef
, right?
Correct. However if you 'did' pass a max_distance parameter it could be much faster depending on the input. If the user is not interested in strings beyond a certain max distance then the algorithm can bail out and not finish its calculations. With how you currently have it those results get filtered 'but' the algorithm is still calculating the entire thing. For most things this is probably fine, but for larger strings it might be wise to use. If you have 'a' x 2100 and 'b' x 2100 and are only interested in items that match 90% there is no reason to finish the last 89% of the algorithm iterations. Something like a plagiarism checker comes to mind.
Oh, oh, you mean I should pass $threshold
into the distance sub so it can die early and save time if possible. I'll get on that tonight. Thanks!
Since you say it is more common for metrics to accept the third argument, I think I might just pass a third by default. If someone wants to use a method that will die with a third argument, then they can pass in an anonymous sub that ignores it:
sub {
my ($a, $b) = @_;
return two_arg_only_method($a, $b);
}
Wait, I don't think I can apply this optimization. The distance between the current node and the input has to be calculated in order to continue searching the tree. The algorithm does use a threshold to skip the parts of the tree containing objects with too large a distance from the query, but this doesn't allow you to skip calculating the distance at any specific node.
I see. I'm not familiar with BK Trees, I just figured you could just do something like
# _find
if(defined $distance && $distance <= $threshold){
push @$current_list, $node->{object};
}
# and possibly in insert
if(defined $dist && $dist == 0){
return;
}
I can't put a threshold on the distance in insert
because the distance is needed to insert the item in the tree. The idea for _find
works for selecting whether or not to add the item to the search results, but it does not work for continuing the search down the rest of the tree.
Since you've written a lot of related modules, I think you might finding the short articles on BK trees in the documentation to be interesting.
https://github.com/garfieldnate/p5-Tree-BK/blob/master/lib/Tree/BK.pm#L129
If you pass Text::Levenshtein::XS $threshold for its 3rd argument it will stop calculations and return undef if the threshold is exceeded. This saves time with longer strings, although for most words its probably negligible.
You would need to add something to your new's $metric assignment in case someone supplies a module that does not have this max distance functionality. Something like: { eval { supplied_func("aa","bb",1) }; $@ ? (use 2 argument with no max distance) : (use 3 argument with max distance) }
All the distance metric modules i've worked on support this. Text::LevenshteinXS does not, but no one should be using that module as it does not work correctly.