Closed puneetsl closed 9 years ago
Cheers. I'm certainly interested and am going to check this out. Meanwhile, perhaps you could answer a few questions.
Is TextBrew symmetric (e.g. `compare(a,b)==compare(b,a)`), reflexive (e.g. `compare(a,a)==1.0`) and consistent with equals (e.g. `a.equals(b)` implies `compare(a,b) == 1.0`)? Are similarity scores normalized between 0.0 and 1.0 inclusive?
Thanks a lot! Answering your questions here:
Text Brew is
Yes, it would be consistent with equals. Yes, the similarity scores are normalized between 0.0 and 1.0.
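The properties asked about above can be spot-checked mechanically. The following is a minimal sketch, not code from SimMetrics or jtextbrew: the `Metric` interface, the class name and the two toy metrics are hypothetical stand-ins used only for illustration.

```java
public class MetricPropertyCheck {

    /** Hypothetical stand-in for a string metric under test (an assumption, not a library type). */
    interface Metric {
        float compare(String a, String b);
    }

    /** Symmetry: swapping the arguments must not change the score. */
    static boolean isSymmetric(Metric m, String a, String b) {
        return m.compare(a, b) == m.compare(b, a);
    }

    /** Reflexivity: a string compared with itself scores 1.0. */
    static boolean isReflexive(Metric m, String a) {
        return m.compare(a, a) == 1.0f;
    }

    /** Normalization: the score lies in [0.0, 1.0] inclusive. */
    static boolean isNormalized(Metric m, String a, String b) {
        float score = m.compare(a, b);
        return score >= 0.0f && score <= 1.0f;
    }

    /** Consistency with equals: a.equals(b) implies a score of 1.0. */
    static boolean isConsistentWithEquals(Metric m, String a, String b) {
        return !a.equals(b) || m.compare(a, b) == 1.0f;
    }

    public static void main(String[] args) {
        // Toy metric: 1.0 for equal strings, 0.0 otherwise.
        Metric identity = (a, b) -> a.equals(b) ? 1.0f : 0.0f;
        // Toy asymmetric metric: 1.0 if b starts with a, 0.0 otherwise.
        Metric prefix = (a, b) -> b.startsWith(a) ? 1.0f : 0.0f;

        System.out.println(isSymmetric(identity, "ab", "abc"));
        System.out.println(isSymmetric(prefix, "ab", "abc"));
        System.out.println(isReflexive(prefix, "ab"));
        System.out.println(isNormalized(prefix, "ab", "abc"));
    }
}
```

Checks like these only probe the pairs they are given, so they can show that a property fails but cannot prove that it holds in general.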
Cool. Looking forward to the PR then.
Regarding the lack of symmetry, that could be solved by using `max(compare(a,b), compare(b,a))`, as done in `TextBrew.compareAndGiveBestScore`, for the implementation of `StringMetric.compare`. Or perhaps more efficiently by swapping the arguments as needed, based on the relation of the add/remove costs.
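The max-based approach can be sketched as a small wrapper. This is an illustration under stated assumptions, not the TextBrew or SimMetrics code: the `Metric` interface, `symmetrize` and `toyCompare` are hypothetical names, and the toy metric merely stands in for an asymmetric scorer.

```java
public class SymmetricMetricSketch {

    /** Hypothetical stand-in for a string metric; not the SimMetrics interface itself. */
    interface Metric {
        float compare(String a, String b);
    }

    /** Wraps a possibly asymmetric metric so that argument order no longer matters. */
    static Metric symmetrize(Metric asymmetric) {
        return (a, b) -> Math.max(asymmetric.compare(a, b), asymmetric.compare(b, a));
    }

    /** Toy asymmetric metric for illustration only: fraction of a's characters that occur in b. */
    static float toyCompare(String a, String b) {
        if (a.isEmpty()) {
            return b.isEmpty() ? 1.0f : 0.0f;
        }
        int hits = 0;
        for (char c : a.toCharArray()) {
            if (b.indexOf(c) >= 0) {
                hits++;
            }
        }
        return (float) hits / a.length();
    }

    public static void main(String[] args) {
        Metric toy = SymmetricMetricSketch::toyCompare;
        Metric symmetric = symmetrize(toy);

        // The raw scores differ by argument order; the wrapped scores do not.
        System.out.println(toy.compare("ab", "abcd"));       // 1.0
        System.out.println(toy.compare("abcd", "ab"));       // 0.5
        System.out.println(symmetric.compare("ab", "abcd")); // 1.0
        System.out.println(symmetric.compare("abcd", "ab")); // 1.0
    }
}
```

The wrapper evaluates the underlying metric twice per comparison, which is the cost the argument-swapping alternative would avoid.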
Short summary of outcome.
On closer inspection, TextBrew turned out to be a heuristic applied to Damerau-Levenshtein distance. Damerau-Levenshtein has been added to SimMetrics.
Hi,
Can I create a pull request to include the TextBrew string matching algorithm? It is quite similar to edit distance but works quite well for matching abbreviations or shorthands for words (http://www.ling.ohio-state.edu/~cbrew/795M/string-distance.html).
Here is the implementation: https://github.com/puneetsl/jtextbrew. You can see a few test cases in the test package.
Thanks, Puneet