Simmetrics / simmetrics

Similarity or Distance Metrics, e.g. Levenshtein, for Java
Apache License 2.0
340 stars 77 forks source link

Including TextBrew #16

Closed puneetsl closed 9 years ago

puneetsl commented 9 years ago

Hi,

Can I create a pull request to include TextBrew string matching algorithm which is quite similar to Edit Distance but works quite well for matching abbreviations or short-hands for words (http://www.ling.ohio-state.edu/~cbrew/795M/string-distance.html).

Here is the implementation: https://github.com/puneetsl/jtextbrew You could see a few test cases in the test package.

Thanks, Puneet

mpkorstanje commented 9 years ago

Cheers. Am certainly interested. Going to check this out. Meanwhile perhaps you could you answer a few questions.

  1. Is text brew both symmetric (e.g. compare(a,b)==compare(b,a)), reflexives (e.g. compare(a,a)==1.0) and consistent with equals? (e.g. a.equals(b) implies compare(a,b) == 1.0). Are similarity scores normalized between 0.0 and 1.0 inclusive?
  2. I had a quick look at the code and while it looks good I do see that some structural changes are required. The current implementation uses static methods and static-state while all the Metrics in SimMetrics implement either the String-, Set- or ListMetric are non-static. To make these compatible are you going to maintain two different versions of JTextBrew or only develop the SimMetrics version? I'd like to avoid duplicated effort when possible (though I don't think that will be possible right now).
  3. Will you use Maven?
puneetsl commented 9 years ago

Thanks a lot! Answering your questions here:

  1. Text Brew is

    • Not symmetric
    • Reflexive

    Yes, it would be consistent with equals. Yes, the similarity scores are normalized between 0.0 and 1.0.

  2. I think I would not make much changes in jTextBrew once I'll port it to Simmitrics, and the pull request gets accepted. I may however like to fix any bugs (if found) in the original code. I'll also follow the architechture Simmitrics is using to port it, so, I'll maintain just the SimMetrics version.
  3. I'll use Maven
mpkorstanje commented 9 years ago

Cool. Looking forward to the PR then.

Regarding the lack of symetry, that could be solved by using max(compare(a,b), compare(b,a) as done in TextBrew.compareAndGiveBestScore for the implementation of StringMetric.compare. Or perhaps more efficiently by swapping the arguments as needed based on the relation of the add/remove costs.

mpkorstanje commented 9 years ago

Short summary of outcome.

On closer inspection TextBrew turned out to be a heuristic applied to [Damerau-Levenshtein](Damerau–Levenshtein distance). Damerau-Levenshtein has been added to SimMetrics.