Simmetrics / simmetrics

Similarity or Distance Metrics, e.g. Levenshtein, for Java
Apache License 2.0

Add TextBrew algorithm to Simmetrics #17

Closed puneetsl closed 9 years ago

puneetsl commented 9 years ago

Introduction

Adding a new algorithm TextBrew to Simmetrics.

Feature of this algorithm

With this algorithm, users would be able to match very vague texts as well by tuning the cost parameters of the edit distance. This would enable users to match strings like United States of America and U S A.

mpkorstanje commented 9 years ago

Cheers. I do have a few requests before I can merge this though.

Could you remove the superfluous @see tags? Just the one that references the initial description by ohio-state should be enough.

Can you also remove the @author tags? Authorship of individual files, especially in an open-source project where they're written by many people, is a bit sketchy. You can however put your name and email into the AUTHORS file.

Finally, the Apache reference has to go. This project is licensed under the GPL; it's not possible to provide a single file under a different or dual licence.

puneetsl commented 9 years ago

Thanks for the guidance. I have updated the code according to your suggestions. Hopefully this can be merged now.

mpkorstanje commented 9 years ago

You may have forgotten to commit removing the Apache License reference.

mpkorstanje commented 9 years ago

Also I'm going over the code right now in detail and I'm noticing the following things

  1. The Cell objects aren't needed. The only field that can actually be conveyed to users of TextBrew is the cost. So the matrix can be a matrix of floats rather than Cells. This also means that we don't need a full n×m matrix but can make do with a 3×n matrix and swap arrays as we go along (see the Levenshtein implementation, and the rough sketch after this list).
  2. The start cost is never used. It is put into matrix[0][0] which is never read from.
  3. If we're always taking the best result of computeSimilarity(a, b) and computeSimilarity(b, a) to use in compare(a, b), then separate insert and delete costs aren't needed. We'll always use the smaller value.
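
For reference, a rough sketch of that rolling-row idea with plain unit costs. The names here are illustrative and this is not the actual Simmetrics code; TextBrew's transposition case would keep one extra previous row (hence 3×n), but the array-swapping principle is the same.

// Rough sketch only: two-row edit distance with unit costs. Illustrative,
// not the actual Simmetrics implementation.
static float distance(String a, String b) {
    float[] prev = new float[b.length() + 1];
    float[] curr = new float[b.length() + 1];

    // Cost of turning the empty prefix of a into each prefix of b.
    for (int j = 0; j <= b.length(); j++) {
        prev[j] = j;
    }

    for (int i = 1; i <= a.length(); i++) {
        curr[0] = i;
        for (int j = 1; j <= b.length(); j++) {
            float substitute = prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
            float delete = prev[j] + 1;
            float insert = curr[j - 1] + 1;
            curr[j] = Math.min(substitute, Math.min(delete, insert));
        }
        // Swap the two rows instead of keeping the full n×m matrix of Cells.
        float[] swap = prev;
        prev = curr;
        curr = swap;
    }
    return prev[b.length()];
}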

Removing the cells and merging the delete/insert cost would significantly improve TextBrew's speed. I can make these changes. Just want to make sure you aren't intending to use the Cells for anything.

puneetsl commented 9 years ago

Hi, yes, you can go ahead and make those changes. The only reason Cell is used here is to trace the path for testing purposes. I don't think this feature would be needed by Simmetrics users. I don't mind any of the changes you suggested, as long as the test cases keep passing and the speed improves. Cheers :beers:

mpkorstanje commented 9 years ago

I've found a problem with the normalization. When setting all the costs (match, insert, delete, transpose and substitute) to 2.0, TextBrew starts producing negative similarity scores. The cause is twofold: costs need to be scaled down such that the largest cost is <= 1.0, and scores need to be normalized by the length of the longer of the two input strings.

edit: This means that all the unit tests will break.
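
For a rough illustration using the retScore formula from the pull request (quoted in full below), assuming matched characters also accumulate the match cost: with every cost set to 2.0, comparing a three-character string to itself gives a total cost of 3 * 2.0 = 6.0, the average length is 3, and retScore = 1 - 6.0/3 = -1.0.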

puneetsl commented 9 years ago

Please refer to http://www.slideshare.net/kyleburton/fuzzy-string-matching; if you go to page 37, you can see that the authors have used 15 as the delete cost. I don't think scaling down would produce the same result. A negative similarity score is fine, as I have used this condition:

if (retScore < 0.05)
    return 0.0; // for all erroneous cases
else
    return retScore;

I have checked with the Perl implementation as well and that gives the same result. You can check that my library's results are consistent with the slides' results in the attached image (textbrew).

You can check out these implementations in other languages; I may have made some mistakes while adapting or implementing it. https://github.com/kyleburton/fuzzy-string/tree/master/other-langs/textBrew

puneetsl commented 9 years ago

Just checked these test cases: https://raw.githubusercontent.com/kyleburton/fuzzy-string/master/other-langs/textBrew/data/test-cases.tab My implementation gives the right distance, but I don't know how the similarity score is calculated in this file. Apparently they do not have any negative similarity scores. I was calculating similarity the way it is calculated for edit distance, but I guess some normalization needs to be done. I need to figure this out.

mpkorstanje commented 9 years ago

Mmh. TextBrew seems to have some rather strange edges.

Regarding the examples used on the slides, it is only required that the delete cost is sufficiently large. Because for textBrewDistance(a, b, W) = D, where textBrewDistance is the (unnormalized) weighted distance function, a and b are strings, and W the weights, we can scale the weights by any c such that textBrewDistance(a, b, W/c) = D/c. I'd write down a more concrete proof but I'm a bit short on time.
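
As a concrete (made-up) example: if W has a delete cost of 15 and textBrewDistance(a,b,W) = 30 for some pair, then taking c = 15 gives weights W/15 with a delete cost of 1 and textBrewDistance(a,b,W/15) = 2. Every distance is divided by the same constant, so the relative ordering of candidate matches is unchanged.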

This distance is normalized by this bit of code:

        double len = (left.length() + right.length()) / 2.0;
        int maxlen = Math.max(left.length(), right.length());

        // Reject pairs whose average length is at most half the maximum length.
        if ((double) len / (double) maxlen <= 0.5)
            return 0.0;

        // Turn the weighted distance into a score relative to the average length.
        double retScore = (1.0 - ((t.computeSimilarity(left, right).cost) / (len)));
        if (retScore < 0.05)
            return 0.0; // for all erroneous cases
        else
            return retScore;

Now this normalization seems rather ad hoc. First of all, it is using the average string length against a weighted distance. These are two different units, which doesn't yield a percentage score as presented in the slides. A percentage score would require dividing by the maximum possible score.

Additionally, the if (retScore < 0.05) check goes off the rails with configurable costs. You can set the cost of a match to 10 and the other costs to 100. This means that any equal strings with a length of less than 10 would be discarded as dissimilar. This breaks the reflexivity requirement.

Now these are not the default settings, but they're part of the algorithm, and any algorithm that is included should work under all circumstances.

So I would use a proper normalization like:

// Scale all costs down so that the largest cost equals 1.0.
float max = max(match, insert, delete, substitute, transpose);
this.match = match / max;
this.insert = insert / max;
this.delete = delete / max;
this.substitute = substitute / max;
this.transpose = transpose / max;

// Normalize by the length of the longer string and take the best of both directions.
return 1.0f - min(textBrew(a, b), textBrew(b, a)) / max(a.length(), b.length());

But this is definitely different from TextBrew.

puneetsl commented 9 years ago

I found this Ruby implementation, which I think generates those similarity scores using keyboard distance: https://github.com/kyleburton/fuzzy-string/blob/master/edist-app/lib/brew.rb

I don't know if this is the right way to compute similarity scores, but I just wanted to bring it to your attention.

mpkorstanje commented 9 years ago

Gave it some thought. I can't add textBrew, but I can add a more generalized weighted edit distance, editDistance(a, b, W/d). Then, given that, you can write your own textBrew as:

textBrewSimilarity(a,b,W) = 0.0 if avg(|a|,|b|) < 0.5 * max(|a|,|b|)
                            0.0 if textBrewDistance(a,b,W) > avg(|a|,|b|)
                            1 - textBrewDistance(a,b,W) / avg(|a|,|b|)  otherwise

textBrewDistance(a,b,W) = d * editDistance(a,b,W/d), where d is the largest weight in W
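
A rough Java sketch of that wrapper, assuming a hypothetical weighted editDistance(a, b, w) helper and a Weights type (illustrative names, not the final Simmetrics API):

// Illustrative sketch only; Weights, w.max(), w.dividedBy(d) and
// editDistance(a, b, w) are assumed helpers, not the final Simmetrics API.
static float textBrewSimilarity(String a, String b, Weights w) {
    float avg = (a.length() + b.length()) / 2.0f;
    float max = Math.max(a.length(), b.length());

    // Reject pairs whose average length is less than half the maximum length.
    if (avg < 0.5f * max) {
        return 0.0f;
    }

    // Scale the weights down by the largest weight d, compute the generalized
    // edit distance, then scale the result back up.
    float d = w.max();
    float distance = d * editDistance(a, b, w.dividedBy(d));

    if (distance > avg) {
        return 0.0f;
    }
    return 1.0f - distance / avg;
}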

puneetsl commented 9 years ago

Sounds great! But the algorithm you are talking about is in fact quite similar to TextBrew; if I am not wrong, TextBrew is basically a weighted edit distance that also tracks the path. I think a weighted edit distance could be a great addition to Simmetrics. About this pull request: are we not going forward with this code, then?

mpkorstanje commented 9 years ago

It's actually a weighted Damerau–Levenshtein distance. And the path tracking isn't really something new from TextBrew either. It'd be more accurate to say that TextBrew applies some heuristics to the edit distance to make it effective for specific cases.

https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance
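
For anyone following along, a minimal sketch of a weighted Damerau–Levenshtein (optimal string alignment) distance; the weight parameters here are hypothetical and this is not necessarily the variant used in Simmetrics:

// Illustrative sketch of a weighted Damerau-Levenshtein (optimal string
// alignment) distance. The weight parameters are hypothetical.
static float distance(String a, String b,
        float insertCost, float deleteCost, float substituteCost, float transposeCost) {
    float[][] d = new float[a.length() + 1][b.length() + 1];

    // Deleting or inserting every character of the respective prefix.
    for (int i = 0; i <= a.length(); i++) {
        d[i][0] = i * deleteCost;
    }
    for (int j = 0; j <= b.length(); j++) {
        d[0][j] = j * insertCost;
    }

    for (int i = 1; i <= a.length(); i++) {
        for (int j = 1; j <= b.length(); j++) {
            float cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : substituteCost;
            d[i][j] = Math.min(d[i - 1][j - 1] + cost,
                    Math.min(d[i - 1][j] + deleteCost, d[i][j - 1] + insertCost));

            // Transposition of two adjacent characters.
            if (i > 1 && j > 1
                    && a.charAt(i - 1) == b.charAt(j - 2)
                    && a.charAt(i - 2) == b.charAt(j - 1)) {
                d[i][j] = Math.min(d[i][j], d[i - 2][j - 2] + transposeCost);
            }
        }
    }
    return d[a.length()][b.length()];
}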

And yes. I won't use this pull request. Sorry about that. It looked pretty promising.

mpkorstanje commented 9 years ago

I've just added Damerau. The distance method isn't public yet. I'm still working on that for a couple of other metrics.