Closed: puneetsl closed this pull request 9 years ago
Cheers. I do have a few requests before I can merge this though.
Could you remove the superfluous @see tags? Just the one that references the initial description by ohio-state should be enough.
Can you also remove the @author tags? Authorship of individual files, especially in an open-source project where they're written by many people, is a bit sketchy. You can, however, put your name and mail into the AUTHORS file.
Finally, the Apache reference has to go. This project is licensed under the GPL; it's not possible to provide a single file under a different or dual license.
Thanks for the guidance. I have updated the code according to your suggestions. Hopefully this can be merged now.
You may have forgotten to commit removing the Apache License reference.
Also, I'm going over the code in detail right now and I'm noticing the following things:
Removing the cells and merging the delete/insert costs would significantly improve TextBrew's speed. I can make these changes; I just want to make sure you aren't intending to use the Cells for anything.
Hi, yes, you can go ahead and make those changes. The only reason Cell is used here is to trace the path for testing purposes. I don't think this feature would be needed by Simmetrics users. I won't mind any changes to the code that you suggest, as long as the test cases pass and the speed improves. Cheers :beers:
I've found a problem with the normalization. When setting all the costs (match, insert, delete, transpose and substitute) to 2.0, TextBrew starts producing negative similarity scores. The cause is twofold: the costs need to be scaled down such that the largest cost is <= 1.0, and the scores need to be normalized by the length of the longer of the two input strings.
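A minimal sketch of how the negative scores arise, using a plain weighted Levenshtein (the class and method names here are illustrative, not the Simmetrics or pull-request code): with every cost at 2.0, every aligned column costs 2, so the raw distance is at least twice the longer length and the average-length "normalization" always goes negative.

```java
// Sketch only: a plain weighted Levenshtein, not the actual TextBrew code.
public class NegativeScoreDemo {

    // Weighted edit distance; match/substitute/insert/delete costs are parameters.
    static double dist(String a, String b, double match, double sub, double ins, double del) {
        int n = a.length(), m = b.length();
        double[][] d = new double[n + 1][m + 1];
        for (int i = 1; i <= n; i++) d[i][0] = d[i - 1][0] + del;
        for (int j = 1; j <= m; j++) d[0][j] = d[0][j - 1] + ins;
        for (int i = 1; i <= n; i++)
            for (int j = 1; j <= m; j++) {
                double diag = d[i - 1][j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? match : sub);
                d[i][j] = Math.min(diag, Math.min(d[i - 1][j] + del, d[i][j - 1] + ins));
            }
        return d[n][m];
    }

    // The average-length normalization used in the pull request.
    static double score(String a, String b, double cost) {
        double avg = (a.length() + b.length()) / 2.0;
        return 1.0 - dist(a, b, cost, cost, cost, cost) / avg;
    }

    public static void main(String[] args) {
        // All costs 2.0: each aligned column costs 2, so the distance is at
        // least 2 * max(|a|,|b|) >= 2 * avg, and the score is always negative.
        System.out.println(score("kitten", "sitting", 2.0));
    }
}
```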
edit: This means that all the unit tests will break.
Please refer to http://www.slideshare.net/kyleburton/fuzzy-string-matching — on page 37 you can see that the authors used 15 as the delete cost. I don't think scaling down would produce the same result. A negative similarity score is fine, as I have used this condition:
if (retScore < 0.05)
    return 0.0; // for all erroneous cases
else
    return retScore;
I have checked against the Perl implementation as well, and it gives the same result. You can see in this image that my library's results are consistent with the slides' results.
You can check out the implementations in other languages; I may have made some mistake while adapting or implementing. https://github.com/kyleburton/fuzzy-string/tree/master/other-langs/textBrew
Just checked these test cases: https://raw.githubusercontent.com/kyleburton/fuzzy-string/master/other-langs/textBrew/data/test-cases.tab My implementation gives the right distance, but I don't know how the similarity score is calculated in this file. Apparently they do not have any negative similarities, while I was calculating similarity the way it is calculated for edit distance. I guess some normalization needs to be done; I need to figure this out.
Mmh. TextBrew seems to have some rather strange edges.
Regarding the examples used in the slides: it is only required that the delete cost is sufficiently large. Let textBrewDistance(a, b, W) = D, where textBrewDistance is the (unnormalized) weighted distance function, a and b are strings, and W the weights. Then we can scale the weights by any c > 0 such that textBrewDistance(a, b, W/c) = D/c. I'd write down a more concrete proof but I'm a bit short on time.
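The scaling claim can be checked numerically with a small sketch (again a plain weighted Levenshtein with hypothetical names, not the Simmetrics code): the DP takes a minimum over sums of weights, so the result is positively homogeneous in W.

```java
// Sketch: check textBrewDistance(a, b, W) = c * textBrewDistance(a, b, W/c).
public class ScaleInvariance {

    static double dist(String a, String b, double match, double sub, double ins, double del) {
        int n = a.length(), m = b.length();
        double[][] d = new double[n + 1][m + 1];
        for (int i = 1; i <= n; i++) d[i][0] = d[i - 1][0] + del;
        for (int j = 1; j <= m; j++) d[0][j] = d[0][j - 1] + ins;
        for (int i = 1; i <= n; i++)
            for (int j = 1; j <= m; j++) {
                double diag = d[i - 1][j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? match : sub);
                d[i][j] = Math.min(diag, Math.min(d[i - 1][j] + del, d[i][j - 1] + ins));
            }
        return d[n][m];
    }

    public static void main(String[] args) {
        double c = 15.0; // scale factor, e.g. the slides' delete cost
        double full = dist("fuzzy", "wuzzy", 0.0, c, c, c);
        double scaled = dist("fuzzy", "wuzzy", 0.0, 1.0, 1.0, 1.0);
        // The distance scales linearly with the weights: full == c * scaled.
        System.out.println(full + " == " + (c * scaled));
    }
}
```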
This distance is normalized by this bit of code:
double len = (left.length() + right.length()) / 2.0;
int maxlen = Math.max(left.length(), right.length());
if (len / maxlen <= 0.5)
    return 0.0;
double retScore = 1.0 - (t.computeSimilarity(left, right).cost / len);
if (retScore < 0.05)
    return 0.0; // for all erroneous cases
else
    return retScore;
Now this normalization seems rather ad hoc. First of all, it divides a weighted distance by the average string length; these are two different units, which doesn't yield a percentage score as presented in the slides. A percentage score would require dividing by the maximum possible score.
Additionally, the if (retScore < 0.05) check goes off the rails with configurable costs. You can set the cost of a match to 10 and the other costs to 100; then every pair of equal strings scores 1 - 10 = -9 and is discarded as dissimilar. This breaks the reflexivity requirement.
Now these are not the default settings, but they're part of the algorithm, and any algorithm that is included should work under all circumstances.
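A worked example of the reflexivity break, assuming the average-length normalization quoted above (the class and method names are illustrative, not the pull request's code): for a pair of equal strings every character is a match, so the score collapses to 1 minus the match cost regardless of the input.

```java
// Sketch: average-length normalization applied to a pair of equal strings.
public class ReflexivityBreak {

    // For equal strings every character is a match, so the raw
    // distance is simply matchCost * length.
    static double score(String s, double matchCost) {
        double avg = s.length();                 // average of two equal lengths
        double rawDist = matchCost * s.length();
        double retScore = 1.0 - rawDist / avg;   // = 1 - matchCost
        return retScore < 0.05 ? 0.0 : retScore; // the 0.05 cutoff
    }

    public static void main(String[] args) {
        // With a match cost of 10 every string is "dissimilar" to itself.
        System.out.println(score("brew", 10.0)); // prints 0.0
    }
}
```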
So I would use a proper normalization like:
float max = max(match, insert, delete, substitute, transpose);
this.match = match / max;
this.insert = insert / max;
this.delete = delete / max;
this.substitute = substitute / max;
this.transpose = transpose / max;

return 1.0f - min(textBrew(a, b), textBrew(b, a)) / max(a.length(), b.length());
But this is definitely different from TextBrew.
I found this Ruby implementation, which I think generates those similarity scores using keyboard distance: https://github.com/kyleburton/fuzzy-string/blob/master/edist-app/lib/brew.rb
I don't know if this is the right way to compute similarity scores, but I just wanted to bring it to your notice.
Gave it some thought. I can't add TextBrew, but I can add a more generalized weighted edit distance editDistance(a, b, W).
Then, given that distance, you can write your own TextBrew as:
textBrewSimilarity(a, b, W) =
    0.0                                            if avg(|a|, |b|) < 0.5 * max(|a|, |b|)
    0.0                                            if textBrewDistance(a, b, W) > avg(|a|, |b|)
    1 - textBrewDistance(a, b, W) / avg(|a|, |b|)  otherwise

textBrewDistance(a, b, W) = d * editDistance(a, b, W/d), where d = max(w in W)
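The definitions above could be sketched roughly like this, with a plain weighted Levenshtein standing in for the proposed editDistance (all names here are assumptions, not the actual Simmetrics API):

```java
// Sketch of the proposed textBrewSimilarity, built on a generic weighted
// edit distance. Names are illustrative, not the actual Simmetrics API.
public class TextBrewSketch {

    // Plain weighted Levenshtein (no transposition) as a stand-in for editDistance.
    static double editDistance(String a, String b, double match, double sub, double ins, double del) {
        int n = a.length(), m = b.length();
        double[][] d = new double[n + 1][m + 1];
        for (int i = 1; i <= n; i++) d[i][0] = d[i - 1][0] + del;
        for (int j = 1; j <= m; j++) d[0][j] = d[0][j - 1] + ins;
        for (int i = 1; i <= n; i++)
            for (int j = 1; j <= m; j++) {
                double diag = d[i - 1][j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? match : sub);
                d[i][j] = Math.min(diag, Math.min(d[i - 1][j] + del, d[i][j - 1] + ins));
            }
        return d[n][m];
    }

    static double similarity(String a, String b, double match, double sub, double ins, double del) {
        double avg = (a.length() + b.length()) / 2.0;
        double max = Math.max(a.length(), b.length());
        if (avg < 0.5 * max) return 0.0;

        // Scale the weights so the largest is 1, run the generic distance,
        // then scale the result back up: d * editDistance(a, b, W/d).
        double d = Math.max(Math.max(match, sub), Math.max(ins, del));
        double rawDist = d * editDistance(a, b, match / d, sub / d, ins / d, del / d);

        if (rawDist > avg) return 0.0;
        return 1.0 - rawDist / avg;
    }

    public static void main(String[] args) {
        // Equal strings with a zero match cost now score a full 1.0.
        System.out.println(similarity("brew", "brew", 0.0, 1.0, 1.0, 1.0));
    }
}
```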
Sounds great! But the algorithm you are talking about is quite similar to TextBrew; in fact, if I am not wrong, TextBrew is basically a weighted edit distance that also tracks the path. I think a weighted edit distance would be a great addition to Simmetrics. About this pull request: so are we not going forward with this code?
It's actually a weighted Damerau–Levenshtein distance. And the path tracking isn't really something new in TextBrew either. It'd be more accurate to say that TextBrew applies some heuristics to the edit distance to make it effective for specific cases.
https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance
And yes. I won't use this pull request. Sorry about that. It looked pretty promising.
I've just added Damerau. The distance method isn't public yet. I'm still working on that for a couple of other metrics.
Introduction
Adding a new algorithm TextBrew to Simmetrics.
Feature of this algorithm
With this algorithm, the user would be able to match very vague texts as well by tuning the cost parameters of the edit distance. This would enable the user to match strings like
United States of America
and
U S A
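A rough illustration of the idea, using a plain weighted Levenshtein with adjustable costs (a sketch only, not the pull request's TextBrew implementation): making deletions very cheap lets the long form collapse onto the acronym.

```java
// Sketch: cheap deletions let "United States of America" match "U S A".
public class VagueMatchDemo {

    static double dist(String a, String b, double match, double sub, double ins, double del) {
        int n = a.length(), m = b.length();
        double[][] d = new double[n + 1][m + 1];
        for (int i = 1; i <= n; i++) d[i][0] = d[i - 1][0] + del;
        for (int j = 1; j <= m; j++) d[0][j] = d[0][j - 1] + ins;
        for (int i = 1; i <= n; i++)
            for (int j = 1; j <= m; j++) {
                double diag = d[i - 1][j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? match : sub);
                d[i][j] = Math.min(diag, Math.min(d[i - 1][j] + del, d[i][j - 1] + ins));
            }
        return d[n][m];
    }

    public static void main(String[] args) {
        String a = "United States of America", b = "U S A";
        // "U S A" is a subsequence of the long form, so only the 19
        // surplus characters need to be deleted.
        double uniform = dist(a, b, 0.0, 1.0, 1.0, 1.0);  // 19 deletions at cost 1
        double cheapDel = dist(a, b, 0.0, 1.0, 1.0, 0.1); // the same 19 deletions at cost 0.1
        System.out.println(uniform + " vs " + cheapDel);
    }
}
```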