You are quite right. I started on the PersonNameComparator, but never actually
finished it, because I lost faith in it. It seemed to become a mass of special
cases with no real justification or overarching theory. In practice, it was
superseded by JaroWinkler and Levenshtein, which I think do a better job.
I'm tempted to just delete the whole class, and instead focus on
industry-standard comparators.
What do you think?
Original comment by lar...@gmail.com
on 28 Oct 2011 at 6:40
I am hitting a few cases in name comparisons where the only difference is in the first names, those names are short, and I need to be stricter there. I agree: I liked JaroWinklerTokenized most, and it worked pretty well with full names. But I am trying to make my matching stricter, so I have switched to comparing first and last names separately, and on short first names it is too optimistic, often giving me a positive answer where I need a no-match, e.g. Dave vs. Dale. Levenshtein is not working that well for those short names either, as the distance is only 1 or 2. What I liked about the PersonNameComparator is that it still used Levenshtein but adjusted the metric for short terms, and that seems to address my challenge.
Do you have any better suggestions? I don't want to switch to exact matching, but I would like to make either Levenshtein or JaroWinkler stricter on short terms.
Thanks
Original comment by yoxel.sy...@gmail.com
on 28 Oct 2011 at 7:14
Actually I'll give you a few examples:
Levenshtein:
Alex vs. Alexey - 0.55 (out of 0.6), I like this
Paul vs. Mr. Paul - no-match, I'd like a match here but it is ok, I probably
need to clean the Mr.
Joseph vs. John D. - no-match, I would not mind a match here, but it is good
Jessica vs. Jessica Redmond - no-match, I would not mind a match, but it is good
Sam vs. Samuel - no-match, I would prefer a match here
Dale vs. Dawn - 0.55 (out of 0.6), almost an exact match and I don't like this!
Solman vs. Lonnette - no match, here it is doing what I want
coakley iii vs. coakley - no match, I would prefer some match
JaroWinklerTokenized:
Alex vs. Alexey - 0.59, I like this but too high
Paul vs. Mr. Paul - 0.6, I like it but too high (non-tokenized gives no-match)
Joseph vs. John D. - 0.57, I like this, but it is too high on the last names of
these people (Cassata vs. Cannon): 0.59 out of 0.65
Jessica vs. Jessica Redmond - 0.6, I guess due to tokenizing
Sam vs. Samuel - 0.59, I like this
James vs. Sue - 0.55, seems too high! Levenshtein was better - no match
judy vs. Jim - 0.56, too high for my needs! Levenshtein was better - no match
Dale vs. Dawn - 0.57, too high for my needs! Levenshtein had the same issue!
Solman vs. Lonnette - 0.55, I'd like no-match here. Levenshtein was better - no
match
wayne vs. claude - 0.56, seems too high! Levenshtein was better - no match
john vs. jake - 0.55. maybe ok?
brooks vs. b - 0.61 (out of 0.65) I kind of liked that (b - initial) but this
may not be good on many other cases
So what I see is that JaroWinkler is too optimistic for me. I like Levenshtein better, but I'd like it to be more pessimistic on short names. Basically, if the length of the term could be weighed in, that could help.
Then I either need to clean the names to remove the various Mr., iii, ... or have a TokenizedLevenshtein that would automatically take care of the uneven number of tokens, even for Jessica vs. Jessica Redmond, where Redmond is a middle name.
Any ideas on how I could achieve that? Maybe I just use the adjustments that you put in PersonNameComparator.
I also liked that you tried to match startsWith firstName and initials there.
All makes sense to me.
Thanks
Original comment by yoxel.sy...@gmail.com
on 28 Oct 2011 at 7:53
From the Levenshtein point of view:
> Sam vs. Samuel - no-match, I would prefer a match here
> Dale vs. Dawn - 0.55 (out of 0.6), almost an exact match and I don't like this
the former is a 3 out of 3 difference, whereas the latter is a 2 out of 4
difference. No wonder that Levenshtein prefers the latter.
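To make the arithmetic concrete, here is a small self-contained sketch. It assumes the similarity is computed as 1 minus the edit distance divided by the length of the shorter string, which matches the "3 out of 3" and "2 out of 4" reading above. The class name `EditDistanceDemo` is made up for illustration and is not part of the library.

```java
// Illustration only: EditDistanceDemo is a hypothetical class, not library code.
final class EditDistanceDemo {

    // Classic dynamic-programming Levenshtein distance
    // (insertions, deletions and substitutions all cost 1).
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Assumed normalization: distance relative to the shorter string,
    // capped so the result stays within [0, 1].
    static double similarity(String a, String b) {
        if (a.equals(b)) {
            return 1.0;
        }
        int shorter = Math.min(a.length(), b.length());
        if (shorter == 0) {
            return 0.0;
        }
        int d = Math.min(distance(a, b), shorter);
        return 1.0 - (double) d / shorter;
    }
}
```

Under that assumption, "sam" vs. "samuel" has distance 3 against a shorter length of 3, so the similarity is 0, while "dale" vs. "dawn" has distance 2 against a length of 4, so the similarity is 0.5.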
What you can do is to teach the PersonNameComparator that Sam is a common
contraction for Samuel. We could try to build up a list of such common
correspondences.
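A list of correspondences could be as simple as a lookup table that maps each short form to a canonical given name before comparing. The sketch below is hypothetical: `NicknameTable` and its three entries are invented for illustration, and a real list would have to be curated.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of the library.
final class NicknameTable {

    // Hand-picked example entries (assumptions, not an authoritative list).
    private static final Map<String, String> CANONICAL = new HashMap<>();
    static {
        CANONICAL.put("sam", "samuel");
        CANONICAL.put("dave", "david");
        CANONICAL.put("jim", "james");
    }

    // Returns the canonical form if the name is a known short form,
    // otherwise the lower-cased, trimmed name itself.
    static String canonical(String name) {
        String key = name.trim().toLowerCase(Locale.ROOT);
        return CANONICAL.getOrDefault(key, key);
    }

    // True when both names resolve to the same canonical given name.
    static boolean sameGivenName(String a, String b) {
        return canonical(a).equals(canonical(b));
    }
}
```

With this table, "Sam" and "Samuel" resolve to the same name, while "Dale" and "Dave" do not.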
Another thing you can do is to modify Levenshtein so that it does become more
demanding on short strings. Shouldn't be hard to either put in a hard limit or
modify the formula. Just make your own subclass and do it. The actual edit
distance comparison is a static method you can call from your own comparator.
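Both variants can be sketched in a few lines. The code below is self-contained and does not use the library's own classes; `StrictShortLevenshtein`, the `SHORT_LENGTH` threshold of 5, and the squaring penalty are all assumptions chosen for illustration. In a real subclass you would call the existing static edit-distance method instead of the local `distance`.

```java
// Hypothetical sketch of a Levenshtein variant that is stricter on short strings.
final class StrictShortLevenshtein {

    // Assumption: strings of up to 5 characters count as "short".
    static final int SHORT_LENGTH = 5;

    // Plain Levenshtein distance, included only to keep the sketch self-contained.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Unadjusted similarity: 1 - distance / length of the shorter string.
    static double base(String a, String b) {
        if (a.equals(b)) {
            return 1.0;
        }
        int shorter = Math.min(a.length(), b.length());
        if (shorter == 0) {
            return 0.0;
        }
        return 1.0 - (double) Math.min(distance(a, b), shorter) / shorter;
    }

    // Variant 1, hard limit: short strings must match exactly.
    static double hardLimit(String a, String b) {
        boolean isShort = Math.min(a.length(), b.length()) <= SHORT_LENGTH;
        return (isShort && !a.equals(b)) ? 0.0 : base(a, b);
    }

    // Variant 2, modified formula: square the score for short strings,
    // which pushes partial matches down but leaves exact matches at 1.0.
    static double scaled(String a, String b) {
        double score = base(a, b);
        boolean isShort = Math.min(a.length(), b.length()) <= SHORT_LENGTH;
        return isShort ? score * score : score;
    }
}
```

With these settings "dale" vs. "dawn" drops from 0.5 to 0.25 under the scaled formula and to 0 under the hard limit, while a longer pair such as "jessica" vs. "jessika" keeps its unadjusted score.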
> Then I either need to clean the names
You definitely need to clean the names. Handling this kind of thing in the
comparators is both wrong and slow.
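As a rough idea of what such a cleaning step might look like, the sketch below lower-cases a name, splits it on whitespace, periods and commas, and drops tokens from a small noise list. `NameCleaner` and its noise list are hypothetical and deliberately incomplete.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical cleaning step, run on the data before any comparator sees it.
final class NameCleaner {

    // Assumed noise tokens: a few titles and generational suffixes.
    private static final Set<String> NOISE = new HashSet<>(Arrays.asList(
            "mr", "mrs", "ms", "dr", "jr", "sr", "ii", "iii", "iv"));

    // Lower-cases the name, splits on whitespace, periods and commas,
    // drops noise tokens, and joins what is left with single spaces.
    static String clean(String raw) {
        StringBuilder out = new StringBuilder();
        for (String token : raw.toLowerCase(Locale.ROOT).split("[\\s.,]+")) {
            if (token.isEmpty() || NOISE.contains(token)) {
                continue;
            }
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(token);
        }
        return out.toString();
    }
}
```

Applied to the examples from this thread, "Mr. Paul" becomes "paul" and "coakley iii" becomes "coakley", so both pairs turn into exact matches before any fuzzy comparison runs.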
I hope this helps.
Original comment by lar...@gmail.com
on 31 Oct 2011 at 9:43
I created my LevenshteinTokenized comparator (tokenized like in JaroWinklerTokenized) with adjustments for short terms and term1 startsWith term2 (like in your PersonNameComparator). Works like a charm :)
I don't have to do cleaning really because the tokenization takes care of that.
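For anyone who lands here later, this is roughly the shape of the idea, reconstructed as a hedged sketch rather than the actual code: every name, threshold and score below (`TokenizedLevenshteinSketch`, `SHORT_LENGTH`, `PREFIX_SCORE`, the squaring penalty, the best-match averaging) is an assumption made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch of a tokenized Levenshtein comparator.
final class TokenizedLevenshteinSketch {

    static final int SHORT_LENGTH = 5;       // assumed cutoff for "short" tokens
    static final double PREFIX_SCORE = 0.9;  // assumed score when one token starts with the other

    // Plain Levenshtein distance between two tokens.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Splits on whitespace, periods and commas; empty tokens are dropped.
    static List<String> tokens(String s) {
        List<String> result = new ArrayList<>();
        for (String t : s.toLowerCase(Locale.ROOT).split("[\\s.,]+")) {
            if (!t.isEmpty()) {
                result.add(t);
            }
        }
        return result;
    }

    // Score for one pair of non-empty tokens: exact match, then prefix or
    // initial match, then Levenshtein with a squared penalty on short tokens.
    static double tokenScore(String a, String b) {
        if (a.equals(b)) {
            return 1.0;
        }
        if (a.startsWith(b) || b.startsWith(a)) {
            return PREFIX_SCORE;
        }
        int shorter = Math.min(a.length(), b.length());
        double base = 1.0 - (double) Math.min(distance(a, b), shorter) / shorter;
        return shorter <= SHORT_LENGTH ? base * base : base;
    }

    // Each token of the side with fewer tokens is paired with its best match
    // on the other side; the result is the average of those best scores.
    // Extra tokens on the longer side (titles, middle names) are ignored.
    static double compare(String s1, String s2) {
        List<String> t1 = tokens(s1);
        List<String> t2 = tokens(s2);
        if (t1.isEmpty() || t2.isEmpty()) {
            return 0.0;
        }
        List<String> fewer = t1.size() <= t2.size() ? t1 : t2;
        List<String> more = (fewer == t1) ? t2 : t1;
        double sum = 0.0;
        for (String f : fewer) {
            double best = 0.0;
            for (String m : more) {
                best = Math.max(best, tokenScore(f, m));
            }
            sum += best;
        }
        return sum / fewer.size();
    }
}
```

Note that this simple best-match pairing lets two tokens on one side match the same token on the other, which a stricter version would probably want to prevent.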
Thanks
Original comment by ale...@yoxel.com
on 31 Oct 2011 at 3:57
Very good. I assume this means the issue is solved.
Original comment by lar...@gmail.com
on 4 Nov 2011 at 10:16
Yes, thank you.
Original comment by ale...@yoxel.com
on 4 Nov 2011 at 4:16
Original issue reported on code.google.com by
ale...@yoxel.com
on 28 Oct 2011 at 12:06