datatonic / duke

Automatically exported from code.google.com/p/duke

PersonNameComparator: handling of short words #45

Closed GoogleCodeExporter closed 8 years ago

GoogleCodeExporter commented 8 years ago
I may be wrong but it seems the PersonNameComparator has a couple of bugs:

1. Execution never actually reaches the else-if responsible for handling
short tokens (line 88):

        } else if (t1[ix].length() + t2[ix].length() <= 4)
          // it's not an initial, so if the strings are 4 characters
          // or less, we quadruple the edit dist
          d = d * 4;
        else
        else

2. In line 72, shouldn't t1.length be t2.length? t1 is always the longer
token array:

        } else if (d > 1 && (ix + 1) <= t1.length)

What do you think?

Original issue reported on code.google.com by ale...@yoxel.com on 28 Oct 2011 at 12:06

GoogleCodeExporter commented 8 years ago
You are quite right. I started on the PersonNameComparator, but never actually 
finished it, because I lost faith in it. It seemed to become a mass of special 
cases with no real justification or overarching theory. In practice, it was 
superseded by JaroWinkler and Levenshtein, which I think do a better job.

I'm tempted to just delete the whole class, and instead focus on 
industry-standard comparators.

What do you think?

Original comment by lar...@gmail.com on 28 Oct 2011 at 6:40

GoogleCodeExporter commented 8 years ago
I am hitting a few cases where the difference is just in the first names, those names are short, and I need to be stricter there. I agree, I liked JaroWinklerTokenized most; it worked pretty well with full names. But I am trying to make my matching stricter, and have now switched to comparing first and last names separately. On short first names it is too optimistic, often giving me a positive answer where I need a no-match, e.g. Dave vs. Dale. Levenshtein is not working that well for those short names either, as the distance is only 1 or 2. What I liked about the PersonNameComparator is that it still used Levenshtein but adjusted the metric for short terms, and that seems to address my challenge.

Do you have any better suggestions? I don't want to switch to exact matching, but would like to make either Levenshtein or JaroWinkler stricter on short terms.

Thanks

Original comment by yoxel.sy...@gmail.com on 28 Oct 2011 at 7:14

GoogleCodeExporter commented 8 years ago
Actually I'll give you a few examples:

Levenshtein:

Alex vs. Alexey  - 0.55 (out of 0.6), I like this
Paul vs. Mr. Paul - no-match, I'd like a match here, but it is ok; I probably 
need to clean out the Mr.
Joseph vs. John D. - no-match, I would not mind a match here, but it is good
Jessica vs. Jessica Redmond - no-match, I would not mind a match, but it is good
Sam vs. Samuel - no-match, I would prefer a match here
Dale vs. Dawn - 0.55 (out of 0.6), almost an exact match and I don't like this!
Solman vs. Lonnette - no match, here it is doing what I want
coakley iii vs. coakley - no match, I would prefer some match

JaroWinklerTokenized:

Alex vs. Alexey - 0.59, I like this but too high
Paul vs. Mr. Paul - 0.6, I like it but too high (non-tokenized gives no-match)
Joseph vs. John D. - 0.57, I like this but too high; on the last names of these 
people (Cassata vs. Cannon) it gives 0.59 out of 0.65
Jessica vs. Jessica Redmond - 0.6, I guess due to tokenizing
Sam vs. Samuel - 0.59, I like this
James vs. Sue - 0.55, seems too high! Levenshtein was better - no match
judy vs. Jim - 0.56, too high for my needs! Levenshtein was better - no match
Dale vs. Dawn - 0.57, too high for my needs! Levenshtein had the same issue!
Solman vs. Lonnette - 0.55, I'd like no-match here. Levenshtein was better - no 
match
wayne vs. claude - 0.56, seems too high! Levenshtein was better - no match
john vs. jake - 0.55, maybe ok?
brooks vs. b - 0.61 (out of 0.65), I kind of liked that (b = initial), but this 
may not be good in many other cases

So what I see is that JaroWinkler is too optimistic for me. I like Levenshtein better, but I'd like it to be more pessimistic on short names. Basically, if the length of the term could be weighed in, that could help.

Then I either need to clean the names to remove the various Mr., iii, ... or have a TokenizedLevenshtein which would automatically take care of the uneven number of tokens, even Jessica vs. Jessica Redmond <- middle name.

Any ideas on how I could achieve that? Maybe I just use the adjustments that you put in the PersonNameComparator. 
I also liked that you tried to match startsWith on the first name and on initials there. 
All makes sense to me.

Thanks

Original comment by yoxel.sy...@gmail.com on 28 Oct 2011 at 7:53

GoogleCodeExporter commented 8 years ago
From the Levenshtein point of view:

> Sam vs. Samuel - no-match, I would prefer a match here
> Dale vs. Dawn - 0.55 (out of 0.6), almost an exact match and I don't like this

the former is a 3-out-of-3 difference, whereas the latter is a 2-out-of-4 
difference. No wonder Levenshtein prefers the latter.
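[Editor's note: the arithmetic above can be checked with a plain dynamic-programming edit distance. This is a standalone sketch, not Duke's own Levenshtein class; "sam" to "samuel" takes 3 insertions against a shortest string of length 3, while "dale" to "dawn" takes 2 substitutions against strings of length 4:]

```java
public class EditDistanceDemo {
  // standard dynamic-programming Levenshtein edit distance
  public static int distance(String s1, String s2) {
    int[][] d = new int[s1.length() + 1][s2.length() + 1];
    for (int i = 0; i <= s1.length(); i++) d[i][0] = i;
    for (int j = 0; j <= s2.length(); j++) d[0][j] = j;
    for (int i = 1; i <= s1.length(); i++)
      for (int j = 1; j <= s2.length(); j++) {
        int cost = s1.charAt(i - 1) == s2.charAt(j - 1) ? 0 : 1;
        d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                           d[i - 1][j - 1] + cost);
      }
    return d[s1.length()][s2.length()];
  }

  public static void main(String[] args) {
    System.out.println(distance("sam", "samuel"));  // 3
    System.out.println(distance("dale", "dawn"));   // 2
  }
}
```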

What you can do is to teach the PersonNameComparator that Sam is a common 
contraction for Samuel. We could try to build up a list of such common 
correspondences.

Another thing you can do is to modify Levenshtein so that it does become more 
demanding on short strings. Shouldn't be hard to either put in a hard limit or 
modify the formula. Just make your own subclass and do it. The actual edit 
distance comparison is a static method you can call from your own comparator.
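[Editor's note: a comparator along those lines might look roughly like this sketch. Everything here is an assumption, not Duke's API: the distance method is a plain reimplementation standing in for the static edit-distance method mentioned above, the length-4 cutoff and doubling factor are illustrative knobs, and the min-length normalization is just one way to turn a distance into a score. Note it also kills Sam vs. Samuel, so a real version would want a prefix rule as well:]

```java
public class StrictLevenshtein {
  // plain dynamic-programming edit distance; in Duke you would call the
  // library's static edit-distance method instead of reimplementing it
  public static int distance(String s1, String s2) {
    int[][] d = new int[s1.length() + 1][s2.length() + 1];
    for (int i = 0; i <= s1.length(); i++) d[i][0] = i;
    for (int j = 0; j <= s2.length(); j++) d[0][j] = j;
    for (int i = 1; i <= s1.length(); i++)
      for (int j = 1; j <= s2.length(); j++) {
        int cost = s1.charAt(i - 1) == s2.charAt(j - 1) ? 0 : 1;
        d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                           d[i - 1][j - 1] + cost);
      }
    return d[s1.length()][s2.length()];
  }

  // similarity in [0, 1]: doubles the distance when the shorter string
  // is 4 characters or less, so Dale vs. Dawn drops to a no-match
  public static double compare(String s1, String s2) {
    int d = distance(s1, s2);
    if (Math.min(s1.length(), s2.length()) <= 4)
      d *= 2;  // illustrative penalty for short names
    int shortest = Math.min(s1.length(), s2.length());
    return Math.max(0.0, 1.0 - (double) d / shortest);
  }
}
```

With this, compare("dale", "dawn") goes from 0.5 to 0.0, while longer near-matches such as "jessica" vs. "jessika" stay high.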

> Then I either need to clean the names 

You definitely need to clean the names. Handling this kind of thing in the 
comparators is both wrong and slow.

I hope this helps.

Original comment by lar...@gmail.com on 31 Oct 2011 at 9:43

GoogleCodeExporter commented 8 years ago
I created my own LevenshteinTokenized comparator (tokenized like JaroWinklerTokenized) with adjustments for short terms and for term1 startsWith term2 (like in your PersonNameComparator). Works like a charm :)
I don't really have to do any cleaning, because the tokenization takes care of that.
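[Editor's note: the commenter's class was never published, but a comparator along those lines might look roughly like this sketch. All details are assumptions: the naive pair-wise token alignment after a whitespace split, the 0.9 prefix reward standing in for the startsWith handling, and the short-token doubling standing in for the short-term adjustment:]

```java
public class LevenshteinTokenized {
  // plain dynamic-programming edit distance
  static int distance(String s1, String s2) {
    int[][] d = new int[s1.length() + 1][s2.length() + 1];
    for (int i = 0; i <= s1.length(); i++) d[i][0] = i;
    for (int j = 0; j <= s2.length(); j++) d[0][j] = j;
    for (int i = 1; i <= s1.length(); i++)
      for (int j = 1; j <= s2.length(); j++) {
        int cost = s1.charAt(i - 1) == s2.charAt(j - 1) ? 0 : 1;
        d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                           d[i - 1][j - 1] + cost);
      }
    return d[s1.length()][s2.length()];
  }

  // score one token pair
  static double tokenScore(String a, String b) {
    if (a.equals(b))
      return 1.0;
    String shorter = a.length() <= b.length() ? a : b;
    String longer  = a.length() <= b.length() ? b : a;
    // prefix handling: Sam vs. Samuel, or an initial like b vs. brooks
    if (longer.startsWith(shorter))
      return 0.9;  // assumed reward, tune to taste
    int d = distance(a, b);
    if (shorter.length() <= 4)
      d *= 2;  // stricter on short names like Dale vs. Dawn
    return Math.max(0.0, 1.0 - (double) d / shorter.length());
  }

  public static double compare(String n1, String n2) {
    String[] t1 = n1.toLowerCase().split("\\s+");
    String[] t2 = n2.toLowerCase().split("\\s+");
    if (t1.length > t2.length) { String[] tmp = t1; t1 = t2; t2 = tmp; }
    // compare pair-wise up to the shorter token list; surplus tokens
    // (middle names, suffixes like "iii") are ignored, so
    // Jessica vs. Jessica Redmond still comes out as a match
    double sum = 0.0;
    for (int i = 0; i < t1.length; i++)
      sum += tokenScore(t1[i], t2[i]);
    return sum / t1.length;
  }

  public static void main(String[] args) {
    System.out.println(compare("Jessica", "Jessica Redmond")); // 1.0
    System.out.println(compare("Dale", "Dawn"));               // 0.0
    System.out.println(compare("Sam", "Samuel"));              // 0.9
  }
}
```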
Thanks

Original comment by ale...@yoxel.com on 31 Oct 2011 at 3:57

GoogleCodeExporter commented 8 years ago
Very good. I assume this means the issue is solved.

Original comment by lar...@gmail.com on 4 Nov 2011 at 10:16

GoogleCodeExporter commented 8 years ago
Yes, thank you.

Original comment by ale...@yoxel.com on 4 Nov 2011 at 4:16